Preface


This  book  provides  a  detailed  treatment  of  microeconometric  analysis,  the  analysis  of 
individual-level  data  on  the  economic  behavior  of  individuals  or  firms.  This  type  of 
analysis  usually  entails  applying  regression  methods  to  cross-section  and  panel  data. 

The book aims to provide the practitioner with comprehensive coverage of statistical
methods and their application in modern applied microeconometrics research.
These  methods  include  nonlinear  modeling,  inference  under  minimal  distributional 
assumptions,  identifying  and  measuring  causation  rather  than  mere  association,  and 
correcting  departures  from  simple  random  sampling.  Many  of  these  features  are  of 
relevance  to  individual-level  data  analysis  throughout  the  social  sciences. 

The ambitious agenda has determined the characteristics of this book. First, although
oriented to the practitioner, the book is relatively advanced in places. A cookbook
approach is inadequate because when two or more complications occur simultaneously
- a common situation - the practitioner must know enough to be able to adapt
available methods. Second, the book provides considerable coverage of practical data
problems (see especially the last three chapters). Third, the book includes substantial
empirical examples in many chapters to illustrate some of the methods covered. Finally,
the book is unusually long. Despite this length we have been space-constrained.
We had intended to include even more empirical examples, and abbreviated presentations
will at times fail to recognize the accomplishments of researchers who have
made  substantive  contributions. 

The  book  assumes  a  good  understanding  of  the  linear  regression  model  with  matrix 
algebra. It is written at the mathematical level of the first-year economics Ph.D.
sequence, comparable to Greene (2003). We have two types of readers in mind. First, the
book  can  be  used  as  a  course  text  for  a  microeconometrics  course,  typically  taught  in 
the  second  year  of  the  Ph.D.,  or  for  data-oriented  microeconomics  field  courses  such 
as  labor  economics,  public  economics,  and  industrial  organization.  Second,  the  book 
can  be  used  as  a  reference  work  for  graduate  students  and  applied  researchers  who 
despite  training  in  microeconometrics  will  inevitably  have  gaps  that  they  wish  to  fill. 

For  instructors  using  this  book  as  an  econometrics  course  text  it  is  best  to  introduce 
the basic nonlinear cross-section and linear panel data models as early as possible,
initially skipping many of the methods chapters. The key methods chapter (Chapter 5)
covers  maximum-likelihood  and  nonlinear  least-squares  estimation.  Knowledge  of 
maximum likelihood and nonlinear least-squares estimators provides adequate background
for the most commonly used nonlinear cross-section models (Chapters 14-17
and 20), basic linear panel data models (Chapter 21), and treatment evaluation methods
(Chapter 25). Generalized method of moments estimation (Chapter 6) is needed
especially  for  advanced  linear  panel  data  methods  (Chapter  22). 

For  readers  using  this  book  as  a  reference  work,  the  chapters  have  been  written  to  be 
as  self-contained  as  possible.  The  notable  exception  is  that  some  command  of  general 
estimation  results  in  Chapter  5,  and  occasionally  Chapter  6,  will  be  necessary.  Most 
chapters on models are structured to begin with a discussion and example that is accessible
to a wide audience.

The  Web  site  www.econ.ucdavis.edu/faculty/cameron  provides  all  the  data  and 
computer  programs  used  in  this  book  and  related  materials  useful  for  instructional 
purposes. 

This  project  has  been  long  and  arduous,  and  at  times  seemingly  without  an  end.  Its 
completion  has  been  greatly  aided  by  our  colleagues,  friends,  and  graduate  students. 
We  thank  especially  the  following  for  reading  and  commenting  on  specific  chapters: 
Bijan  Borah,  Kurt  Brannas,  Pian  Chen,  Tim  Cogley,  Partha  Deb,  Massimiliano  De 
Santis,  David  Drukker,  Jeff  Gill,  Tue  Gorgens,  Shiferaw  Gurmu,  Lu  Ji,  Oscar  Jorda, 
Roger  Koenker,  Chenghui  Li,  Tong  Li,  Doug  Miller,  Murat  Munkin,  Jim  Prieger, 
Ahmed  Rahmen,  Sunil  Sapra,  Haruki  Seitani,  Yacheng  Sun,  Xiaoyong  Zheng,  and 
David  Zimmer.  Pian  Chen  gave  detailed  comments  on  most  of  the  book.  We  thank 
Rajeev  Dehejia,  Bronwyn  Hall,  Cathy  Kling,  Jeffrey  Kling,  Will  Manning,  Brian 
McCall,  and  Jim  Ziliak  for  making  their  data  available  for  empirical  illustrations.  We 
thank our respective departments for facilitating our collaboration and for the production
and distribution of the draft manuscript at various stages. We benefited from the
comments  of  two  anonymous  reviewers.  Guidance,  advice,  and  encouragement  from 
our  Cambridge  editor,  Scott  Parris,  have  been  invaluable. 

Our interest in econometrics owes much to the training and environments we encountered
as students and in the initial stages of our academic careers. The first author
thanks The Australian National University; Stanford University, especially Takeshi
Amemiya and Tom MaCurdy; and The Ohio State University. The second author thanks
the  London  School  of  Economics  and  The  Australian  National  University. 

Our  interest  in  writing  a  book  oriented  to  the  practitioner  owes  much  to  our  exposure 
to the research of graduate students and colleagues at our respective institutions,
UC-Davis and IU-Bloomington.

Finally,  we  thank  our  families  for  their  patience  and  understanding  without  which 
completion  of  this  project  would  not  have  been  possible. 

A.  Colin  Cameron 
Davis,  California 

Pravin  K.  Trivedi 
Bloomington,  Indiana 




PART  ONE 


Preliminaries 


Part  1  covers  the  essential  components  of  microeconometric  analysis  -  an  economic 
specification,  a  statistical  model  and  a  data  set. 

Chapter  1  discusses  the  distinctive  aspects  of  microeconometrics,  and  provides  an 
outline of the book. It emphasizes that discreteness of data, and nonlinearity and heterogeneity
of behavioral relationships are key aspects of individual-level microeconometric
models. It concludes by presenting the notation and conventions used throughout
the book.

Chapters  2  and  3  set  the  scene  for  the  remainder  of  the  book  by  introducing  the 
reader  to  key  model  and  data  concepts  that  shape  the  analyses  of  later  chapters. 

A  key  distinction  in  econometrics  is  between  essentially  descriptive  models  and 
data summaries at various levels of statistical sophistication and models that go beyond
associations and attempt to estimate causal parameters. The classic definitions
of causality in econometrics derive from the Cowles Commission simultaneous equations
models that draw sharp distinctions between exogenous and endogenous variables,
and between structural and reduced form parameters. Although reduced form
models are very useful for some purposes, knowledge of structural or causal parameters
is essential for policy analyses. Identification of structural parameters within the
simultaneous equations framework poses numerous conceptual and practical difficulties.
An increasingly used alternative approach, based on the potential outcome model,
also attempts to identify causal parameters, but it does so by posing limited questions
within a more manageable framework. Chapter 2 attempts to provide an overview of
the  fundamental  issues  that  arise  in  these  and  other  alternative  frameworks.  Readers 
who  initially  find  this  material  challenging  should  return  to  this  chapter  after  gaining 
greater  familiarity  with  specific  models  covered  later  in  the  book. 

The  empirical  researcher’s  ability  to  identify  causal  parameters  depends  not  only 
on the statistical tools and models but also on the type of data available. An experimental
framework provides a standard for establishing causal connections. However,
observational,  not  experimental,  data  form  the  basis  of  much  of  econometric  inference. 
Chapter  3  surveys  the  pros  and  cons  of  three  main  types  of  data:  observational  data, 
data  from  social  experiments,  and  data  from  natural  experiments.  The  strengths  and 
weaknesses  of  conducting  causal  inference  based  on  each  type  of  data  are  reviewed. 
CHAPTER 1


Overview 


1.1.  Introduction 

This  book  provides  a  detailed  treatment  of  microeconometric  analysis,  the  analysis 
of  individual-level  data  on  the  economic  behavior  of  individuals  or  firms.  A  broader 
definition  would  also  include  grouped  data.  Usually  regression  methods  are  applied  to 
cross-section  or  panel  data. 

Analysis  of  individual  data  has  a  long  history.  Ernst  Engel  (1857)  was  among  the 
earliest  quantitative  investigators  of  household  budgets.  Allen  and  Bowley  (1935), 
Houthakker  (1957),  and  Prais  and  Houthakker  (1955)  made  important  contributions 
following  the  same  research  and  modeling  tradition.  Other  landmark  studies  that  were 
also  influential  in  stimulating  the  development  of  microeconometrics,  even  though 
they  did  not  always  use  individual-level  information,  include  those  by  Marschak  and 
Andrews  (1944)  in  production  theory  and  by  Wold  and  Jureen  (1953),  Stone  (1953), 
and  Tobin  (1958)  in  consumer  demand. 

Important as the work cited above on household budgets and demand analysis is,
the material covered in this book has stronger connections with the work on
discrete choice analysis and censored and truncated variable models that saw their first
serious econometric applications in the work of McFadden (1973, 1984) and Heckman
(1974, 1979), respectively. These works involved a major departure from the overwhelming
reliance on linear models that characterized earlier work. Subsequently, they
have led to significant methodological innovations in econometrics. Among the earlier
textbook-level treatments of this material (and more) are the works of Maddala (1983)
and Amemiya (1985). As emphasized by Heckman (2001), McFadden (2001), and others,
many of the fundamental issues that dominated earlier work based on market data
remain  important,  especially  concerning  the  conditions  necessary  for  identifiability  of 
causal  economic  relations.  Nonetheless,  the  style  of  microeconometrics  is  sufficiently 
distinct  to  justify  writing  a  text  that  is  exclusively  devoted  to  it. 

Modern microeconometrics based on individual-, household-, and establishment-level
data owes a great deal to the greater availability of data from cross-section
and longitudinal sample surveys and census data. In the past two decades, with the
expansion of electronic recording and collection of data at the individual level, data
volume  has  grown  explosively.  So  too  has  the  available  computing  power  for  analyzing 
large  and  complex  data  sets.  In  many  cases  event-level  data  are  available;  for  example, 
marketing  science  often  deals  with  purchase  data  collected  by  electronic  scanners  in 
supermarkets,  and  industrial  organization  literature  contains  econometric  analyses  of 
airline  travel  data  collected  by  online  booking  systems.  There  are  now  new  branches  of 
economics,  such  as  social  experimentation  and  experimental  economics,  that  generate 
“experimental” data. These developments have created many new modeling opportunities
that are absent when only aggregated market-level data are available. Meanwhile
the explosive growth in the volume and types of data has also given rise to numerous
methodological issues. Processing and econometric analysis of such large microdatabases,
with the objective of uncovering patterns of economic behavior, constitutes the
core  of  microeconometrics.  Econometric  analysis  of  such  data  is  the  subject  matter  of 
this  book. 

Key  precursors  of  this  book  are  the  books  by  Maddala  (1983)  and  Amemiya  (1985). 
Like  them  it  covers  topics  that  are  presented  only  briefly,  or  not  at  all,  in  undergraduate 
and  first-year  graduate  econometrics  courses.  Especially  compared  to  Amemiya  (1985) 
this  book  is  more  oriented  to  the  practitioner.  The  level  of  presentation  is  nonetheless 
advanced in places, especially for applied researchers in disciplines that are less mathematically
oriented than economics.

A  relatively  advanced  presentation  is  needed  for  several  reasons.  First,  the  data  are 
often  discrete  or  censored,  in  which  case  nonlinear  methods  such  as  logit,  probit, 
and  Tobit  models  are  used.  This  leads  to  statistical  inference  based  on  more  difficult 
asymptotic  theory. 

Second,  distributional  assumptions  for  such  data  become  critically  important.  One 
response  is  to  develop  highly  parametric  models  that  are  sufficiently  detailed  to  capture 
the complexities of data, but these models can be challenging to estimate. A more common
response is to minimize parametric assumptions and perform statistical inference
based  on  standard  errors  that  are  “robust”  to  complications  such  as  heteroskedasticity 
and  clustering.  In  such  cases  considerable  knowledge  can  be  needed  to  ensure  valid 
statistical  inference  even  if  a  standard  regression  package  is  used. 

Third, economic studies often aim to determine causation rather than merely measure
correlation, despite access to observational rather than experimental data. This
leads  to  methods  to  isolate  causation  such  as  instrumental  variables,  simultaneous 
equations,  measurement  error  correction,  selection  bias  correction,  panel  data  fixed 
effects,  and  differences-in-differences. 

Fourth,  microeconomic  data  are  typically  collected  using  cross-section  and  panel 
surveys,  censuses,  or  social  experiments.  Survey  data  collected  using  these  methods 
are subject to problems of complex survey methodology, departures from simple random
sampling assumptions, and problems of sample selection, measurement errors,
and incomplete and/or missing data. Dealing with such issues in a way that can support
valid population inferences from the estimated econometric models
requires use of advanced methods.

Finally,  it  is  not  unusual  that  two  or  more  complications  occur  simultaneously, 
such  as  endogeneity  in  a  logit  model  with  panel  data.  Then  a  cookbook  approach 
becomes very difficult to implement. Instead, considerable understanding of the theory
underlying the methods is needed, as the researcher may need to read econometrics
journal  articles  and  adapt  standard  econometrics  software. 


1.2.  Distinctive  Aspects  of  Microeconometrics 

We now consider several advantages of microeconometrics that derive from its distinctive
features.


1.2.1.  Discreteness  and  Nonlinearity 

The  first  and  most  obvious  point  is  that  microeconometric  data  are  usually  at  a  low 
level  of  aggregation.  This  has  a  major  consequence  for  the  functional  forms  used  to 
analyze  the  variables  of  interest.  In  many,  if  not  most,  cases  linear  functional  forms 
turn  out  to  be  simply  inappropriate.  More  fundamentally,  disaggregation  brings  to  the 
forefront heterogeneity of individuals, firms, and organizations that should be properly
controlled (modeled) if one wants to make valid inferences about the underlying
relationships.  We  discuss  these  issues  in  greater  detail  in  the  following  sections. 

Although  aggregation  is  not  entirely  absent  in  microdata,  as  for  example  when 
household- or establishment-level data are collected, the level of aggregation is usually
orders of magnitude lower than is common in macro analyses. In the latter case the
process of aggregation leads to smoothing, with many of the movements in opposite
directions canceling in the course of summation. The aggregated variables often show
smoother behavior than their components, and the relationships between the aggregates
frequently show greater smoothness than those between the components. For example, a relation
between two variables at a micro level may be piecewise linear with many nodes.
After aggregation the relationship is likely to be well approximated by a smooth function.
Hence an immediate consequence of disaggregation is the absence of features of
continuity  and  smoothness  both  of  the  variables  themselves  and  of  the  relationships 
between  them. 
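
The smoothing effect of aggregation can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the text: each micro unit has a piecewise-linear response with a single kink, the kink points are spread uniformly over the unit interval, and the names `micro_response` and `aggregate_response` are hypothetical. Under these assumptions the population average of the kinked responses is the smooth function x^2/2.

```python
import random

def micro_response(x, threshold):
    """Piecewise-linear individual response: zero up to the kink, then slope one."""
    return max(0.0, x - threshold)

def aggregate_response(x, thresholds):
    """Average of the individual responses when every unit faces the same x."""
    return sum(micro_response(x, t) for t in thresholds) / len(thresholds)

rng = random.Random(0)                             # fixed seed for reproducibility
thresholds = [rng.random() for _ in range(5000)]   # assumed kink points, uniform on [0, 1)

for i in range(11):
    x = i / 10
    one_unit = micro_response(x, thresholds[0])
    average = aggregate_response(x, thresholds)
    # With uniform kink points the population average is x**2 / 2 on [0, 1].
    print(f"x={x:.1f}  one unit={one_unit:.3f}  average={average:.3f}  x^2/2={x * x / 2:.3f}")
```

The single-unit column is flat and then linear, whereas the average column tracks the smooth quadratic closely, even though no individual unit behaves that way.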

Usually  individual-  and  firm-level  data  cover  a  huge  range  of  variation,  both  in  the 
cross-section  and  time-series  dimensions.  For  example,  average  weekly  consumption 
of (say) beef is highly likely to be positive and smoothly varying, whereas that of an individual
household in a given week may be frequently zero and may also switch to positive
values from time to time. The average number of hours worked by female workers
is unlikely to be zero, but many individual females have zero market hours of work
(corner solutions), switching to positive values at other times in the course of their labor
market history. Average household expenditure on vacations is usually positive, but
many individual households may have zero expenditure on vacations in any given year.
Average per capita consumption of tobacco products will usually be positive, but many
individuals in the population have never consumed these products and never will, irrespective
of price and income considerations. As Pudney (1989) has observed, microdata
exhibit “holes, kinks and corners.” The holes correspond to nonparticipation in the
activity of interest, kinks correspond to the switching behavior, and corners correspond
to the incidence of nonconsumption or nonparticipation at specific points of time.
That  is,  discreteness  and  nonlinearity  of  response  are  intrinsic  to  microeconometrics. 
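
The contrast between a positive, stable average and frequent individual zeros can be sketched with simulated data. The data-generating process below is purely an assumption for illustration: each household's weekly purchase is a normally distributed latent amount censored at zero, and the mean and spread are hypothetical values, not estimates from any data set.

```python
import random

rng = random.Random(1)            # fixed seed for reproducibility
n_households, n_weeks = 2000, 52

def weekly_purchase(rng):
    """Observed purchase: a latent desired amount, censored at zero (a corner solution)."""
    latent = rng.gauss(0.5, 1.0)  # hypothetical mean and standard deviation
    return max(0.0, latent)

# panel[h][w] is the purchase of household h in week w
panel = [[weekly_purchase(rng) for _ in range(n_weeks)] for _ in range(n_households)]

weekly_means = [sum(hh[w] for hh in panel) / n_households for w in range(n_weeks)]
share_zero = sum(v == 0.0 for hh in panel for v in hh) / (n_households * n_weeks)

print(f"smallest weekly average purchase: {min(weekly_means):.3f}")
print(f"largest weekly average purchase:  {max(weekly_means):.3f}")
print(f"share of household-weeks at zero: {share_zero:.3f}")
print("first household, first 8 weeks:", [round(v, 2) for v in panel[0][:8]])
```

The weekly averages stay positive and vary little, while a sizable share of the individual household-week observations sit exactly at zero and a single household switches between zero and positive purchases.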

An  important  class  of  nonlinear  models  in  microeconometrics  deals  with  limited 
dependent  variables  (Maddala,  1983).  This  class  includes  many  models  that  provide 
suitable  frameworks  for  analyzing  discrete  responses  and  responses  with  limited  range 
of variation. Such tools of analysis are of course also available for analyzing macrodata,
if required. The point is that they are indispensable in microeconometrics and
give  it  its  distinctive  feature. 


1.2.2.  Greater  Realism 

Macroeconometrics  is  sometimes  based  on  strong  assumptions;  the  representative 
agent  assumption  is  a  leading  example.  A  frequent  appeal  is  made  to  microeconomic 
reasoning to justify certain specifications and interpretations of empirical results. However,
it is rarely possible to say explicitly how these are affected by aggregation over
time  and  micro  units.  Alternatively,  very  extreme  aggregation  assumptions  are  made. 
For  example,  aggregates  are  said  to  reflect  the  behavior  of  a  hypothetical  representative 
agent.  Such  assumptions  also  are  not  credible. 

From  the  viewpoint  of  microeconomic  theory,  quantitative  analysis  founded  on 
microdata  may  be  regarded  as  more  realistic  than  that  based  on  aggregated  data.  There 
are  three  justifications  for  this  claim.  First,  the  measurement  of  the  variables  involved 
in  such  hypotheses  is  often  more  direct  (though  not  necessarily  free  from  measurement 
error)  and  has  greater  correspondence  to  the  theory  being  tested.  Second,  hypotheses 
about  economic  behavior  are  usually  developed  from  theories  of  individual  behavior.  If 
these hypotheses are tested using aggregated data, then many approximations and simplifying
assumptions have to be made. The simplifying assumption of a representative
agent causes a great loss of information and severely limits the scope of an empirical
investigation. Because such assumptions can be avoided in microeconometrics, and
usually are, in principle the microdata provide a more realistic framework for testing
microeconomic hypotheses. This is not a claim that the promise of microdata is necessarily
achieved in empirical work. Such a claim must be assessed on a case-by-case
basis.  Finally,  a  realistic  portrayal  of  economic  activity  should  accommodate  a  broad 
range  of  outcomes  and  responses  that  are  a  consequence  of  individual  heterogeneity 
and  that  are  predicted  by  underlying  theory.  In  this  sense  microeconomic  data  sets  can 
support  more  realistic  models. 

Microeconometric  data  are  often  derived  from  household  or  firm  surveys,  typically 
encompassing a wide range of behavior, with many of the behavioral outcomes taking
the form of discrete or categorical responses. Such data sets have many awkward
features  that  call  for  special  tools  in  the  formulation  and  analysis  that,  although  not 
entirely  absent  from  macroeconometric  work,  nevertheless  are  less  widely  used. 


1.2.3.  Greater  Information  Content 

The potential advantages of microdata sets can be realized if such data are informative.
Because sample surveys often provide independent observations on thousands of
cross-sectional units, such data are thought to be more informative than the standard,
usually  highly  serially  correlated,  macro  time  series  typically  consisting  of  at  most  a 
few  hundred  observations. 

As  will  be  explained  in  the  next  chapter,  in  practice  the  situation  is  not  so  clear-cut 
because  the  microdata  may  be  quite  noisy.  At  the  individual  level  many  (idiosyncratic) 
factors  may  play  a  large  role  in  determining  responses.  Often  these  cannot  be  observed, 
leading  one  to  treat  them  under  the  heading  of  a  random  component,  which  can  be  a 
very  large  part  of  observed  variation.  In  this  sense  randomness  plays  a  larger  role  in 
microdata  than  in  macrodata.  Of  course,  this  affects  measures  of  goodness  of  fit  of  the 
regressions.  Students  whose  initial  exposure  to  econometrics  comes  through  aggregate 
time-series analysis are often conditioned to see large $R^2$ values. When encountering
cross-section  regressions  for  the  first  time,  they  express  disappointment  or  even  alarm 
at  the  “low  explanatory  power”  of  the  regression  equation.  Nevertheless,  there  remains 
a  strong  presumption  that,  at  least  in  certain  dimensions,  large  microdata  sets  are  highly 
informative. 
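
A small simulation sketch shows how the same underlying relationship can produce a low R-squared in individual-level data and a high one after aggregation. All numbers are assumptions chosen for illustration: a common regressor that trends over 40 periods, 500 individuals per period, a slope of 0.5, and large idiosyncratic noise. The helper `r_squared` is a hypothetical name for the usual goodness-of-fit measure in a bivariate regression with a constant.

```python
import random

def r_squared(x, y):
    """R-squared from an OLS regression of y on x and a constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

rng = random.Random(2)           # fixed seed for reproducibility
n_periods, n_units = 40, 500
beta, noise_sd = 0.5, 20.0       # hypothetical slope and idiosyncratic noise

micro_x, micro_y, agg_x, agg_y = [], [], [], []
for t in range(1, n_periods + 1):
    x_t = float(t)
    ys = [1.0 + beta * x_t + rng.gauss(0.0, noise_sd) for _ in range(n_units)]
    micro_x.extend([x_t] * n_units)   # one observation per individual
    micro_y.extend(ys)
    agg_x.append(x_t)                 # one observation per period
    agg_y.append(sum(ys) / n_units)   # averaging cancels most of the noise

r2_micro = r_squared(micro_x, micro_y)
r2_agg = r_squared(agg_x, agg_y)
print(f"R-squared, individual-level regression: {r2_micro:.3f}")
print(f"R-squared, aggregated regression:       {r2_agg:.3f}")
```

The low individual-level fit reflects the large random component rather than a weak relationship; the slope is the same in both data sets by construction.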

Another  qualification  is  that  when  one  is  dealing  with  purely  cross-section  data, 
very  little  can  be  said  about  the  intertemporal  aspects  of  relationships  under  study. 
This  particular  aspect  of  behavior  can  be  studied  using  panel  and  transition  data. 

In  many  cases  one  is  interested  in  the  behavioral  responses  of  a  specific  group  of 
economic  agents  under  some  specified  economic  environment.  One  example  is  the 
impact  of  unemployment  insurance  on  the  job  search  behavior  of  young  unemployed 
persons.  Another  example  is  the  labor  supply  responses  of  low-income  individuals 
receiving  income  support.  Unless  microdata  are  used  such  issues  cannot  be  addressed 
directly  in  empirical  work. 


1.2.4.  Microeconomic  Foundations 

Econometric  models  vary  in  the  explicit  role  given  to  economic  theory.  At  one  end  of 
the  spectrum  there  are  models  in  which  the  a  priori  theorizing  may  play  a  dominant 
role  in  the  specification  of  the  model  and  in  the  choice  of  an  estimation  procedure.  At 
the  other  end  of  the  spectrum  are  empirical  investigations  that  make  much  less  use  of 
economic  theory. 

The  goal  of  the  analysis  in  the  first  case  is  to  identify  and  estimate  fundamental 
parameters,  sometimes  called  deep  parameters,  that  characterize  individual  taste  and 
preferences  and/or  technological  relationships.  As  a  shorthand  designation,  we  call 
this  the  structural  approach.  Its  hallmark  is  a  heavy  dependence  on  economic  theory 
and  emphasis  on  causal  inference.  Such  models  may  require  many  assumptions,  such 
as  the  precise  specification  of  a  cost  or  production  function  or  specification  of  the 
distribution  of  error  terms.  The  empirical  conclusions  of  such  an  exercise  may  not 
be  robust  with  respect  to  the  departures  from  the  assumptions.  In  Section  2.4.4  we 
shall  say  more  about  this  approach.  At  the  present  stage  we  simply  emphasize  that  if 
the  structural  approach  is  implemented  with  aggregated  data,  it  will  yield  estimates 
of  the  fundamental  parameters  only  under  very  stringent  (and  possibly  unrealistic) 
conditions.  Microdata  sets  provide  a  more  promising  environment  for  the  structural 
approach, essentially because they permit greater flexibility in model specification.

The goal of the analysis in the second case is to model relationship(s) between response
variables of interest conditionally on variables the researcher takes as given, or
exogenous. More formal definitions of endogeneity and exogeneity are given in Chapter 2.
As a shorthand designation, we call this a reduced form approach. The essential
point is that reduced form analysis does not always take into account all causal interdependencies.
A regression model in which the focus is on the prediction of y given
regressors  x,  and  not  on  the  causal  interpretation  of  the  regression  parameters,  is  often 
referred  to  as  a  reduced  form  regression.  As  will  be  explained  in  Chapter  2,  in  general 
the  parameters  of  the  reduced  form  model  are  functions  of  structural  parameters.  They 
may  not  be  interpretable  without  some  information  about  the  structural  parameters. 
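
As a minimal worked illustration of this last point, consider a hypothetical two-equation system. The symbols below are illustrative assumptions and do not follow the notation of Chapter 2: $y_1$ and $y_2$ are endogenous, $x$ is exogenous, and $u_1$, $u_2$ are errors.

```latex
% Hypothetical structural model (illustrative symbols, not from the text)
\begin{align*}
y_1 &= \gamma\, y_2 + \beta_1 x + u_1, \\
y_2 &= \beta_2 x + u_2 .
\end{align*}
% Substituting the second equation into the first gives the reduced form for y_1
\begin{align*}
y_1 &= (\beta_1 + \gamma \beta_2)\, x + (u_1 + \gamma u_2) = \pi x + v, \\
\pi &= \beta_1 + \gamma \beta_2, \qquad v = u_1 + \gamma u_2 .
\end{align*}
```

In this sketch a regression of $y_1$ on $x$ can recover $\pi$, which is enough to predict $y_1$ given $x$. However, $\pi$ alone does not reveal $\beta_1$ or $\gamma$ separately, so it has no structural interpretation without further information.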


1.2.5.  Disaggregation  and  Heterogeneity 

It  is  sometimes  said  that  many  problems  and  issues  of  macroeconometrics  arise  from 
serial  correlation  of  macro  time  series,  and  those  of  microeconometrics  arise  from 
heteroskedasticity  of  individual-level  data.  Although  this  is  a  useful  characterization  of 
the  modeling  effort  in  many  microeconometric  analyses,  it  needs  amplification  and  is 
subject  to  important  qualifications.  In  a  range  of  microeconometric  models,  modeling 
of  dynamic  dependence  may  be  an  important  issue. 

The  benefits  of  disaggregation,  which  were  emphasized  earlier  in  this  section,  come 
at  a  cost:  As  the  data  become  more  disaggregated  the  importance  of  controlling  for 
interindividual  heterogeneity  increases.  Heterogeneity,  or  more  precisely  unobserved 
heterogeneity,  plays  a  very  important  role  in  microeconometrics.  Obviously,  many 
variables  that  reflect  interindividual  heterogeneity,  such  as  gender,  race,  educational 
background,  and  social  and  demographic  factors,  are  directly  observed  and  hence  can 
be  controlled  for.  In  contrast,  differences  in  individual  motivation,  ability,  intelligence, 
and  so  forth  are  either  not  observed  or,  at  best,  imperfectly  observed. 

The  simplest  response  is  to  ignore  such  heterogeneity,  that  is,  to  absorb  it  into  the 
regression disturbance. After all, this is how one treats the myriad small unobserved
factors. This step of course increases the unexplained part of the variation. More seriously,
ignoring persistent interindividual differences leads to confounding with other
factors that are also sources of persistent interindividual differences. Confounding is
said to occur when the individual contributions of different regressors (predictor variables)
to the variation in the variable of interest cannot be statistically separated. Suppose,
for example, that the factor $x_1$ (schooling) is said to be the source of variation in
$y$ (earnings), when another variable $x_2$ (ability), which is another source of variation,
does not appear in the model. Then that part of total variation that is attributable to
the second variable may be incorrectly attributed to the first variable. Intuitively, their
relative importances are confounded. A leading source of confounding bias is the incorrect
omission of regressors from the model and the inclusion of other variables that
are  proxies  for  the  omitted  variable. 
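
The schooling and ability example can be sketched as a simulation. The data-generating process is an assumption made for illustration only: ability raises both schooling and earnings, and the "true" coefficients 0.08 and 0.20 are hypothetical. The helpers `slope` and `residuals` are illustrative names; the regression that controls for ability is computed by first partialling ability out of both schooling and earnings.

```python
import random

def slope(x, y):
    """OLS slope from a regression of y on x and a constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from an OLS regression of y on x and a constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(3)                    # fixed seed for reproducibility
n = 20000
beta_school, beta_ability = 0.08, 0.20    # hypothetical "true" effects

ability = [rng.gauss(0.0, 1.0) for _ in range(n)]
schooling = [12.0 + 1.5 * a + rng.gauss(0.0, 1.5) for a in ability]
earnings = [1.0 + beta_school * s + beta_ability * a + rng.gauss(0.0, 0.3)
            for s, a in zip(schooling, ability)]

# Ability omitted: the schooling coefficient absorbs part of the ability effect.
short_slope = slope(schooling, earnings)
# Ability controlled for, by partialling it out of both variables first.
long_slope = slope(residuals(ability, schooling), residuals(ability, earnings))

print(f"schooling coefficient, ability omitted:  {short_slope:.3f}")
print(f"schooling coefficient, ability included: {long_slope:.3f}")
```

Under these assumptions the coefficient from the regression that omits ability is well above 0.08, because part of the variation attributable to ability is credited to schooling.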

Consider,  for  example,  the  case  in  which  a  program  participation  (0/1  dummy) 
variable  D  is  included  in  the  regression  mean  function  with  a  vector  of  regressors  x, 

\[ y = \mathbf{x}'\boldsymbol{\beta} + \alpha D + u, \tag{1.1} \]
where u is an error term. The term “treatment” is used in biological and experimental
sciences  to  refer  to  an  administered  regimen  involving  participants  in  some  trial.  In 
econometrics  it  commonly  refers  to  participation  in  some  activity  that  may  impact  an 
outcome  of  interest.  This  activity  may  be  randomly  assigned  to  the  participants  or  may 
be  self-selected  by  the  participant.  Thus,  although  it  is  acknowledged  that  individuals 
choose  their  years  of  schooling,  one  still  thinks  of  years  of  schooling  as  a  “treatment” 
variable.  Suppose  that  program  participation  is  taken  to  be  a  discrete  variable.  The 
coefficient  a  of  the  “treatment  variable”  measures  the  average  impact  of  the  program 
participation  (D  =  1),  conditional  on  covariates.  If  one  does  not  control  for  unob¬ 
served  heterogeneity,  then  a  potential  ambiguity  affects  the  interpretation  of  the  results. 
If  d  is  found  to  have  a  significant  impact,  then  the  following  question  arises:  Is  a  sig¬ 
nificantly  different  from  zero  because  D  is  correlated  with  some  unobserved  variable 
that  affects  y  or  because  there  is  a  causal  relationship  between  D  and  yl  For  example, 
if  the  program  considered  is  university  education,  and  the  covariates  do  not  include  a 
measure  of  ability,  giving  a  fully  causal  interpretation  becomes  questionable.  Because 
the  issue  is  important,  more  attention  should  be  given  to  how  to  control  for  unobserved 
heterogeneity. 
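
The ambiguity in interpreting the coefficient on the participation dummy can be illustrated with a small simulation. This is a sketch under stated assumptions rather than anything from the book: the true causal effect of participation is set to zero, an unobserved "ability" term enters the outcome, and the function name `estimate_alpha` and all parameter values are hypothetical. The only difference between the two cases is whether participation is randomly assigned or self-selected on ability.

```python
import numpy as np


def estimate_alpha(self_selected, n=40_000, seed=1):
    """OLS coefficient on the dummy D in y = b0 + b1*x + alpha*D + error.

    The assumed true alpha is 0. Ability is unobserved and is left in the
    error term. If self_selected is True, high-ability individuals are
    more likely to participate; otherwise D is assigned by a coin flip.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    ability = rng.normal(size=n)
    if self_selected:
        D = (ability + rng.normal(size=n) > 0).astype(float)
    else:
        D = (rng.uniform(size=n) < 0.5).astype(float)

    y = 1.0 + 0.5 * x + 0.0 * D + ability + rng.normal(size=n)

    Z = np.column_stack([np.ones(n), x, D])  # ability is deliberately excluded
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return coef[2]


if __name__ == "__main__":
    print("random assignment:", round(estimate_alpha(False), 3))
    print("self-selection:   ", round(estimate_alpha(True), 3))
```

With random assignment the estimate is close to the assumed value of zero; with self-selection it is large and positive even though participation has no causal effect in this design.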

In  some  cases  where  dynamic  considerations  are  involved  the  type  of  data  available 
may  put  restrictions  on  how  one  can  control  for  heterogeneity.  Consider  the  example 
of two households, identical in all relevant respects except that one exhibits a systematically
higher preference for consuming good A. One could control for this by
allowing  individual  utility  functions  to  include  a  heterogeneity  parameter  that  reflects 
their  different  preferences.  Suppose  now  that  there  is  a  theory  of  consumer  behavior 
that  claims  that  consumers  become  addicted  to  good  A,  in  the  sense  that  the  more  they 
consume  of  it  in  one  period,  the  greater  is  the  probability  that  they  will  consume  more 
of it in the future. This theory provides another explanation of persistent interindividual
differences in the consumption of good A. By controlling for heterogeneous
preferences  it  becomes  possible  to  test  which  source  of  persistence  in  consumption  - 
preference  heterogeneity  or  addiction  -  accounts  for  different  consumption  patterns. 
This  type  of  problem  arises  whenever  some  dynamic  element  generates  persistence 
in  the  observed  outcomes.  Several  examples  of  this  type  of  problem  arise  in  various 
places  in  the  book. 

A  variety  of  approaches  for  modeling  heterogeneity  coexist  in  microeconometrics. 
A  brief  mention  of  some  of  these  follows,  with  details  postponed  until  later. 

An extreme solution is to ignore all unobserved interindividual differences. If unobserved
heterogeneity is uncorrelated with observed heterogeneity, and if the outcome
being  studied  has  no  intertemporal  dependence,  then  the  aforementioned  problems  will 
not  arise.  Of  course,  these  are  strong  assumptions  and  even  with  these  assumptions  not 
all  econometric  difficulties  disappear. 

One approach for handling heterogeneity is to treat it as a fixed effect and to estimate
it as a coefficient of an individual-specific 0/1 dummy variable. For example, in
a cross-section regression, each micro unit is allowed its own dummy variable (intercept).
This leads to an extreme proliferation of parameters because when a new individual
is added to the sample, a new intercept parameter is also added. Thus this approach
will not work if our data are cross sectional. The availability of multiple observations
per individual unit, most commonly in the form of panel data with T time-series observations
for each of the N cross-section units, makes it possible to either estimate
or eliminate the fixed effect, for example by first differencing if the model is linear
and the fixed effect is additive. If the model is nonlinear, as is often the case, the fixed
effect will usually not be additive and other approaches will need to be considered.
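
To make the first-differencing idea concrete, here is a minimal sketch for a linear model with an additive fixed effect. The data-generating process, the slope value of 1.5, and the function name `panel_estimates` are illustrative assumptions, not part of the book. The individual effect is correlated with the regressor, so pooled least squares that ignores it is inconsistent, whereas differencing over time removes the effect.

```python
import numpy as np


def panel_estimates(N=5_000, T=2, seed=2):
    """Compare pooled OLS with a first-difference estimator.

    Assumed model: y_it = 1.5 * x_it + a_i + e_it, where the individual
    effect a_i is time invariant and correlated with x_it.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(N, 1))                 # fixed effect, one per individual
    x = 0.7 * a + rng.normal(size=(N, T))       # regressor correlated with a_i
    y = 1.5 * x + a + rng.normal(size=(N, T))

    # Pooled OLS slope (with intercept), ignoring the individual effect.
    xc, yc = x - x.mean(), y - y.mean()
    b_pooled = (xc * yc).sum() / (xc ** 2).sum()

    # First differences: y_it - y_i,t-1 = 1.5 * (x_it - x_i,t-1) + (e_it - e_i,t-1),
    # so a_i drops out.
    dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
    b_fd = (dx * dy).sum() / (dx ** 2).sum()
    return b_pooled, b_fd
```

In this assumed design the first-difference slope is close to 1.5, while the pooled slope is pushed upward by the correlation between the regressor and the omitted individual effect.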

A second approach to modeling unobserved heterogeneity is through a random effects
model. There are a number of ways in which the random effects model can be
formulated. One popular formulation assumes that one or more regression parameters,
often just the regression intercept, vary randomly across the cross section. In another
formulation the regression error is given a component structure, with an individual-specific
random component. The random effects model then attempts to estimate the
parameters of the distribution from which the random component is drawn. In some
cases, such as demand analysis, the random term can be interpreted as random preference
variation. Random effects models can be estimated using either cross-section or
panel data.


1.2.6.  Dynamics 

A very common assumption in cross-section analysis is the absence of intertemporal
dependence, that is, an absence of dynamics. Thus, implicitly it is assumed that
the observations correspond to a stochastic equilibrium, with the deviation from the
equilibrium being represented by serially independent random disturbances. Even in
microeconometrics for some data situations such an assumption may be too strong.
For example, it is inconsistent with the presence of serially correlated unobserved heterogeneity.
Dependence on lagged dependent variables also violates this assumption.

The  foregoing  discussion  illustrates  some  of  the  potential  limitations  of  a  single 
cross-section  analysis.  Some  limitations  may  be  overcome  if  repeated  cross  sections 
are  available.  However,  if  there  is  dynamic  dependence,  the  least  problematic  approach 
might  well  be  to  use  panel  data. 


1.3.  Book  Outline 

The  book  is  split  into  six  parts.  Part  1  presents  the  issues  involved  in  microeconometric 
modeling.  Parts  2  and  3  present  general  theory  for  estimation  and  statistical  inference 
for  nonlinear  regression  models.  Parts  4  and  5  specialize  to  the  core  models  used  in 
applied  microeconometrics  for,  respectively,  cross-section  and  panel  data.  Part  6  covers 
broader  topics  that  make  considerable  use  of  material  presented  in  the  earlier  chapters. 

The  book  outline  is  summarized  in  Table  1.1.  The  remainder  of  this  section  details 
each  part  in  turn. 


1.3.1.  Part  1:  Preliminaries 

Chapters 2 and 3 expand on the special features of the microeconometric approach
to modeling and microeconomic data structures within the more general statistical
arena of regression analysis. Many of the issues raised in these chapters are pursued
throughout the book as the reader develops the necessary tools.

Table 1.1. Book Outline

Part and Chapter                                                Background(a)  Example

1. Preliminaries
  1. Overview                                                   -
  2. Causal and Noncausal Models                                -              Simultaneous equations models
  3. Microeconomic Data Structures                              -              Observational data
2. Core Methods
  4. Linear Models                                              -              Ordinary least squares
  5. Maximum Likelihood and Nonlinear Least-Squares Estimation  -              m-estimation or extremum estimation
  6. Generalized Method of Moments and Systems Estimation       5              Instrumental variables
  7. Hypothesis Tests                                           5              Wald, score, and likelihood ratio tests
  8. Specification Tests and Model Selection                    5,7            Conditional moment test
  9. Semiparametric Methods                                     -              Kernel regression
  10. Numerical Optimization                                    5              Newton-Raphson iterative method
3. Simulation-Based Methods
  11. Bootstrap Methods                                         7              Percentile t-method
  12. Simulation-Based Methods                                  5              Maximum simulated likelihood
  13. Bayesian Methods                                          -              Markov chain Monte Carlo
4. Models for Cross-Section Data
  14. Binary Outcome Models                                     5              Logit, probit for y = (0, 1)
  15. Multinomial Models                                        5,14           Multinomial logit for y = (1, ..., m)
  16. Tobit and Selection Models                                5,14           Tobit for y = max(y*, 0)
  17. Transition Data: Survival Analysis                        5              Cox proportional hazards for y = min(y*, c)
  18. Mixture Models and Unobserved Heterogeneity               5,17           Unobserved heterogeneity
  19. Models for Multiple Hazards                               5,17           Multiple hazards
  20. Models of Count Data                                      5              Poisson for y = 0, 1, 2, ...
5. Models for Panel Data
  21. Linear Panel Models: Basics                               -              Fixed and random effects
  22. Linear Panel Models: Extensions                           6,21           Dynamic and endogenous regressors
  23. Nonlinear Panel Models                                    5,6,21,22      Panel logit, Tobit, and Poisson
6. Further Topics
  24. Stratified and Clustered Samples                          5              Data (y_ij, x_ij) correlated over j
  25. Treatment Evaluation                                      5,21           Regressor d = 1 if in program
  26. Measurement Error Models                                  5              Logit model with measurement errors
  27. Missing Data and Imputation                               5              Regression with missing observations

(a) The background gives the essential chapter needed in addition to the treatment of ordinary and weighted LS in
Chapter 4. Note that the first panel data chapter (Chapter 21) requires only Chapter 4.


1.3.2.  Part  2:  Core  Methods 

Chapters 4-10 detail the main general methods used in classical estimation and statistical
inference. The results given in Chapter 5 in particular are extensively used
throughout  the  book. 

Chapter  4  presents  some  results  for  the  linear  regression  model,  emphasizing  those 
issues  and  methods  that  are  most  relevant  for  the  rest  of  the  book.  Analysis  is  relatively 
straightforward  as  there  is  an  explicit  expression  for  linear  model  estimators  such  as 
ordinary  least  squares. 

Chapters  5  and  6  present  estimation  theory  that  can  be  applied  to  nonlinear  models 
for  which  there  is  usually  no  explicit  solution  for  the  estimator.  Asymptotic  theory 
is  used  to  obtain  the  distribution  of  estimators,  with  emphasis  on  obtaining  robust 
standard  error  estimates  that  rely  on  relatively  weak  distributional  assumptions.  A  quite 
general  treatment  of  estimation,  along  with  specialization  to  nonlinear  least-squares 
and  maximum  likelihood  estimation,  is  presented  in  Chapter  5.  The  more  challenging 
generalized  method  of  moments  estimator  and  specialization  to  instrumental  variables 
estimation  are  given  separate  treatment  in  Chapter  6. 

Chapter  7  presents  classical  hypothesis  testing  when  estimators  are  nonlinear  and 
the  hypothesis  being  tested  is  possibly  nonlinear  in  parameters.  Specification  tests  in 
addition  to  hypothesis  tests  are  the  subject  of  Chapter  8. 

Chapter 9 presents semiparametric estimation methods such as kernel regression.
The leading example is flexible modeling of the conditional mean. For the patents example,
the nonparametric regression model is $\mathrm{E}[y|x] = g(x)$, where the function $g(\cdot)$
is unspecified and is instead estimated. Then estimation has an infinite-dimensional
component $g(\cdot)$ leading to a nonstandard asymptotic theory. With additional regressors
some further structure is needed and the methods are called semiparametric or
seminonparametric.
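
As a rough illustration of estimating an unspecified conditional mean function, the following sketch implements a Nadaraya-Watson kernel regression with a Gaussian kernel. The choice of kernel, the fixed user-supplied bandwidth, and the function name `kernel_regression` are assumptions made for this example; the book's own treatment in Chapter 9 may differ in these details.

```python
import numpy as np


def kernel_regression(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate of E[y|x] at each point of `grid`.

    Each fitted value is a locally weighted average of y, with Gaussian
    kernel weights that decline with the distance between x_i and the
    evaluation point, scaled by the bandwidth.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)

    u = (grid[:, None] - x[None, :]) / bandwidth   # shape (len(grid), len(x))
    weights = np.exp(-0.5 * u ** 2)                # unnormalized Gaussian kernel
    return (weights * y).sum(axis=1) / weights.sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=2_000)
    y = np.sin(x) + 0.3 * rng.normal(size=2_000)   # g(x) = sin(x) is an assumed example
    print(kernel_regression(x, y, np.array([-1.0, 0.0, 1.0]), bandwidth=0.2))
```

No functional form for the conditional mean is imposed here; the bandwidth controls the trade-off between smoothness and fidelity to the data.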

Chapter 10 presents the computational methods used to compute a parameter estimate
when the estimator is defined implicitly, usually as the solution to some first-order
conditions.
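
A sketch of one such computational method, the Newton-Raphson iteration, is given here for a Poisson regression with exponential conditional mean. The model choice, the zero starting values, the convergence tolerance, and the function name `poisson_newton_raphson` are assumptions for illustration only. The first-order conditions are $\sum_i (y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}))\mathbf{x}_i = \mathbf{0}$, which have no explicit solution for $\boldsymbol{\beta}$.

```python
import numpy as np


def poisson_newton_raphson(X, y, tol=1e-10, max_iter=50):
    """Solve the Poisson first-order conditions by Newton-Raphson.

    Update rule: beta_new = beta - H^{-1} g, where g is the gradient of
    the log-likelihood and H is its Hessian, both evaluated at beta.
    """
    beta = np.zeros(X.shape[1])                 # assumed starting values
    for _ in range(max_iter):
        mu = np.exp(X @ beta)                   # conditional mean at current beta
        gradient = X.T @ (y - mu)               # sum_i (y_i - mu_i) x_i
        hessian = -(X * mu[:, None]).T @ X      # -sum_i mu_i x_i x_i'
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.max(np.abs(step)) < tol:          # stop once the update is negligible
            break
    return beta
```

Because the Poisson log-likelihood is globally concave, the iteration typically converges in a handful of steps for well-scaled data; less well-behaved objective functions may need step-size control or better starting values.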


1.3.3.  Part  3:  Simulation-Based  Methods 

Chapters 11-13 consider methods of estimation and inference that rely on simulation.
These methods are generally more computationally intensive and, currently, less utilized
than the methods presented in Part 2.

Chapter 11 presents the bootstrap method for statistical inference. This yields the
empirical distribution of an estimator by obtaining new samples by simulation, such
as by repeated resampling with replacement from the original sample. The bootstrap
can provide a simple way to obtain standard errors when the formulas from asymptotic
theory are complex, as is the case for some two-step estimators. Furthermore, if
implemented appropriately, the bootstrap can lead to better statistical inference in
small samples.
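
The resampling idea can be sketched in a few lines. The following is an illustrative implementation of a bootstrap standard error for an arbitrary scalar estimator; the function name `bootstrap_se`, the default of 999 replications, and the assumption of independent observations stored in a NumPy array are choices made for this example.

```python
import numpy as np


def bootstrap_se(data, estimator, n_boot=999, seed=0):
    """Bootstrap standard error of `estimator` applied to `data`.

    Each replication draws a sample of the original size with replacement
    from the rows of `data`, re-computes the estimator, and the standard
    deviation across replications is returned.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # indices sampled with replacement
        draws[b] = estimator(data[idx])
    return draws.std(ddof=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.normal(loc=2.0, scale=3.0, size=400)
    print("bootstrap se of the mean:", bootstrap_se(sample, np.mean))
    print("formula se of the mean:  ", sample.std(ddof=1) / np.sqrt(len(sample)))
```

For the sample mean the bootstrap and the textbook formula should roughly agree; the practical appeal is that the same function can be applied to estimators whose analytical standard errors are cumbersome.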

Chapter  12  presents  simulation-based  estimation  methods,  developed  for  models 
that  involve  an  integral  over  a  probability  distribution  for  which  there  is  no  closed- 
form  solution.  Estimation  is  still  possible  by  making  multiple  draws  from  the  relevant 
distribution  and  averaging. 
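
A minimal sketch of replacing an intractable integral by an average over draws follows. The example, a logit probability with a normally distributed random intercept, is an assumption chosen for illustration, as are the function name `simulated_probability` and the number of draws. The probability $\int \Lambda(\text{index} + \sigma v)\,\phi(v)\,dv$, with $\Lambda$ the logistic cdf and $\phi$ the standard normal density, has no closed form.

```python
import numpy as np


def simulated_probability(index, sigma, n_draws=20_000, seed=0):
    """Approximate E_v[Lambda(index + sigma * v)] with v ~ N(0, 1).

    The integral over the distribution of v is replaced by the average of
    the logistic cdf evaluated at n_draws simulated values of v.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n_draws)
    return np.mean(1.0 / (1.0 + np.exp(-(index + sigma * v))))


if __name__ == "__main__":
    print(simulated_probability(index=0.3, sigma=1.2))
```

In an estimation setting this average would be computed for each observation at each trial parameter value, which is why such methods are computationally intensive.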

Chapter 13 presents Bayesian methods, which combine a distribution for the observed
data with a specified prior distribution for parameters to obtain a posterior distribution
of the parameters that is the basis for estimation. Recent advances make computation
possible even if there is no closed-form solution for the posterior distribution.
Bayesian analysis can provide an approach to estimation and inference that is quite different
from the classical approach. However, in many cases only the Bayesian tool kit
is adopted to permit classical estimation and inference for problems that are otherwise
intractable.


1.3.4.  Part  4:  Models  for  Cross-Section  Data 

Chapters  14-20  present  the  main  nonlinear  models  for  cross-section  data.  This  part  is 
the heart of the book and presents advanced topics such as models for limited dependent
variables and sample selection. The classes of models are defined by the range of
values  taken  by  the  dependent  variable. 

Binary data models for a dependent variable that can take only two possible values,
say y = 0 or y = 1, are presented in Chapter 14. In Chapter 15 an extension is made to
multinomial models, for a dependent variable that takes several discrete values. Examples
include employment status (employed, unemployed, and out of the labor force)
and mode of transportation to work (car, bus, or train). Linear models can be informative
but are not appropriate, as they can lead to predicted probabilities outside the unit
interval. Instead logit, probit, and related models are used.
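
The point about predicted probabilities can be seen in a small simulation. This sketch is an assumption-laden illustration, not an example from the book: binary outcomes are generated from a logit model with a single widely dispersed regressor, a linear model is fitted by least squares, and the share of fitted values falling outside the unit interval is reported. The function name and all parameter values are hypothetical.

```python
import numpy as np


def share_of_linear_fits_outside_unit_interval(n=5_000, seed=3):
    """Fit a linear model to binary data and count inadmissible predictions."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=2.0, size=n)
    p_true = 1.0 / (1.0 + np.exp(-1.5 * x))          # assumed logit probabilities
    y = (rng.uniform(size=n) < p_true).astype(float)  # binary outcome, 0 or 1

    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # least-squares fit to 0/1 data
    fitted = X @ b
    return np.mean((fitted < 0.0) | (fitted > 1.0))


if __name__ == "__main__":
    share = share_of_linear_fits_outside_unit_interval()
    print(f"share of fitted values outside [0, 1]: {share:.3f}")
```

A logit or probit specification avoids the problem by construction, since the fitted probability is a cdf evaluated at the linear index.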

Chapter 16 presents models with censoring, truncation, and sample selection. Examples
include annual hours of work, conditional on choosing to work, and hospital expenditures,
conditional on being hospitalized. In these cases the data are incompletely
observed with a bunching of observations at y = 0 and with the remaining y > 0.
The model for the observed data can be shown to be nonlinear even if the underlying
process is linear, and linear regression on the observed data can be very misleading.
Simple corrections for censoring, truncation, or sample selection such as the Tobit
model exist, but these are very dependent on distributional assumptions.
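
How misleading linear regression on incompletely observed data can be is easy to demonstrate by simulation. In the following sketch the latent process is linear with a slope of 1.0, and the observed outcome is y = max(y*, 0). The design, the function name `censoring_slopes`, and the parameter values are assumptions for illustration; no Tobit correction is attempted here.

```python
import numpy as np


def _ols_slope(x, y):
    """Slope from a least-squares regression of y on an intercept and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


def censoring_slopes(n=50_000, seed=4):
    """Return OLS slopes using latent, censored, and positive-only data.

    Assumed latent model: y* = 1.0 * x + e, with x and e standard normal.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y_star = 1.0 * x + rng.normal(size=n)    # latent outcome, never censored
    y = np.maximum(y_star, 0.0)              # observed outcome, bunched at zero

    slope_latent = _ols_slope(x, y_star)
    slope_censored = _ols_slope(x, y)        # all observations, zeros included
    positive = y > 0
    slope_positive = _ols_slope(x[positive], y[positive])  # zeros discarded
    return slope_latent, slope_censored, slope_positive
```

In this assumed design both the regression on the censored data and the regression on the positive observations alone give slopes far below the latent slope of 1.0.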

Models  for  duration  data  are  presented  in  Chapters  17-19.  An  example  is  length 
of  unemployment  spell.  Standard  regression  models  include  the  exponential,  Weibull, 
and  Cox  proportional  hazards  model.  Additionally,  as  in  Chapter  16,  the  dependent 
variable  is  often  incompletely  observed.  For  example,  the  data  may  be  on  the  length  of 
a  current  spell  that  is  incomplete,  rather  than  the  length  of  a  completed  spell. 

Chapter 20 presents count data models. Examples include various measures of
health utilization such as number of doctor visits and number of days hospitalized.
Again the model is nonlinear, as counts and hence the conditional mean are nonnegative.
Leading parametric models include the Poisson and negative binomial.


1.3.5. Part 5: Models for Panel Data

Chapters  21-23  present  methods  for  panel  data.  Here  the  data  are  observed  in  several 
time  periods  for  each  of  the  many  individuals  in  the  sample,  so  the  dependent  variable 
and  regressors  are  indexed  by  both  individual  and  time.  Any  analysis  needs  to  control 
for the likely positive correlation of error terms in different time periods for a given individual.
Additionally, panel data can provide sufficient data to control for unobserved
time-invariant  individual-specific  effects,  permitting  identification  of  causation  under 
weaker  assumptions  than  those  needed  if  only  cross-section  data  are  available. 

The  basic  linear  panel  data  model  is  presented  in  Chapter  21,  with  emphasis  on 
fixed  effects  and  random  effects  models.  Extensions  of  linear  models  to  permit  lagged 
dependent  variables  and  endogenous  regressors  are  presented  in  Chapter  22.  Panel 
methods  for  the  nonlinear  models  of  Part  4  are  presented  in  Chapter  23. 

The  panel  data  methods  are  placed  late  in  the  book  to  permit  a  unified  self-contained 
treatment. Chapter 21 could have been placed immediately after Chapter 4 and is written
in an accessible manner that relies on little more than knowledge of least-squares
estimation. 


1.3.6.  Part  6:  Further  Topics 

This  part  considers  important  topics  that  can  generally  relate  to  any  and  all  models 
covered in Parts 4 and 5. Chapter 24 deals with modeling of clustered data in several
different models. Chapter 25 discusses treatment evaluation. Treatment evaluation
is a general term that can cover a wide variety of models in which the focus is on
measuring the impact of some “treatment” that is either exogenously or randomly assigned
to an individual on some measure of interest, denoted an “outcome variable.”
Chapter  26  deals  with  the  consequences  of  measurement  errors  in  outcome  and/or 
regressor  variables,  with  emphasis  on  some  leading  nonlinear  models.  Chapter  27 
considers  some  methods  of  handling  missing  data  in  linear  and  nonlinear  regression 
models. 


1.4.  How  to  Use  This  Book 

The book assumes a basic understanding of the linear regression model with matrix
algebra. It is written at the mathematical level of the first-year economics Ph.D. sequence,
comparable to Greene (2003).

Although some of the material in this book is covered in a first-year sequence,
most of it appears in second-year econometrics Ph.D. courses or in data-oriented microeconomics
field courses such as labor economics, public economics, or industrial
organization. This book is intended to be used both as an econometrics text and as an
adjunct for such field courses. More generally, the book is intended to be useful as a
reference work for applied researchers in economics, in related social sciences such as
sociology and political science, and in epidemiology.

For readers using this book as a reference work, the models chapters have been
written to be as self-contained as possible. For the specific models presented in Parts 4
and 5 it will generally be sufficient to read the relevant chapter in isolation, except
that some command of the general estimation results in Chapter 5 and in some cases
Chapter 6 will be necessary. Most chapters are structured to begin with a discussion
and example that is accessible to a wide audience.

Table 1.2. Outline of a 20-Lecture 10-Week Course

Lectures  Chapter      Topic

1-3       4, Appx. A   Review of linear models and asymptotic theory
4-7       5            Estimation: m-estimation, ML, and NLS
8         10           Estimation: numerical optimization
9-11      14, 15       Models: binary and multinomial
12-14     16           Models: censored and truncated
15        6            Estimation: GMM
16        7            Testing: hypothesis tests
17-19     21           Models: basic linear panel
20        9            Estimation: semiparametric

For instructors using this book as a course text it is best to introduce the basic nonlinear
cross-section and linear panel data models as early as possible, skipping many
of  the  methods  chapters.  The  most  commonly  used  nonlinear  cross-section  models 
are  presented  in  Chapters  14-16;  these  require  knowledge  of  maximum  likelihood 
and  least-squares  estimation,  presented  in  Chapter  5.  Chapter  21  on  linear  panel  data 
models  requires  even  less  preparation,  essentially  just  Chapter  4. 

Table  1.2  provides  an  outline  for  a  one-quarter  second-year  graduate  course  taught 
at  the  University  of  California,  Davis,  immediately  following  the  required  first-year 
statistics  and  econometrics  sequence.  A  quarter  provides  sufficient  time  to  cover  the 
basic  results  given  in  the  first  half  of  the  chapters  in  this  outline.  With  additional  time 
one can go into further detail or cover a subset of Chapters 11-13 on computationally
intensive estimation methods (simulation-based estimation, the bootstrap, which
is  also  briefly  presented  in  Chapter  7,  and  Bayesian  methods);  additional  cross-section 
models  (durations  and  counts)  presented  in  Chapters  17-20;  and  additional  panel  data 
models  (linear  model  extensions  and  nonlinear  models)  given  in  Chapters  22  and  23. 

At Indiana University, Bloomington, a 15-week semester-long field course in microeconometrics
is based on material in most of Parts 4 and 5. The prerequisite courses
for  this  course  cover  material  similar  to  that  in  Part  2. 

Some exercises are provided at the end of each chapter after the first three introductory
chapters. These exercises are usually learning-by-doing exercises; some are
purely  methodological  whereas  others  entail  analysis  of  generated  or  actual  data.  The 
level  of  difficulty  of  the  questions  is  mostly  related  to  the  level  of  difficulty  of  the  topic. 


1.5.  Software 

There are many software packages available for data analysis. Popular packages with
strong microeconometric capabilities include LIMDEP, SAS, and STATA, all of which
offer an impressive range of canned routines and additionally support user-defined procedures
using a matrix programming language. Other packages that are also widely
used include EVIEWS, PCGIVE, and TSP. Despite their time-series orientation, these
can support some cross-section data analysis. Users who wish to do their own programming
also have available a variety of options including GAUSS, MATLAB, OX,
and SAS/IML. The latest detailed information about these packages and many others
can be efficiently located via an Internet browser and a search engine.


1.6.  Notation  and  Conventions 

Vector  and  matrix  algebra  are  used  extensively. 

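Vectors are defined as column vectors and represented using lowercase bold. For
example, for linear regression the regressor vector $\mathbf{x}$ is a $K \times 1$ column vector with $j$th
entry $x_j$ and the parameter vector $\boldsymbol{\beta}$ is a $K \times 1$ column vector with $j$th entry $\beta_j$, so

$$\underset{(K \times 1)}{\mathbf{x}} = \begin{bmatrix} x_1 \\ \vdots \\ x_K \end{bmatrix} \quad \text{and} \quad \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}.$$

Then the linear regression model $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u$ is expressed as
$y = \mathbf{x}'\boldsymbol{\beta} + u$. At times a subscript $i$ is added to denote the typical $i$th observation. The
linear regression equation for the $i$th observation is then

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i.$$

The sample is one of $N$ observations, $\{(y_i, \mathbf{x}_i),\ i = 1, \ldots, N\}$. In this book observations
are usually assumed to be independent over $i$.

Matrices are represented using uppercase bold. In matrix notation the sample is
$(\mathbf{y}, \mathbf{X})$, where $\mathbf{y}$ is an $N \times 1$ vector with $i$th entry $y_i$ and $\mathbf{X}$ is a matrix with $i$th row $\mathbf{x}_i'$,
so

$$\underset{(N \times 1)}{\mathbf{y}} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \quad \text{and} \quad \underset{(N \times \dim(\mathbf{x}))}{\mathbf{X}} = \begin{bmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_N' \end{bmatrix}.$$

The linear regression model upon stacking all $N$ observations is then

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},$$

where $\mathbf{u}$ is an $N \times 1$ column vector with $i$th entry $u_i$.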

Matrix notation is compact but at times it is clearer to write products of matrices
as summations of products of vectors. For example, the OLS estimator can be equivalently
written in either of the following ways:

$$\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \left( \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \sum_{i=1}^{N} \mathbf{x}_i y_i.$$

Table 1.3. Commonly Used Acronyms and Abbreviations

Linear     OLS      Ordinary least squares
           GLS      Generalized least squares
           FGLS     Feasible generalized least squares
           IV       Instrumental variables
           2SLS     Two-stage least squares
           3SLS     Three-stage least squares
Nonlinear  NLS      Nonlinear least squares
           FGNLS    Feasible generalized nonlinear least squares
           NIV      Nonlinear instrumental variables
           NL2SLS   Nonlinear two-stage least squares
           NL3SLS   Nonlinear three-stage least squares
General    LS       Least squares
           ML       Maximum likelihood
           QML      Quasi-maximum likelihood
           GMM      Generalized method of moments
           GEE      Generalized estimating equations
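
The equivalence between the matrix form and the summation form of the OLS estimator given earlier in this section can be verified directly. The sketch below is illustrative only; the function names `ols_matrix` and `ols_summation` are hypothetical, and the explicit loop is written for clarity rather than speed.

```python
import numpy as np


def ols_matrix(X, y):
    """OLS computed as (X'X)^{-1} X'y using the stacked data matrices."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ols_summation(X, y):
    """OLS computed as (sum_i x_i x_i')^{-1} sum_i x_i y_i, one observation at a time."""
    K = X.shape[1]
    sum_xx = np.zeros((K, K))
    sum_xy = np.zeros(K)
    for x_i, y_i in zip(X, y):          # x_i is the K-vector for observation i
        sum_xx += np.outer(x_i, x_i)    # x_i x_i' is a K x K matrix
        sum_xy += x_i * y_i             # x_i y_i is a K-vector
    return np.linalg.solve(sum_xx, sum_xy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)
    print(ols_matrix(X, y))
    print(ols_summation(X, y))
```

The two functions return the same coefficients up to floating-point rounding, since $\mathbf{X}'\mathbf{X} = \sum_i \mathbf{x}_i\mathbf{x}_i'$ and $\mathbf{X}'\mathbf{y} = \sum_i \mathbf{x}_i y_i$.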

Generic notation for a parameter is the $q \times 1$ vector $\boldsymbol{\theta}$. The regression parameters
are represented by the $K \times 1$ vector $\boldsymbol{\beta}$, which may equal $\boldsymbol{\theta}$ or may be a subset of $\boldsymbol{\theta}$
depending on the context.

The book uses many abbreviations and acronyms. Table 1.3 summarizes abbreviations
used for some common estimation methods, ordered by whether the estimator is
developed for linear or nonlinear regression models. We also use the following: dgp
(data-generating process), iid (independently and identically distributed), pdf (probability
density function), cdf (cumulative distribution function), L (likelihood), ln L
(log-likelihood), FE (fixed effects), and RE (random effects).


Causal and Noncausal Models


2.1.  Introduction 

Microeconometrics  deals  with  the  theory  and  applications  of  methods  of  data  analysis 
developed  for  microdata  pertaining  to  individuals,  households,  and  firms.  A  broader 
definition  might  also  include  regional-  and  state-level  data.  Microdata  are  usually 
either  cross  sectional,  in  which  case  they  refer  to  conditions  at  the  same  point  in 
time,  or  longitudinal  (panel)  in  which  case  they  refer  to  the  same  observational  units 
over  several  periods.  Such  observations  are  generated  from  both  nonexperimental 
setups,  such  as  censuses  and  surveys,  and  quasi-experimental  or  experimental  setups, 
such  as  social  experiments  implemented  by  governments  with  the  participation  of 
volunteers. 

A microeconometric model may be a full specification of the probability distribution
of a set of microeconomic observations; it may also be a partial specification of
some  distributional  properties,  such  as  moments,  of  a  subset  of  variables.  The  mean  of 
a  single  dependent  variable  conditional  on  regressors  is  of  particular  interest. 

There are several objectives of microeconometrics. They include both data description
and causal inference. The first can be defined broadly to include moment properties
of response variables, or regression equations that highlight associations rather
than causal relations. The second category includes causal relationships that aim at
measurement and/or empirical confirmation or refutation of conjectures and propositions
regarding microeconomic behavior. The type and style of empirical investigations
therefore span a wide spectrum. At one end of the spectrum can be found very highly
structured models, derived from detailed specification of the underlying economic behavior,
that analyze causal (behavioral) or structural relationships for interdependent
microeconomic variables. At the other end are reduced form studies that aim to uncover
correlations and associations among variables, without necessarily relying on
a detailed specification of all relevant interdependencies. Both approaches share the
common goal of uncovering important and striking relationships that could be helpful
in understanding microeconomic behavior, but they differ in the extent to which they
rely on economic theory to guide their empirical investigations.

As a subdiscipline microeconometrics is newer than macroeconometrics, which is
concerned with modeling of market and aggregate data. A great deal of the early
work in applied econometrics was based on aggregate time-series data collected by
government agencies. Much of the early work on statistical demand analysis up until
about 1940 used market rather than individual or household data (Hendry and Morgan,
1996). Morgan’s (1990) book on the history of econometric ideas makes no reference
to microeconometric work before the 1940s, with one important exception. That exception
is the work on household budget data that was instigated by concern with the
living standards of the less well-off in many countries. This led to the collection of
household budget data that provided the raw material for some of the earlier microeconometric
studies such as those pioneered by Allen and Bowley (1935). Nevertheless,
it is only since the 1950s that microeconometrics has emerged as a distinctive and recognized
subdiscipline. Even into the 1960s the core of microeconometrics consisted
of demand analyses based on household surveys.

With  the  award  of  the  year  2000  Nobel  Prize  in  Economics  to  James  Heckman 
and  Daniel  McFadden  for  their  contributions  to  microeconometrics,  the  subject  area 
has  achieved  clear  recognition  as  a  distinct  subdiscipline.  The  award  cited  Heckman 
“for  his  development  of  theory  and  methods  for  analyzing  selective  samples”  and 
McFadden  “for  his  development  of  theory  and  methods  for  analyzing  discrete  choice.” 
Examples of the type of topics that microeconometrics deals with were also mentioned
in the citation: “. . . what factors determine whether an individual decides to
work  and,  if  so,  how  many  hours?  How  do  economic  incentives  affect  individual 
choices  regarding  education,  occupation  or  place  of  residence?  What  are  the  effects 
of  different  labor-market  and  educational  programs  on  an  individual’s  income  and 
employment?” 

Applications  of  microeconometric  methods  can  be  found  not  only  in  every  area  of 
microeconomics  but  also  in  other  cognate  social  sciences  such  as  political  science, 
sociology,  and  geography. 

Beginning with the 1970s, and especially within the past two decades, revolutionary
advances in our capacity for handling large data sets and associated computations
have taken place. These, together with the accompanying explosion in the availability
of large microeconomic data sets, have greatly expanded the scope of microeconometrics.
As a result, although empirical demand analysis continues to be one of the
most important areas of application for microeconometric methods, its style and content
have been heavily influenced by newer methods and models. Further, applications
in  economic  development,  finance,  health,  industrial  organization,  labor  and  public 
economics,  and  applied  microeconomics  generally  are  now  commonplace,  and  these 
applications  will  be  encountered  at  various  places  in  this  book. 

The  primary  focus  of  this  book  is  on  the  newer  material  that  has  emerged  in  the 
past three decades. Our goal is to survey concepts, models, and methods that we regard
as standard components of a modern microeconometrician’s tool kit. Of course,
the  notion  of  standard  methods  and  models  is  inevitably  both  subjective  and  elastic, 
being  a  function  of  the  presumed  clientele  of  this  book  as  well  as  the  authors’  own 
backgrounds.  There  may  also  be  topics  we  regard  as  too  advanced  for  an  introductory 
book such as this that others would place in a different category.

Microeconometrics focuses on the complications of nonlinear models and on obtaining
estimates that can be given a structural interpretation. Much of this book, especially
Parts 2-4, presents methods for nonlinear models. These nonlinear methods
overlap with many areas of applied statistics including biostatistics. By contrast, the
distinguishing feature of econometrics is the emphasis placed on causal modeling.
This chapter introduces the key concepts related to causal (and noncausal) modeling,
concepts that are germane to both linear and nonlinear models.

Sections  2.2  and  2.3  introduce  the  key  concepts  of  structure  and  exogeneity. 
Section  2.4  uses  the  linear  simultaneous  equations  model  as  a  specific  illustration 
of  a  structural  model  and  connects  it  with  the  other  important  concepts  of  reduced 
form  models.  Identification  definitions  are  given  in  Section  2.5.  Section  2.6  considers 
single-equation  structural  models.  Section  2.7  introduces  the  potential  outcome  model 
and  compares  the  causal  parameters  and  interpretations  in  the  potential  outcome  model 
with those in the simultaneous equations model. Section 2.8 provides a brief discussion
of modeling and estimation strategies designed to handle computational and data
challenges. 


2.2.  Structural  Models 


Structure  consists  of 

1.  a  set  of  variables  W  (“data”)  partitioned  for  convenience  as  [Y  Z]; 

2.  a joint probability distribution of W, F(W);

3.  an  a  priori  ordering  of  W  according  to  hypothetical  cause-and-effect  relationships  and 
specification  of  a  priori  restrictions  on  the  hypothesized  model;  and 

4.  a  parametric,  semiparametric,  or  nonparametric  specification  of  functional  forms  and 
the  restrictions  on  the  parameters  of  the  model. 

This  general  description  of  a  structural  model  is  consistent  with  a  well-established 
Cowles  Commission  definition  of  a  structure.  For  example,  Sargan  (1988,  p.  27)  states: 

A  model  is  the  specification  of  the  probability  distribution  for  a  set  of  observations. 

A  structure  is  the  specification  of  the  parameters  of  that  distribution.  Therefore,  a 
structure  is  a  model  in  which  all  the  parameters  are  assigned  numerical  values. 

We consider the case in which the modeling objective is to explain the values of an
observable vector-valued variable $y$, $y' = (y_1, \ldots, y_G)$. Each element of $y$ is a
function of some other elements of $y$ and of explanatory variables $z$ and a purely random
disturbance $u$. Note that the variables $y$ are assumed to be interdependent. By contrast,
interdependence between the $z_i$ is not modeled. The $i$th observation satisfies the set of
implicit equations

$$ g(y_i, z_i, u_i \mid \theta) = 0, \tag{2.1} $$

where $g$ is a known function. We refer to this as the structural model, and we refer to
$\theta$ as structural parameters. This corresponds to property 4 given earlier in this section.

Assume that there is a unique solution for $y_i$ for every $(z_i, u_i)$. Then we can write
the equations in an explicit form with $y$ as a function of $(z, u)$:

$$ y_i = f(z_i, u_i \mid \pi). \tag{2.2} $$

This is referred to as the reduced form of the structural model, where $\pi$ is a vector
of reduced form parameters that are functions of $\theta$. The reduced form is obtained
by solving the structural model for the endogenous variables $y_i$ given $(z_i, u_i)$.

If the objective of modeling is inference about elements of $\theta$, then (2.1) provides a
direct route. This involves estimation of the structural model. However, because
elements of $\pi$ are functions of $\theta$, (2.2) also provides an indirect route to inference on $\theta$.
If $f(z_i, u_i \mid \pi)$ has a known functional form, and if it is additively separable in $z_i$ and $u_i$,
such that we can write

$$ y_i = g(z_i \mid \pi) + u_i = E[y_i \mid z_i] + u_i, \tag{2.3} $$

then the regression of $y$ on $z$ is a natural prediction function for $y$ given $z$. In this
sense the reduced form equation has a useful role for making conditional predictions
of $y_i$ given $(z_i, u_i)$. To generate predictions of the left-hand-side variable for assigned
values of the right-hand-side variables of (2.2) requires estimates of $\pi$, which may be
computationally simpler to obtain.

An important extension of (2.3) is the transformation model, which for scalar $y$
takes the form

$$ \Lambda(y) = z'\pi + u, \tag{2.4} $$

where $\Lambda(y)$ is a transformation function (e.g., $\Lambda(y) = \ln(y)$ or $\Lambda(y) = y^{1/2}$). In some
cases the transformation function may depend on unknown parameters. A transformation
model is distinct from a regression, but it too can be used to make estimates
of $E[y \mid z]$. An important example is the accelerated failure time model analyzed in
Chapter 17.
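
To see why a transformation model is distinct from a regression, it may help to simulate one. The sketch below assumes a log transformation, $\ln(y) = \pi_0 + \pi_1 z + u$, with a normal error; the parameter values, sample size, and function name are illustrative assumptions and do not come from the text. It fits the log-linear model by least squares and then compares two predictors of the mean of $y$: the naive $\exp(z'\hat\pi)$ and a version multiplied by $\exp(\hat\sigma^2/2)$, a correction that is valid under the normality assumed here.

```python
import numpy as np

def simulate_log_model(n=200_000, pi0=1.0, pi1=0.5, sigma=0.8, seed=0):
    """Draw (z, y) from ln(y) = pi0 + pi1*z + u, u ~ N(0, sigma^2) (assumed design)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=n)
    u = rng.normal(0.0, sigma, size=n)
    return z, np.exp(pi0 + pi1 * z + u)

z, y = simulate_log_model()

# Least-squares fit of the transformed outcome ln(y) on a constant and z.
X = np.column_stack([np.ones_like(z), z])
pi_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
index = X @ pi_hat
sigma2_hat = np.mean((np.log(y) - index) ** 2)

# Two candidate predictors of E[y | z], averaged over the sample.
naive_mean = np.exp(index).mean()                        # ignores the error term
adjusted_mean = np.exp(index + 0.5 * sigma2_hat).mean()  # lognormal correction

print(f"sample mean of y    : {y.mean():.3f}")
print(f"naive exp(z'pi)     : {naive_mean:.3f}")
print(f"with exp(sigma^2/2) : {adjusted_mean:.3f}")
```

The naive predictor falls short of the sample mean because the expectation of the exponentiated error is greater than one. With a different error distribution the correction factor would differ.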

One  of  the  most  important,  and  potentially  controversial,  steps  in  the  specification 
of  the  structural  model  is  property  3,  in  which  an  a  priori  ordering  of  variables  into 
causes  and  effects  is  assigned.  In  essence  this  involves  drawing  a  distinction  between 
those  variables  whose  variation  the  model  is  designed  to  explain  and  those  whose 
variation  is  externally  determined  and  hence  lie  outside  the  scope  of  investigation.  In 
microeconometrics,  examples  of  the  former  are  years  of  schooling  and  hours  worked; 
examples  of  the  latter  are  gender,  ethnicity,  age,  and  similar  demographic  variables. 
The  former,  denoted  y,  are  referred  to  as  endogenous  and  the  latter,  denoted  z,  are 
called  exogenous  variables. 

Exogeneity of a variable is an important simplification because in essence it justifies
the decision to treat that variable as ancillary, and not to model that variable
because the parameters of that relationship have no direct bearing on the variable
under study. This important notion needs a more formal definition, which we now
provide.


2.3. Exogeneity

We begin by considering the representation of a general finite dimensional parametric
case in which the joint distribution of $W$, with parameters $\theta$ partitioned as $(\theta_1, \theta_2)$, is
factored into the conditional density of $Y$ given $Z$, and the marginal distribution of $Z$,
giving

$$ f_J(W \mid \theta) = f_C(Y \mid Z, \theta) \times f_M(Z \mid \theta). \tag{2.5} $$

A special case of this result occurs if

$$ f_J(W \mid \theta) = f_C(Y \mid Z, \theta_1) \times f_M(Z \mid \theta_2), $$

where $\theta_1$ and $\theta_2$ are functionally independent. Then we say that $Z$ is exogenous with
respect to $\theta_1$; this means that knowledge of $f_M(Z \mid \theta_2)$ is not required for inference on
$\theta_1$, and hence we can validly condition the distribution of $Y$ on $Z$.

Models can always be reparameterized. So next consider the case in which the
model is reparameterized in terms of parameters $\varphi$, with a one-to-one transformation
of $\theta$, say $\varphi = h(\theta)$, where $\varphi$ is partitioned into $(\varphi_1, \varphi_2)$. This reparameterization may
be of interest if, for example, $\varphi_1$ is structurally invariant to a class of policy interventions.
Suppose $\varphi_1$ is the parameter of interest. In such a case one is interested in the
exogeneity of $Z$ with respect to $\varphi_1$. Then, the condition for exogeneity is that

$$ f_J(W \mid \varphi) = f_C(Y \mid Z, \varphi_1) \times f_M(Z \mid \varphi_2), \tag{2.6} $$

where $\varphi_1$ is independent of $\varphi_2$.

Finally consider the case in which the interest is in a parameter $\lambda$ that is a function
of $\varphi$, say $h(\varphi)$. Then for exogeneity of $Z$ with respect to $\lambda$, we need two conditions:
(i) $\lambda$ depends only on $\varphi_1$, i.e., $\lambda = h(\varphi_1)$, and hence only the conditional distribution is
of interest; and (ii) $\varphi_1$ and $\varphi_2$ are “variation free,” which means that the parameters of
the joint distribution are not subject to cross-restrictions, i.e.,
$(\varphi_1, \varphi_2) \in \Phi_1 \times \Phi_2 = \{\varphi_1 \in \Phi_1, \varphi_2 \in \Phi_2\}$.

The factorization in (2.5)-(2.6) plays an important role in the development of the
exogeneity concept. Of special interest in this book are the following three concepts
related to exogeneity: (1) weak exogeneity; (2) Granger noncausality; (3) strong
exogeneity.

Definition 2.1 (Weak Exogeneity): $Z$ is weakly exogenous for $\lambda$ if (i) and (ii) hold.

If the marginal model parameters are uninformative for inference on $\lambda$, then inference
on $\lambda$ can proceed on the basis of the conditional distribution $f(Y \mid Z, \varphi_1)$ alone.
The operational implication is that weakly exogenous variables can be taken as given
if one’s main interest is in inference on $\lambda$ or $\varphi_1$. This does not mean that there is no
statistical model for $Z$; it means that the parameters of that model play no role in the
inference on the parameters of interest, and hence are irrelevant.
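
The practical content of weak exogeneity can be illustrated with a small numerical sketch. The Gaussian model, parameter values, and variable names below are assumptions chosen for illustration: $Y$ given $Z$ is normal with mean $\varphi_1 Z$ and unit variance, and $Z$ is normal with mean $\varphi_2$ and unit variance, with $\varphi_1$ and $\varphi_2$ variation free. Because the joint log-likelihood is the sum of a term in $\varphi_1$ only and a term in $\varphi_2$ only, maximizing it jointly should give the same $\varphi_1$ as maximizing the conditional part alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
phi1_true, phi2_true = 1.5, 2.0           # hypothetical values
z = rng.normal(phi2_true, 1.0, size=n)    # marginal model for Z
y = phi1_true * z + rng.normal(size=n)    # conditional model for Y given Z

# Conditional ML for phi1 (least squares through the origin); phi2 is never used.
phi1_cond = (z @ y) / (z @ z)

# Joint log-likelihood on a grid, up to additive constants.
grid1 = np.linspace(1.0, 2.0, 201)
grid2 = np.linspace(1.5, 2.5, 201)
cond_part = np.array([-0.5 * np.sum((y - p1 * z) ** 2) for p1 in grid1])
marg_part = np.array([-0.5 * np.sum((z - p2) ** 2) for p2 in grid2])
joint = cond_part[:, None] + marg_part[None, :]

i, j = np.unravel_index(np.argmax(joint), joint.shape)
phi1_joint, phi2_joint = grid1[i], grid2[j]

print(f"conditional ML phi1: {phi1_cond:.4f}")
print(f"joint ML on grid   : phi1={phi1_joint:.4f}, phi2={phi2_joint:.4f}")
```

The grid search is used only to make the separation visible; the two estimates of $\varphi_1$ agree up to the grid spacing.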


2.3.1. Conditional Independence

Originally,  the  Granger  causality  concept  was  defined  in  the  context  of  prediction  in  a 
time-series  environment.  More  generally,  it  can  be  interpreted  as  a  form  of  conditional 
independence  (Holland,  1986,  p.  957). 

Partition $z$ into two subsets $z_1$ and $z_2$; let $W = [y, z_1, z_2]$ be the matrices of variables
of interest. Then $z_1$ and $y$ are conditionally independent given $z_2$ if

$$ f(y \mid z_1, z_2) = f(y \mid z_2). \tag{2.8} $$

This is stronger than the mean independence assumption, which would imply

$$ E[y \mid z_1, z_2] = E[y \mid z_2]. \tag{2.9} $$

Then $z_1$ has no predictive value for $y$, after conditioning on $z_2$. In a predictive sense
this means that $z_1$ does not Granger-cause $y$.

In a time-series context, $z_1$ and $z_2$ would be mutually exclusive lagged values of
subsets of $y$.
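
A simulated example may help fix the idea of mean independence. The design below is an assumption made for illustration: $y$ depends on $z_2$ only, while $z_1$ is correlated with $z_2$. Marginally $z_1$ helps predict $y$, but once $z_2$ is included its least-squares coefficient should be close to zero. Note that this checks only the linear, mean-independence implication, which is weaker than full conditional independence.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
z2 = rng.normal(size=n)
z1 = 0.8 * z2 + rng.normal(scale=0.6, size=n)  # z1 is correlated with z2
y = 2.0 * z2 + rng.normal(size=n)              # y depends on z2 only (assumed)

def ols(y, *regressors):
    """Least-squares coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

slope_z1_alone = ols(y, z1)[1]         # z1 predicts y when z2 is ignored
coef_z1_given_z2 = ols(y, z1, z2)[1]   # but not after conditioning on z2

print(f"slope on z1, z2 omitted : {slope_z1_alone:.3f}")
print(f"slope on z1, z2 included: {coef_z1_given_z2:.3f}")
```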

Definition 2.2 (Strong Exogeneity): $z_1$ is strongly exogenous for $\varphi$ if it is
weakly exogenous for $\varphi$ and does not Granger-cause $y$, so that (2.8) holds.


2.3.2.  Exogenizing  Variables 

Exogeneity  is  a  strong  assumption.  It  is  a  property  of  random  variables  relative  to 
parameters  of  interest.  Hence  a  variable  may  be  validly  treated  as  exogenous  in  one 
structural  model  but  not  in  another;  the  key  issue  is  the  parameters  that  are  the  subject 
of inference. Arbitrary imposition of this property will have some undesirable consequences
that will be discussed in Section 2.4.

The  exogeneity  assumption  may  be  justified  by  a  priori  theorizing,  in  which  case  it 
is  a  part  of  the  maintained  hypothesis  of  the  model.  It  may  in  some  cases  be  justified 
as  a  valid  approximation,  in  which  case  it  may  be  subject  to  testing,  as  discussed  in 
Section  8.4.3.  In  cross-section  analysis  it  may  be  justified  as  being  a  consequence  of 
a natural experiment or a quasi-experiment in which the value of the variable is determined
by an external intervention; for example, a government or regulatory authority
may determine the setting of a tax rate or a policy parameter. Of special interest is the
case in which an external intervention results in a change in the value of an important
policy variable. Such a natural experiment is tantamount to exogenization of some
variable.  As  we  shall  see  in  Chapter  3,  this  creates  a  quasi-experimental  opportunity  to 
study  the  impact  of  a  variable  in  the  absence  of  other  complicating  factors. 


2.4.  Linear  Simultaneous  Equations  Model 

An  important  special  case  of  the  general  structural  model  specified  in  (2.1)  is  the  linear 
simultaneous  equation  model  developed  by  the  Cowles  Commission  econometricians. 
Comprehensive treatment of this model is available in many textbooks (e.g., Sargan,
1988). The treatment here is brief and selective; also see Section 6.9.6. The objective is
to bring into the discussion several key ideas and concepts that have more general relevance.
Although the analysis is restricted to linear models, many insights are routinely
applied  to  nonlinear  models. 


2.4.1.  The  SEM  Setup 

The linear simultaneous equations model (SEM) setup is as follows:

$$
\begin{aligned}
y_{1i}\beta_{11} + \cdots + y_{Gi}\beta_{1G} + z_{1i}\gamma_{11} + \cdots + z_{Ki}\gamma_{1K} &= u_{1i} \\
&\;\;\vdots \\
y_{1i}\beta_{G1} + \cdots + y_{Gi}\beta_{GG} + z_{1i}\gamma_{G1} + \cdots + z_{Ki}\gamma_{GK} &= u_{Gi},
\end{aligned}
$$

where $i$ is the observation subscript.

A clear a priori distinction or preordering is made between endogenous variables,
$y_i' = (y_{1i}, \ldots, y_{Gi})$, and exogenous variables, $z_i' = (z_{1i}, \ldots, z_{Ki})$. By definition the
exogenous variables are uncorrelated with the purely random disturbances $(u_{1i}, \ldots, u_{Gi})$.
In its unrestricted form every variable enters every equation.

In matrix notation, the $G$-equation SEM for the $i$th observation is written as

$$ y_i' B + z_i' \Gamma = u_i', \tag{2.10} $$

where $y_i$, $B$, $z_i$, $\Gamma$, and $u_i$ have dimensions $G \times 1$, $G \times G$, $K \times 1$, $K \times G$, and $G \times 1$,
respectively. For specified values of $(B, \Gamma)$ and $(z_i, u_i)$, the $G$ linear simultaneous
equations can in principle be solved for $y_i$.
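
As a concrete illustration of solving (2.10) for the endogenous variables, the sketch below uses NumPy with $G = 2$ and $K = 2$. All numerical values are hypothetical. Transposing (2.10) gives $B' y_i = u_i - \Gamma' z_i$, a linear system that has a unique solution when $B$ is nonsingular.

```python
import numpy as np

# Hypothetical structural matrices: B is G x G, Gamma is K x G.
B = np.array([[1.0, -0.5],
              [-0.3, 1.0]])
Gamma = np.array([[-1.0, 0.0],
                  [0.0, -2.0]])

z_i = np.array([1.0, 0.5])    # exogenous variables for observation i
u_i = np.array([0.2, -0.1])   # structural disturbances for observation i

# Solve B' y_i = u_i - Gamma' z_i rather than forming the inverse of B.
y_i = np.linalg.solve(B.T, u_i - Gamma.T @ z_i)

# The solution should satisfy the structural equations y_i'B + z_i'Gamma = u_i'.
residual = y_i @ B + z_i @ Gamma - u_i
print("y_i =", y_i)
print("max structural residual =", np.abs(residual).max())
```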

The  standard  assumptions  of  SEM  are  as  follows: 

1.  B  is  nonsingular  and  has  rank  G. 

2.  rank[$Z$] = $K$. The $N \times K$ matrix $Z$ is formed by stacking $z_i'$, $i = 1, \ldots, N$.

3.  plim $N^{-1} Z'Z = \Sigma_{zz}$ is a symmetric $K \times K$ positive definite matrix.

4.  $u_i \sim \mathcal{N}[0, \Sigma]$; that is, $E[u_i] = 0$ and $E[u_i u_i'] = \Sigma = [\sigma_{ij}]$, where $\Sigma$ is a symmetric
$G \times G$ positive definite matrix.

5.  The  errors  in  each  equation  are  serially  independent. 

In this model the structure (or structural parameters) consists of $(B, \Gamma, \Sigma)$. Writing

$$
Y = \begin{bmatrix} y_1' \\ \vdots \\ y_N' \end{bmatrix}, \quad
Z = \begin{bmatrix} z_1' \\ \vdots \\ z_N' \end{bmatrix}, \quad
U = \begin{bmatrix} u_1' \\ \vdots \\ u_N' \end{bmatrix}
$$

allows us to express the structural model more compactly as

$$ YB + Z\Gamma = U, \tag{2.11} $$

where the arrays $Y$, $B$, $Z$, $\Gamma$, and $U$ have dimensions $N \times G$, $G \times G$, $N \times K$, $K \times G$,
and $N \times G$, respectively. Solving for all the endogenous variables in terms of all
the exogenous variables, we obtain the reduced form of the SEM:

$$
\begin{aligned}
Y + Z\Gamma B^{-1} &= UB^{-1}, \\
Y &= Z\Pi + V,
\end{aligned} \tag{2.12}
$$

where $\Pi = -\Gamma B^{-1}$ and $V = UB^{-1}$. Given Assumption 4, $v_i \sim \mathcal{N}[0, (B^{-1})' \Sigma B^{-1}]$.
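
The covariance of the reduced form disturbance follows in one step. The short derivation below is added for convenience and uses only the definition of $V$ in (2.12) and Assumption 4; $v_i'$ denotes the $i$th row of $V$.

$$
\begin{aligned}
v_i' = u_i' B^{-1} \;&\Longrightarrow\; v_i = (B^{-1})' u_i, \\
V[v_i] = E[v_i v_i'] &= (B^{-1})' \, E[u_i u_i'] \, B^{-1} = (B^{-1})' \Sigma B^{-1}.
\end{aligned}
$$

Normality of $v_i$ then follows because $v_i$ is a linear transformation of the normal vector $u_i$.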

In  the  SEM  framework  the  structural  model  has  primacy  for  several  reasons.  First, 
the equations themselves have interpretations as economic relationships such as demand
or supply relations, production functions, and so forth, and they are subject to
restrictions of economic theory. Consequently, $B$ and $\Gamma$ are parameters that describe
economic  behavior.  Hence  a  priori  theory  can  be  invoked  to  form  expectations  about 
the  sign  and  size  of  individual  coefficients.  By  contrast,  the  unrestricted  reduced  form 
parameters  are  potentially  complicated  functions  of  the  structural  parameters,  and  as 
such  it  may  be  difficult  to  evaluate  them  postestimation.  This  consideration  may  have 
little  weight  if  the  goal  of  econometric  modeling  is  prediction  rather  than  inference  on 
parameters  with  behavioral  interpretation. 

Consider, without loss of generality, the first equation in the model (2.11), with $y_1$
as the dependent variable. In addition, some of the remaining $G - 1$ endogenous variables
and $K - 1$ exogenous variables may be absent from this equation. From (2.12)
we see that in general the endogenous variables $Y$ depend stochastically on $V$, which
in turn is a function of the structural errors $U$. Therefore, in general plim $N^{-1} Y'U \neq 0$.
Generally, the application of the least-squares estimator in the simultaneous equation
setting yields inconsistent estimates. This is a well-known and basic result from the
simultaneous equations literature, often referred to as the “simultaneous equations bias”
problem. The vast literature on simultaneous equations models deals with identification
and consistent estimation when the least-squares approach fails; see Sargan (1988)
and Schmidt (1976), and Section 6.9.6.
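
The inconsistency of least squares is easy to reproduce by simulation. The two-equation system below is a hypothetical design chosen for illustration: $y_1 = \beta y_2 + u$ and $y_2 = y_1 + z$, with $\beta = 0.5$ and independent standard normal $z$ and $u$. Solving gives $y_2 = (z + u)/(1 - \beta)$, so $y_2$ is correlated with $u$. Under these assumptions the least-squares slope converges to $(\beta \sigma_z^2 + \sigma_u^2)/(\sigma_z^2 + \sigma_u^2) = 0.75$ rather than to $0.5$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta = 0.5                      # hypothetical structural coefficient
z = rng.normal(size=n)          # exogenous variable
u = rng.normal(size=n)          # structural error, independent of z

# Reduced form of the assumed system y1 = beta*y2 + u, y2 = y1 + z.
y2 = (z + u) / (1.0 - beta)
y1 = beta * y2 + u

# Least-squares slope from regressing y1 on y2 (with an intercept).
cov = np.cov(y1, y2)
b_ols = cov[0, 1] / cov[1, 1]

# Sample analogue of the probability limit of N^{-1} Y'U, which is nonzero here.
mean_y2_u = np.mean(y2 * u)

print(f"true beta  : {beta:.3f}")
print(f"OLS slope  : {b_ols:.3f}")
print(f"mean(y2*u) : {mean_y2_u:.3f}")
```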

The reduced form of SEM expresses every endogenous variable as a linear function
of all exogenous variables and all structural disturbances. The reduced form disturbances
are linear combinations of the structural disturbances. From the reduced form
for the $i$th observation

$$ E[y_i' \mid z_i] = z_i' \Pi, \tag{2.13} $$

$$ V[y_i \mid z_i] = \Omega = (B^{-1})' \Sigma B^{-1}. \tag{2.14} $$

The reduced form parameters $\Pi$ are derived parameters defined as functions of the
structural parameters. If $\Pi$ can be consistently estimated then the reduced form can
be used to make predictive statements about variations in $Y$ due to exogenous changes
in $Z$. This is possible even if $B$ and $\Gamma$ are not known. Given the exogeneity of $Z$,
the full set of reduced form regressions is a multivariate regression model that can be
estimated consistently by least squares. The reduced form provides a basis for making
conditional predictions of $Y$ given $Z$.
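
The claim that the reduced form can be estimated consistently by least squares can be checked numerically. The sketch below simulates a hypothetical two-equation SEM (all parameter values are assumptions), regresses each column of $Y$ on $Z$ without using the structural matrices, and compares the estimates with the implied reduced form parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50_000

# Hypothetical structure (B, Gamma, Sigma) with G = 2 and K = 2.
B = np.array([[1.0, -0.5],
              [-0.3, 1.0]])
Gamma = np.array([[-1.0, 0.0],
                  [0.0, -2.0]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

Z = rng.normal(size=(N, 2))
U = rng.multivariate_normal(np.zeros(2), Sigma, size=N)

# Generate Y from the structural model YB + Z Gamma = U.
B_inv = np.linalg.inv(B)
Y = (U - Z @ Gamma) @ B_inv

# Multivariate least squares of Y on Z estimates Pi column by column.
Pi_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
Pi_true = -Gamma @ B_inv

# The residual covariance estimates Omega = (B^{-1})' Sigma B^{-1}.
V_hat = Y - Z @ Pi_hat
Omega_hat = V_hat.T @ V_hat / N
Omega_true = B_inv.T @ Sigma @ B_inv

print("Pi_hat =\n", Pi_hat.round(3))
print("Pi_true =\n", Pi_true.round(3))
```

Only `Y` and `Z` enter the estimation step; the structural matrices are used solely to generate the data and to compute the benchmark values.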

The restricted reduced form is the unrestricted reduced form model subject to restrictions.
If these are the same restrictions as those that apply to the structure, then
structural information can be recovered from the reduced form.

In the SEM framework, the unknown structural parameters, the nonzero elements
of $B$, $\Gamma$, and $\Sigma$, play a key role because they reflect the causal structure of the
model. The interdependence between endogenous variables is described by $B$, and
the responses of endogenous variables to exogenous shocks in $Z$ are reflected in the
parameter matrix $\Gamma$. In this setup the causal parameters of interest are those that
measure the direct marginal impact of a change in an explanatory variable, $y_j$ or
$z_k$, on the outcome of interest $y_l$, $l \neq j$, and functions of such parameters and data.
The elements of $\Sigma$ describe the dispersion and dependence properties of the random
disturbances, and hence they measure some properties of the way the data are
generated.


2.4.2.  Causal  Interpretation  in  SEM 

A simple example will illustrate the causal interpretation of parameters in SEM. The
structural model has two continuous endogenous variables $y_1$ and $y_2$, a single continuous
exogenous variable $z_1$, one stochastic relationship linking $y_1$ and $y_2$, and one
definitional identity linking all three variables in the model:

$$
\begin{aligned}
y_1 &= \gamma_1 + \beta_1 y_2 + u_1, \qquad 0 < \beta_1 < 1, \\
y_2 &= y_1 + z_1.
\end{aligned}
$$

In this model $u_1$ is a stochastic disturbance, independent of $z_1$, with a well-defined
distribution. The parameter $\beta_1$ is subject to an inequality constraint that is also a part
of the model specification. The variable $z_1$ is exogenous and therefore its variation is
induced by external sources that we may regard as interventions. These interventions
have a direct impact on $y_2$ through the identity and also an indirect one through the
first equation. The impact is measured by the reduced form of the model, which is

$$
\begin{aligned}
y_1 &= \frac{\gamma_1}{1-\beta_1} + \frac{\beta_1}{1-\beta_1} z_1 + \frac{1}{1-\beta_1} u_1 = E[y_1 \mid z_1] + v_1, \\
y_2 &= \frac{\gamma_1}{1-\beta_1} + \frac{1}{1-\beta_1} z_1 + \frac{1}{1-\beta_1} u_1 = E[y_2 \mid z_1] + v_1,
\end{aligned}
$$

where $v_1 = u_1/(1-\beta_1)$. The reduced form coefficients $\beta_1/(1-\beta_1)$ and $1/(1-\beta_1)$
have a causal interpretation. Any externally induced unit change in $z_1$ will cause the
values of $y_1$ and $y_2$ to change by these amounts. Note that in this model $y_1$ and $y_2$ also
respond to $u_1$. In order not to confound the impact of the two sources of variation we
require that $z_1$ and $u_1$ are independent.

Also note that

$$
\frac{\partial y_1}{\partial y_2} = \frac{\beta_1}{1-\beta_1} \div \frac{1}{1-\beta_1}
= \frac{\partial y_1}{\partial z_1} \div \frac{\partial y_2}{\partial z_1}.
$$

In what sense does $\beta_1$ measure the causal effect of $y_2$ on $y_1$? To see a possible difficulty,
observe that $y_1$ and $y_2$ are interdependent or jointly determined, so it is unclear
in what sense $y_2$ “causes” $y_1$. Although $z_1$ (and $u_1$) is the ultimate cause of changes
in the reduced form sense, $y_2$ is a proximate or an intermediate cause of $y_1$. That is,
the first structural equation provides a snapshot of the impact of $y_2$ on $y_1$, whereas
the reduced form gives the (equilibrium) impact after allowing for all interactions between
the endogenous variables to work themselves out. In a SEM framework even
endogenous variables are viewed as causal variables, and their coefficients as causal
parameters. This approach can cause puzzlement for those who view causality in an
experimental setting where independent sources of variation are the causal variables.
The SEM approach makes sense if $y_2$ has an independent and exogenous source of
variation, which in this model is $z_1$. Hence the marginal response coefficient $\beta_1$ is a
function of how $y_1$ and $y_2$ respond to a change in $z_1$, as the immediately preceding
equation makes clear.
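
The ratio of reduced form responses suggests a simple numerical check. The sketch below is an illustration under assumed values (an intercept of 1, a structural slope of 0.5, and independent standard normal exogenous variable and disturbance), not an estimator proposed in the text. It estimates the two reduced form slopes by least squares and takes their ratio, which should be close to the structural slope.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
gamma1, beta1 = 1.0, 0.5       # hypothetical structural parameters
z1 = rng.normal(size=n)
u1 = rng.normal(size=n)        # independent of z1 by construction

# Reduced form of y1 = gamma1 + beta1*y2 + u1, y2 = y1 + z1.
y1 = (gamma1 + beta1 * z1 + u1) / (1.0 - beta1)
y2 = y1 + z1

def slope(y, x):
    """Least-squares slope of y on x (regression with an intercept)."""
    c = np.cov(y, x)
    return c[0, 1] / c[1, 1]

dy1_dz1 = slope(y1, z1)        # estimates beta1 / (1 - beta1)
dy2_dz1 = slope(y2, z1)        # estimates 1 / (1 - beta1)
beta1_from_ratio = dy1_dz1 / dy2_dz1

print(f"dy1/dz1 = {dy1_dz1:.3f}, dy2/dz1 = {dy2_dz1:.3f}")
print(f"ratio   = {beta1_from_ratio:.3f}")
```

The check relies on the exogenous variable being independent of the disturbance; if that fails, neither reduced form slope has the causal interpretation used here.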

Of  course  this  model  is  but  a  special  case.  More  generally,  we  may  ask  under  what 
conditions  will  the  SEM  parameters  have  a  meaningful  causal  interpretation.  We  return 
to  this  issue  when  discussing  identification  concepts  in  Section  2.5. 


2.4.3.  Extensions  to  Nonlinear  and  Latent  Variable  Models 

If  the  simultaneous  model  is  nonlinear  in  parameters  only,  the  structural  model  can 
be  written  as 


$$ YB(\theta) + Z\Gamma(\theta) = U, \tag{2.15} $$

where $B(\theta)$ and $\Gamma(\theta)$ are matrices whose elements are functions of the structural parameters
$\theta$. An explicit reduced form can be derived as before.

If  nonlinearity  is  instead  in  variables  then  an  explicit  (analytical)  reduced  form 
may  not  be  possible,  although  linearized  approximations  or  numerical  solutions  of  the 
dependent  variables,  given  (z,  u),  can  usually  be  obtained. 
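
When the model is nonlinear in variables, the reduced form can still be evaluated numerically for given values of the exogenous variables and disturbances. The two-equation system below is hypothetical and was chosen so that a simple fixed-point iteration converges; in other models a Newton-type solver may be needed, and a unique solution is not guaranteed.

```python
import math

def solve_endogenous(z, u1, u2, tol=1e-12, max_iter=500):
    """Solve the assumed system
           y1 = 0.5 * tanh(y2) + z + u1
           y2 = 0.3 * y1 + u2
       for (y1, y2) by fixed-point iteration."""
    y1, y2 = 0.0, 0.0
    for _ in range(max_iter):
        y1_new = 0.5 * math.tanh(y2) + z + u1
        y2_new = 0.3 * y1_new + u2
        if abs(y1_new - y1) + abs(y2_new - y2) < tol:
            return y1_new, y2_new
        y1, y2 = y1_new, y2_new
    raise RuntimeError("fixed-point iteration did not converge")

y1, y2 = solve_endogenous(z=1.0, u1=0.2, u2=-0.1)

# Both structural equations should hold at the solution.
res1 = y1 - (0.5 * math.tanh(y2) + 1.0 + 0.2)
res2 = y2 - (0.3 * y1 - 0.1)
print(f"y1 = {y1:.6f}, y2 = {y2:.6f}")
print(f"residuals: {res1:.2e}, {res2:.2e}")
```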

Many  microeconometric  models  involve  latent  or  unobserved  variables  as  well  as 
observed  endogenous  variables.  For  example,  search  and  auction  theory  models  use  the 
concept  of  reservation  wage  or  reservation  price,  choice  models  invoke  indirect  utility, 
and  so  forth.  In  the  case  of  such  models  the  structural  model  (2.1)  may  be  replaced  by 

$$ g(y_i^*, z_i, u_i \mid \theta) = 0, \tag{2.16} $$

where the latent variables $y_i^*$ replace the observed variables $y_i$. The corresponding
reduced form solves for $y_i^*$ in terms of $(z_i, u_i)$, yielding

$$ y_i^* = f(z_i, u_i \mid \pi). \tag{2.17} $$

This reduced form has limited usefulness as $y_i^*$ is not fully observed. However, if we
have functions $y_i = h(y_i^*)$ that relate observable with latent counterparts of $y_i$, then the
reduced form in terms of observables is

$$ y_i = h(f(z_i, u_i \mid \pi)). \tag{2.18} $$

See  Section  16.8.2  for  further  details. 
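
Two common choices of the observation rule can make the reduced form in terms of observables concrete. The sketch below assumes a linear latent reduced form with a standard normal disturbance and a hypothetical coefficient of 1. It applies a censoring rule, where the observed outcome is the latent one when positive and zero otherwise, and a binary rule, where only the sign of the latent outcome is observed. Neither rule is specific to the text.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
pi = 1.0                         # hypothetical reduced form parameter
z0 = 0.5                         # hold z fixed to study y given z
z = np.full(n, z0)
u = rng.normal(size=n)

y_star = z * pi + u              # latent reduced form

y_censored = np.maximum(y_star, 0.0)     # h(y*) = max(y*, 0)
y_binary = (y_star > 0).astype(float)    # h(y*) = 1 if y* > 0, else 0

# Under normality, Pr[y_binary = 1 | z] equals the standard normal cdf at z*pi.
normal_cdf = 0.5 * (1.0 + math.erf(z0 * pi / math.sqrt(2.0)))

print(f"share censored at zero : {(y_censored == 0).mean():.3f}")
print(f"mean of binary outcome : {y_binary.mean():.3f}")
print(f"normal cdf at z*pi     : {normal_cdf:.3f}")
```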

When the structural model involves nonlinearities in variables, or when latent variables
are involved, an explicit derivation of the functional form of this reduced form
may be difficult to obtain. In such cases practitioners use approximations. By citing
mathematical or computational convenience, a specific functional form may be used
to relate an endogenous variable to all exogenous variables, and the result would be
referred to as a “reduced form type relationship.”


2.4.4.  Interpretations  of  Structural  Relationships 

Marschak  (1953,  p.  26)  in  an  influential  essay  gave  the  following  definition  of  a 
structure: 

Structure was defined as a set of conditions which did not change while observations
were being made but which might change in future. If a specified change of structure
is expected or intended, prediction of variables of interest to the policy maker
requires some knowledge of past structure. ... In economics, the conditions that constitute
a structure are (1) a set of relations describing human behavior and institutions
as well as technological laws and involving, in general, nonobservable random disturbances
and nonobservable random errors of measurement; (2) the joint probability
distribution of these random quantities.

Marschak  argued  that  the  structure  was  fundamental  for  a  quantitative  evaluation  or 
tests  of  economic  theory  and  that  the  choice  of  the  best  policy  requires  knowledge  of 
the  structure. 

In  the  SEM  literature  a  structural  model  refers  to  “autonomous”  (not  “derived”) 
relationships.  There  are  other  closely  related  concepts  of  a  structure.  One  such  concept 
refers  to  “deep  parameters,”  by  which  is  meant  technology  and  preference  parameters 
that  are  invariant  to  interventions. 

In  recent  years  an  alternative  usage  of  the  term  structure  has  emerged,  one  that  refers 
to  econometric  models  based  on  the  hypothesis  of  dynamic  stochastic  optimization  by 
rational agents. In this approach the starting point for any structural estimation problem
is the first-order necessary conditions that define the agent’s optimizing behavior.
For  example,  in  a  standard  problem  of  maximizing  utility  subject  to  constraints,  the 
behavioral  relations  are  the  deterministic  first-order  marginal  utility  conditions.  If  the 
relevant  functional  forms  are  explicitly  stated,  and  stochastic  errors  of  optimization  are 
introduced,  then  the  first-order  conditions  define  a  behavioral  model  whose  parameters 
characterize  the  utility  function  -  the  so-called  deep  or  policy-invariant  parameters. 
Examples  are  given  in  Sections  6.2.7  and  16.8.1. 

Two features of this highly structural approach should be mentioned. First, it
relies on a priori economic theory in a serious manner. Economic theory is not used
simply to generate a list of relevant variables that one can use in a more or less arbitrarily
specified functional form. Rather, the underlying economic theory has a major
(but not exclusive) role in specification, estimation, and inference. The second feature
is that identification, specification, and estimation of the resulting model can be very
complicated, because the agent’s optimization problem is potentially very complex,
especially if dynamic optimization under uncertainty is postulated and discreteness
and discontinuities are present; see Rust (1994).


2.5.  Identification  Concepts 

The goal of the SEM approach is to consistently estimate $(B, \Gamma, \Sigma)$ and conduct statistical
inference. An important precondition for consistent estimation is that the model
should be identified. We briefly discuss the important twin concepts of observational
equivalence and identifiability in the context of parametric models.

Identification is concerned with determination of a parameter given sufficient observations.
In this sense, it is an asymptotic concept. Statistical uncertainty necessarily
affects any inference based on a finite number of observations. By hypothetically considering
the possibility that a sufficient number of observations are available, it is possible
to consider whether it is logically possible to determine a parameter of interest
either in the sense of its point value or in the sense of determining the set of which
the parameter is an element. Therefore, identification is a fundamental consideration
and logically occurs prior to and is separate from statistical estimation. A great deal of
econometric  literature  on  identification  focuses  on  point  identification.  This  is  also  the 
emphasis  of  this  section.  However,  set  identification,  or  bounds  identification,  is  an 
important  approach  that  will  be  used  in  selected  places  in  this  book  (e.g.,  Chapters  25 
and  27;  see  Manski,  1995). 

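Definition 2.3 (Observational Equivalence): Two structures of a model defined
as a joint probability distribution function $\Pr[\mathbf{x} \mid \boldsymbol{\theta}]$, $\mathbf{x} \in W$, $\boldsymbol{\theta} \in \Theta$, are
observationally equivalent if $\Pr[\mathbf{x} \mid \boldsymbol{\theta}^1] = \Pr[\mathbf{x} \mid \boldsymbol{\theta}^2]$ $\forall\, \mathbf{x} \in W$.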

Less  formally,  if,  given  the  data,  two  structural  models  imply  identical  joint  proba¬ 
bility  distributions  of  the  variables,  then  the  two  structures  are  observationally  equiva¬ 
lent.  The  existence  of  multiple  observationally  equivalent  structures  implies  the  failure 
of  identification. 

Definition 2.4 (Identification): A structure $\boldsymbol{\theta}^0$ is identified if there is no other
observationally equivalent structure in $\Theta$.

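A simple example of nonidentification occurs when there is perfect collinearity
between regressors in the linear regression $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$. Then we can identify the linear
combination $\mathbf{C}\boldsymbol{\beta}$, where $\operatorname{rank}[\mathbf{C}] < \dim[\boldsymbol{\beta}]$, but we cannot identify $\boldsymbol{\beta}$ itself.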
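
The collinearity example can be checked numerically. The following is a minimal sketch, using only the Python standard library, in which the regressor values and the two coefficient vectors are hypothetical choices made for illustration: the second regressor is exactly twice the first, so two different coefficient vectors produce the same fitted values, while one linear combination of the coefficients is the same under both.

```python
# Perfect collinearity: the second regressor is exactly twice the first.
x1 = [1.0, 2.0, 3.0, 4.0]
X = [[v, 2.0 * v] for v in x1]  # rows of the regressor matrix


def fitted(X, beta):
    """Return the vector X * beta for a list-of-rows matrix X."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


beta_a = [1.0, 1.0]  # hypothetical structure A
beta_b = [3.0, 0.0]  # hypothetical structure B

# Both structures imply the same conditional mean, so the data cannot
# distinguish them: they are observationally equivalent.
same_fit = fitted(X, beta_a) == fitted(X, beta_b)

# The linear combination c'beta with c = (1, 2) takes the same value under
# both structures, so it is the identified quantity in this example.
c = [1.0, 2.0]
combo_a = sum(ci * bi for ci, bi in zip(c, beta_a))
combo_b = sum(ci * bi for ci, bi in zip(c, beta_b))
```

Here the single row c plays the role of C, with rank 1, which is smaller than the two elements of the coefficient vector.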

This definition concerns uniqueness of the structure. In the context of the SEM
we have given, this definition means that identification requires that there is a unique
triple $(\mathbf{B}, \boldsymbol{\Gamma}, \boldsymbol{\Sigma})$ consistent with the observed data. In SEM, as in other cases,
identification involves being able to obtain unique estimates of structural parameters given
the sample moments of the data. For example, in the case of the reduced form (2.12),
under the stated assumptions the least-squares estimator provides unique estimates of
$\boldsymbol{\Pi}$, that is, $\widehat{\boldsymbol{\Pi}} = [\mathbf{Z}'\mathbf{Z}]^{-1}\mathbf{Z}'\mathbf{Y}$, and identification of $\mathbf{B}, \boldsymbol{\Gamma}$ requires that there is a solution
for the unknown elements of $\boldsymbol{\Gamma}$ and $\mathbf{B}$ from the equations $\boldsymbol{\Pi} + \boldsymbol{\Gamma}\mathbf{B}^{-1} = \mathbf{0}$, given a
priori restrictions on the model. A unique solution implies just identification of the
model.

A  complete  model  is  said  to  be  identified  if  all  the  model  parameters  are  identified. 
It  is  possible  that  for  some  models  only  a  subset  of  parameters  is  identified.  In  some 
situations  it  may  be  important  to  be  able  to  identify  some  function  of  parameters,  and 
not necessarily all the individual parameters. Identification of a function of parameters
means that the function can be recovered uniquely from $F(\mathbf{W} \mid \Theta)$.

How  does  one  ensure  that  the  structures  of  alternative  model  specifications  can  be 
“ruled  out”?  In  SEM  the  solution  to  this  problem  depends  on  augmenting  the  sample 
information by a priori restrictions on $(\mathbf{B}, \boldsymbol{\Gamma}, \boldsymbol{\Sigma})$. The a priori restrictions must
introduce sufficient additional information into the model to rule out the existence of other
observationally  equivalent  structures. 

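The need for a priori restrictions is demonstrated by the following argument. First
note that given the assumptions of Section 2.4.1 the reduced form, defined by $(\boldsymbol{\Pi}, \boldsymbol{\Omega})$,
is always unique. Initially suppose there are no restrictions on $(\mathbf{B}, \boldsymbol{\Gamma}, \boldsymbol{\Sigma})$. Next suppose
that there are two observationally equivalent structures $(\mathbf{B}_1, \boldsymbol{\Gamma}_1, \boldsymbol{\Sigma}_1)$ and $(\mathbf{B}_2, \boldsymbol{\Gamma}_2, \boldsymbol{\Sigma}_2)$.
Then

$$\boldsymbol{\Pi} = -\boldsymbol{\Gamma}_1\mathbf{B}_1^{-1} = -\boldsymbol{\Gamma}_2\mathbf{B}_2^{-1}. \tag{2.19}$$

Let $\mathbf{H}$ be a $G \times G$ nonsingular matrix. Then $\boldsymbol{\Gamma}_1\mathbf{B}_1^{-1} = \boldsymbol{\Gamma}_1\mathbf{H}\mathbf{H}^{-1}\mathbf{B}_1^{-1} = \boldsymbol{\Gamma}_2\mathbf{B}_2^{-1}$, which
means that $\boldsymbol{\Gamma}_2 = \boldsymbol{\Gamma}_1\mathbf{H}$, $\mathbf{B}_2 = \mathbf{B}_1\mathbf{H}$. Thus the second structure is a linear transformation
of the first.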
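
The argument can be verified with exact arithmetic. The sketch below is illustrative only: it assumes G = 2 endogenous and two exogenous variables, and the matrices B1, Gamma1, and H are arbitrary hypothetical values. It builds a second structure by postmultiplying the first by H and confirms that both imply the same reduced-form matrix, computed as minus Gamma times the inverse of B.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inv2(M):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def reduced_form(B, Gamma):
    """Reduced-form coefficients Pi = -Gamma * B^{-1}."""
    return [[-v for v in row] for row in matmul(Gamma, inv2(B))]


# Hypothetical first structure.
B1 = [[F(1), F(1, 2)], [F(-1, 3), F(1)]]
Gamma1 = [[F(2), F(0)], [F(1), F(-1)]]

# Any nonsingular G x G matrix generates an equivalent structure.
H = [[F(1), F(2)], [F(0), F(3)]]
B2 = matmul(B1, H)
Gamma2 = matmul(Gamma1, H)

Pi1 = reduced_form(B1, Gamma1)
Pi2 = reduced_form(B2, Gamma2)
```

The two structures differ, yet Pi1 and Pi2 coincide exactly, which is why the reduced form alone cannot distinguish them without restrictions that rule out such an H.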

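The SEM solution to this problem is to introduce restrictions on $(\mathbf{B}, \boldsymbol{\Gamma}, \boldsymbol{\Sigma})$ such
that we can rule out the existence of linear transformations that lead to observationally
equivalent structures. In other words, the restrictions on $(\mathbf{B}, \boldsymbol{\Gamma}, \boldsymbol{\Sigma})$ must be such
that there is no matrix $\mathbf{H}$ that would yield another structure with the same reduced
form; given $(\boldsymbol{\Pi}, \boldsymbol{\Omega})$ there will be a unique solution to the equations $\boldsymbol{\Pi} = -\boldsymbol{\Gamma}\mathbf{B}^{-1}$ and
$\boldsymbol{\Omega} = (\mathbf{B}^{-1})'\boldsymbol{\Sigma}\mathbf{B}^{-1}$.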

In practice a variety of restrictions can be imposed including (1) normalizations,
such as setting diagonal elements of $\mathbf{B}$ equal to 1, (2) zero (exclusion) and linear
homogeneous and inhomogeneous restrictions, and (3) covariance and inequality
restrictions. Details of the necessary and sufficient conditions for identification in linear and
nonlinear models can be found in many texts including Sargan (1988).

Meaningful  imposition  of  identifying  restrictions  requires  that  the  a  priori  restric¬ 
tions  imposed  should  be  valid  a  posteriori.  This  idea  is  pursued  further  in  several  chap¬ 
ters  where  identification  issues  are  considered  (e.g.,  Section  6.9). 

Exclusion  restrictions  essentially  state  that  the  model  contains  some  variables  that 
have zero impact on some endogenous variables. That is, certain directions of
causation are ruled out a priori. This makes it possible to identify other directions of
causation. For example, in the simple two-variable example given earlier, $z_1$ did not enter
the $y_1$-equation, making it possible to identify the direct impact of $y_2$ on $y_1$. Although
exclusion  restrictions  are  the  simplest  to  apply,  in  parametric  models  identification  can 
also  be  secured  by  inequality  restrictions  and  covariance  restrictions. 


If there are no restrictions on $\boldsymbol{\Sigma}$, and the diagonal elements of $\mathbf{B}$ are normalized to
1, then a necessary condition for identification is the order condition, which states
that the number of excluded exogenous variables must at least equal the number of
included endogenous variables. A sufficient condition is the rank condition given in
many texts, which ensures that for the $j$th equation the relation $\boldsymbol{\Pi}\mathbf{B}_j = -\boldsymbol{\Gamma}_j$ yields a unique
solution for $(\boldsymbol{\Gamma}_j, \mathbf{B}_j)$ given $\boldsymbol{\Pi}$.

Given  identification,  the  term  just  (exact)  identification  refers  to  the  case  when 
the  order  condition  is  exactly  satisfied;  overidentification  refers  to  the  case  when  the 
number  of  restrictions  on  the  system  exceeds  that  required  for  exact  identification. 
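
The counting rule behind the order condition is simple enough to write down directly. The helper below is a hypothetical illustration, not a complete identification test: it only compares the two counts named in the text, taking "included endogenous variables" to mean those on the right-hand side of the normalized equation. Because the order condition is necessary but not sufficient, the rank condition must still be checked separately.

```python
def order_condition(n_excluded_exog, n_included_endog):
    """Classify one equation by the order condition (necessary, not sufficient).

    n_excluded_exog  -- exogenous variables of the system excluded from the equation
    n_included_endog -- endogenous variables on the equation's right-hand side
    """
    if n_excluded_exog < n_included_endog:
        return "order condition fails"
    if n_excluded_exog == n_included_endog:
        return "order condition holds exactly"
    return "order condition holds with overidentifying restrictions"


# One right-hand-side endogenous variable and zero, one, or two excluded
# exogenous variables.
results = [order_condition(k, 1) for k in (0, 1, 2)]
```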

Identification  in  nonlinear  SEM  has  been  discussed  in  Sargan  (1988),  who  also 
gives  references  to  earlier  related  work. 


2.6.  Single-Equation  Models 

Without loss of generality consider the first equation of a linear SEM subject to
normalization $\beta_{11} = 1$. Let $y = y_1$, let $\mathbf{y}_1$ denote the endogenous components of $\mathbf{y}$ other
than $y_1$, and let $\mathbf{z}_1$ denote the exogenous components of $\mathbf{z}$ with

$$y = \mathbf{y}_1'\boldsymbol{\alpha} + \mathbf{z}_1'\boldsymbol{\gamma} + u. \tag{2.20}$$

Many  studies  skip  the  formal  steps  involved  in  going  from  a  system  to  a  single  equation 
and  begin  by  writing  the  regression  equation 

$$y = \mathbf{x}'\boldsymbol{\beta} + u,$$

where some components of $\mathbf{x}$ are endogenous (implicitly $\mathbf{y}_1$) and others are exogenous
(implicitly $\mathbf{z}_1$). The focus lies then on estimating the impact of changes in key
regressor(s) that may be endogenous or exogenous, depending on the assumptions.
Instrumental variable or two-stage least-squares estimation is the most obvious estimation
strategy (see Sections 4.8, 6.4, and 6.5).
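
To make the contrast with least squares concrete, the following simulation is a minimal sketch under assumed values: one endogenous regressor x, one instrument z, a true slope of 2, and a shock e shared by x and the error u. All of these are hypothetical choices. With a single regressor and a single instrument, the instrumental variable estimator reduces to a ratio of sample covariances and coincides with two-stage least squares.

```python
import random

rng = random.Random(0)
n = 20000
beta = 2.0  # true slope (assumed for the simulation)

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument, independent of u
e = [rng.gauss(0, 1) for _ in range(n)]  # shock shared by x and u
x = [0.8 * zi + ei + rng.gauss(0, 1) for zi, ei in zip(z, e)]
u = [ei + rng.gauss(0, 1) for ei in e]   # error correlated with x through e
y = [1.0 + beta * xi + ui for xi, ui in zip(x, u)]


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


b_ols = cov(x, y) / cov(x, x)  # inconsistent because cov(x, u) != 0
b_iv = cov(z, y) / cov(z, x)   # consistent if z is a valid instrument
```

In this design the least-squares slope is pushed above 2 by the shared shock, whereas the instrumental variable estimate should lie close to 2.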

In the SEM approach it is natural to specify at least some of the remaining
equations in the model, even if they are not the focus of inquiry. Suppose $\mathbf{y}_1$ has
dimension 1. Then the first possibility is to specify the structural equation for $\mathbf{y}_1$ and for
the other endogenous variables that may appear in this structural equation for $\mathbf{y}_1$.
A second possibility is to specify the reduced form equation for $\mathbf{y}_1$. This will show
exogenous variables that affect $\mathbf{y}_1$ but do not directly affect $y$. An advantage is that
in  such  a  setting  instrumental  variables  emerge  naturally.  However,  in  recent  empir¬ 
ical  work  using  instrumental  variables  in  a  single-equation  setting,  even  the  formal 
step  of  writing  down  a  reduced  form  for  the  endogenous  right-hand-side  variable  is 
avoided. 


2.7.  Potential  Outcome  Model 

Motivation  for  causal  inference  in  econometric  models  is  especially  strong  when  the 
focus  is  on  the  impact  of  public  policy  and/or  private  decision  variables  on  some 
specific  outcomes.  Specific  examples  include  the  impact  of  transfer  payments  on  labor
supply,  the  impact  of  class  size  on  student  learning,  and  the  impact  of  health  insurance 
on  utilization  of  health  care.  In  many  cases  the  causal  variables  themselves  reflect 
individual  decisions  and  hence  are  potentially  endogenous.  When,  as  is  usually  the 
case,  econometric  estimation  and  inference  are  based  on  observational  data,  iden¬ 
tification  of  and  inference  on  causal  parameters  pose  many  challenges.  These  chal¬ 
lenges  can  become  potentially  less  serious  if  the  causal  issues  are  addressed  using 
data  from  a  controlled  social  experiment  with  a  proper  statistical  design.  Although 
such  experiments  have  been  implemented  (see  Section  3.3  for  examples  and  details) 
they  are  generally  expensive  to  organize  and  run.  Therefore,  it  is  more  attractive 
to  implement  causal  modeling  using  data  generated  by  a  natural  experiment  or  in 
a  quasi-experimental  setting.  Section  3.4  discusses  the  pros  and  cons  of  these  data 
structures;  but  for  present  purposes  one  should  think  of  a  natural  or  quasi  experi¬ 
ment  as  a  setting  in  which  some  causal  variable  changes  exogenously  and  indepen¬ 
dently  of  other  explanatory  variables,  making  it  relatively  easier  to  identify  causal 
parameters. 

A  major  obstacle  for  causality  modeling  stems  from  the  fundamental  problem  of 
causal  inference  (Holland,  1986).  Let  X  be  the  hypothesized  cause  and  Y  the  outcome. 
By manipulating the value of X we can change the value of Y. Suppose the value of X
is changed from $x_1$ to $x_2$. Then a measure of the causal impact of the change on Y is
formed by comparing the two values of Y: $y_2$, which results from the change, and $y_1$,
which would have resulted had no change in X occurred. However, if X did change,
then the value of Y, in the absence of the change, would not be observed. Hence
nothing more can be said about causal impact without some hypothesis about what value
Y would have assumed in the absence of the change in X. The latter is referred to
as a counterfactual, which means a hypothetical unobserved value. Briefly stated, all
causal  inference  involves  comparison  of  a  factual  with  a  counterfactual  outcome.  In 
the  conventional  econometric  model  (e.g.,  SEM)  a  counterfactual  does  not  need  to  be 
explicitly  stated. 

A  relatively  newer  strand  in  the  microeconometric  literature  -  program  evalua¬ 
tion  or  treatment  evaluation  -  provides  a  statistical  framework  for  the  estimation 
of  causal  parameters.  In  the  statistical  literature  this  framework  is  also  known  as  the 
Rubin  causal  model  (RCM)  in  recognition  of  a  key  early  contribution  by  Rubin 
(1974,  1978),  who  in  turn  cites  R.A.  Fisher  as  originator  of  the  approach.  Al¬ 
though,  following  recent  convention,  we  refer  to  this  as  the  Rubin  causal  model, 
Neyman  (Splawa-Neyman)  also  proposed  a  similar  statistical  model  in  an  article 
published  in  Polish  in  1923;  see  Neyman  (1990).  Models  involving  counterfactuals 
have  been  independently  developed  in  econometrics  following  the  seminal  work  of 
Roy  (1951).  In  the  remainder  of  this  section  the  salient  features  of  RCM  will  be 
analyzed. 

Causal  parameters  based  on  counterfactuals  provide  statistically  meaningful  and 
operational  definitions  of  causality  that  in  some  respects  differ  from  the  traditional 
Cowles  foundation  definition.  First,  in  ideal  settings  this  framework  leads  to  consider¬ 
able  simplicity  of  econometric  methods.  Second,  this  framework  typically  focuses  on 
the  fewer  causal  parameters  that  are  thought  to  be  most  relevant  to  policy  issues  that
are  examined.  This  contrasts  with  the  traditional  econometric  approach  that  focuses 
simultaneously  on  all  structural  parameters.  Third,  the  approach  provides  additional 
insights  into  the  properties  of  causal  parameters  estimated  by  the  standard  structural 
methods. 


2.7.1.  The  Rubin  Causal  Model 


The  term  “treatment”  is  used  interchangeably  with  “cause.”  In  medical  studies  of  new 
drug  evaluation,  involving  groups  of  those  who  receive  the  treatment  and  those  who 
do  not,  the  drug  response  of  the  treated  is  compared  with  that  of  the  untreated.  A  mea¬ 
sure  of  causal  impact  is  the  average  difference  in  the  outcomes  of  the  treated  and  the 
nontreated  groups.  In  economics,  the  term  treatment  is  used  very  broadly.  Essentially 
it  covers  variables  whose  impact  on  some  outcome  is  the  object  of  study.  Examples  of 
treatment-outcome  pairs  include  schooling  and  wages,  class  size  and  scholastic  per¬ 
formance,  and  job  training  and  earnings.  Note  that  a  treatment  need  not  be  exogenous, 
and  in  many  situations  it  is  an  endogenous  (choice)  variable. 

Within the framework of a potential outcome model (POM), which assumes that
every element of the target population is potentially exposed to the treatment, the triple
$(y_{1i}, y_{0i}, D_i)$, $i = 1, \ldots, N$, forms the basis of treatment evaluation. The categorical
variable $D$ takes the values 1 and 0, respectively, when treatment is or is not received;
$y_{1i}$ measures the response for individual $i$ receiving treatment, and $y_{0i}$ measures that
when not receiving treatment. That is,

$$y_i = \begin{cases} y_{1i} & \text{if } D_i = 1, \\ y_{0i} & \text{if } D_i = 0. \end{cases}$$

Since  the  receipt  and  nonreceipt  of  treatment  are  mutually  exclusive  states  for  indi¬ 
vidual  i,  only  one  of  the  two  measures  is  available  for  any  given  i,  the  unavailable 
measure being the counterfactual. The effect of the cause $D$ on the outcome of individual
$i$ is measured by $(y_{1i} - y_{0i})$. The average causal effect of $D_i = 1$, relative to $D_i = 0$,
is measured by the average treatment effect (ATE):

$$\mathrm{ATE} = \mathrm{E}[y \mid D = 1] - \mathrm{E}[y \mid D = 0], \tag{2.22}$$

where  expectations  are  with  respect  to  the  probability  distribution  over  the  target  pop¬ 
ulation.  Unlike  the  conventional  structural  model  that  emphasizes  marginal  effects,  the 
POM  framework  emphasizes  ATE  and  parameters  related  to  it. 

The  experimental  approach  to  the  estimation  of  ATE-type  parameters  involves  a 
random  assignment  of  treatment  followed  by  a  comparison  of  the  outcomes  with  a 
set  of  nontreated  cases  that  serve  as  controls.  Such  an  experimental  design  is  explained 
in  greater  detail  in  Chapter  3.  Random  assignment  implies  that  individuals  exposed  to 
treatment  are  chosen  randomly,  and  hence  the  treatment  assignment  does  not  depend 
on  the  outcome  and  is  uncorrelated  with  the  attributes  of  treated  subjects.  Two  ma¬ 
jor  simplifications  follow.  The  treatment  variable  can  be  treated  as  exogenous  and  its 
coefficient  in  a  linear  regression  will  not  suffer  from  omitted  variable  bias  if  some 
relevant  variables  are  unavoidably  omitted  from  the  regression.  Under  certain  condi¬
tions,  discussed  at  greater  length  in  Chapters  3  and  25,  the  mean  difference  between 
the  outcomes  of  the  treated  and  the  control  groups  will  provide  an  estimate  of  ATE. 
The  payoff  to  the  well-designed  experiment  is  the  relative  simplicity  with  which  causal 
statements  can  be  made.  Of  course,  to  ensure  high  statistical  precision  for  the  treatment 
effect  estimate,  one  should  still  control  for  those  attributes  that  also  independently  in¬ 
fluence  the  outcomes. 
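
The role of random assignment can be illustrated by simulation. In the sketch below, which rests on assumed distributions and a hypothetical average effect of 3, both potential outcomes are generated for every unit, treatment is assigned by a fair coin, and only the outcome corresponding to the assigned state is kept as observed. The simple difference in group means is then compared with the average of the individual effects, which is known here only because the data are simulated.

```python
import random

rng = random.Random(1)
n = 20000

# Potential outcomes for every unit; in real data only one is ever observed.
y0 = [rng.gauss(10, 2) for _ in range(n)]
y1 = [v + 3.0 + rng.gauss(0, 1) for v in y0]  # heterogeneous effects, mean 3
true_ate = sum(a - b for a, b in zip(y1, y0)) / n

# Random assignment: D is independent of (y0, y1).
D = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
y = [a if d == 1 else b for a, b, d in zip(y1, y0, D)]  # observed outcome


def mean(v):
    return sum(v) / len(v)


treated = [yi for yi, d in zip(y, D) if d == 1]
control = [yi for yi, d in zip(y, D) if d == 0]
ate_hat = mean(treated) - mean(control)
```

Under this design ate_hat should differ from true_ate only by sampling error.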

Because  random  assignment  of  treatment  is  generally  not  feasible  in  economics, 
estimation  of  ATE-type  parameters  must  be  based  on  observational  data  generated 
under  nonrandom  treatment  assignment.  Then  the  consistent  estimation  of  ATE  will 
be  threatened  by  several  complications  that  include,  for  example,  possible  correlation 
between  the  outcomes  and  treatment,  omitted  variables,  and  endogeneity  of  the  treat¬ 
ment  variable.  Some  econometricians  have  suggested  that  the  absence  of  randomiza¬ 
tion  comprises  the  major  impediment  to  convincing  statistical  inference  about  causal 
relationships. 
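
A companion sketch shows what can go wrong without randomization. It assumes, purely for illustration, a constant treatment effect of 3 and a selection rule under which units with high untreated outcomes are more likely to take the treatment. The naive difference in means then equals the true effect plus a selection term, namely the gap in untreated outcomes between the two groups.

```python
import random

rng = random.Random(3)
n = 20000
effect = 3.0  # constant treatment effect (assumed)

y0 = [rng.gauss(10, 2) for _ in range(n)]
y1 = [v + effect for v in y0]

# Nonrandom assignment: high-y0 units are more likely to be treated.
D = [1 if v + rng.gauss(0, 1) > 10 else 0 for v in y0]
y = [a if d == 1 else b for a, b, d in zip(y1, y0, D)]


def mean(v):
    return sum(v) / len(v)


naive = (mean([yi for yi, d in zip(y, D) if d == 1])
         - mean([yi for yi, d in zip(y, D) if d == 0]))

# Selection term: difference in untreated outcomes across the two groups.
# It can be computed here only because y0 is simulated for every unit.
selection_bias = (mean([v for v, d in zip(y0, D) if d == 1])
                  - mean([v for v, d in zip(y0, D) if d == 0]))
```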

The  potential  outcome  model  can  lead  to  causal  statements  if  the  counterfactual  can 
be  clearly  stated  and  made  operational.  An  explicit  statement  of  the  counterfactual, 
with  a  clear  implication  of  what  should  be  compared,  is  an  important  feature  of  this 
model.  If,  as  may  be  the  case  with  observational  data,  there  is  lack  of  a  clear  distinc¬ 
tion  between  observed  and  counterfactual  quantities,  then  the  answer  to  the  question 
of  who  is  affected  by  the  treatment  remains  unclear.  ATE  is  a  measure  that  weights  and 
combines  marginal  responses  of  specific  subpopulations.  Specific  assumptions  are  re¬ 
quired  to  operationalize  the  counterfactual.  Information  on  both  treated  and  untreated 
units  that  can  be  observed  is  needed  to  estimate  ATE.  For  example,  it  is  necessary  to 
identify  the  untreated  group  that  proxies  the  treated  group  if  the  treatment  were  not 
applied.  It  is  not  necessarily  true  that  this  step  can  always  be  implemented.  The  exact 
way  in  which  the  treated  are  selected  involves  issues  of  sampling  design  that  are  also 
discussed  in  Chapters  3  and  25. 

A  second  useful  feature  of  the  POM  is  that  it  identifies  opportunities  for  causal 
modeling  created  by  natural  or  quasi-experiments.  When  data  are  generated  in  such 
settings,  and  provided  certain  other  conditions  are  satisfied,  causal  modeling  can  occur 
without  the  full  complexities  of  the  SEM  framework.  This  issue  is  analyzed  further  in 
Chapters  3  and  25. 

Third,  unlike  the  structural  form  of  the  SEM  where  all  variables  other  than  that  be¬ 
ing  explained  can  be  labeled  as  “causes,”  in  the  POM  not  all  explanatory  variables  can 
be  regarded  as  causal.  Many  are  simply  attributes  of  the  units  that  must  be  controlled 
for  in  regression  analysis,  and  attributes  are  not  causes  (Holland,  1986).  Causal  param¬ 
eters  must  relate  to  variables  that  are  actually  or  potentially,  and  directly  or  indirectly, 
subject  to  intervention. 

Finally,  identifiability  of  the  ATE  parameter  may  be  an  easier  research  goal  and 
hence  feasible  in  situations  where  the  identifiability  of  a  full  SEM  may  not  be  (Angrist, 
2001). Whether this is so has to be determined on a case-by-case basis. However,
many available applications of the POM typically employ a limited, rather than full,
information framework. Even within the SEM framework, though, the use of a limited
information framework is also feasible, as was previously discussed.


2.8.  Causal  Modeling  and  Estimation  Strategies

In  this  section  we  briefly  sketch  some  of  the  ways  in  which  econometricians  approach 
the  modeling  of  causal  relationships.  These  approaches  can  be  used  within  both  SEM 
and  POM  frameworks,  but  they  are  typically  identified  with  the  former. 


2.8.1.  Identification  Frameworks


Full-Information  Structural  Models

One  variant  of  this  approach  is  based  on  the  parametric  specification  of  the  joint  distri¬ 
bution  of  endogenous  variables  conditional  on  exogenous  variables.  The  relationships 
are  not  necessarily  derived  from  an  optimizing  model  of  behavior.  Parametric  restric¬ 
tions  are  placed  to  ensure  identification  of  the  model  parameters  that  are  the  target 
of  statistical  inference.  The  entire  model  is  estimated  simultaneously  using  maximum 
likelihood  or  moments-based  estimation.  We  call  this  approach  the  full-information 
structural  approach.  For  well-specified  models  this  is  an  attractive  approach  but  in 
general  its  potential  limitation  is  that  it  may  contain  some  equations  that  are  poorly 
specified.  Under  joint  estimation  the  effects  of  localized  misspecification  may  also 
affect  other  estimates. 

Statistically  we  may  interpret  the  full-information  approach  as  one  in  which  the 
joint  probability  distribution  of  endogenous  variables,  given  the  exogenous  variables, 
forms  the  basis  of  inference  about  causality.  The  jointness  may  derive  from  contem¬ 
poraneous  or  dynamic  interdependence  between  endogenous  variables  and/or  the  dis¬ 
turbances  on  the  equations. 


Limited-Information  Structural  Models

By  contrast,  when  the  central  object  of  statistical  inference  is  estimation  of  one  or  two 
key  parameters,  a  limited-information  approach  may  be  used.  A  feature  of  this  ap¬ 
proach  is  that,  although  one  equation  is  the  focus  of  inference,  the  joint  dependence 
between  it  and  other  endogenous  variables  is  exploited.  This  requires  that  explicit  as¬ 
sumptions  are  made  about  some  features  of  the  model  that  are  not  the  main  object  of 
inference.  Instrumental  variable  methods,  sequential  multistep  methods,  and  limited 
information  maximum  likelihood  methods  are  specific  examples  of  this  approach.  To 
implement  the  approach  one  typically  works  with  one  (or  more)  structural  equations 
and  some  implicitly  or  explicitly  stated  reduced  form  equations.  This  contrasts  with  the 
full-information  approach  where  all  equations  are  structural.  The  limited-information 
approach  is  often  computationally  more  tractable  than  the  full-information  one. 

Statistically we may interpret the limited-information approach as one in which the
joint distribution is factored into the product of a conditional model for the endogenous
variable(s) of interest, say $\mathbf{y}_1$, and a marginal model for other endogenous variables,
say $\mathbf{y}_2$, which are in the set of the conditioning variables, as in

$$f(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = g(\mathbf{y}_1 \mid \mathbf{x}, \mathbf{y}_2, \boldsymbol{\theta}_1)\, h(\mathbf{y}_2 \mid \mathbf{x}, \boldsymbol{\theta}_2), \quad \boldsymbol{\theta} \in \Theta.$$

Modeling may be based on the component $g(\mathbf{y}_1 \mid \mathbf{x}, \mathbf{y}_2, \boldsymbol{\theta}_1)$ with minimal attention to
$h(\mathbf{y}_2 \mid \mathbf{x}, \boldsymbol{\theta}_2)$ if $\boldsymbol{\theta}_2$ are regarded as nuisance parameters. Of course, such a factorization
is not unique, and hence the limited-information approach can have several variants.


Identified  Reduced  Forms 

A  third  variant  of  the  SEM  approach  works  with  an  identified  reduced  form.  Here  too 
one  is  interested  in  structural  parameters.  However,  it  may  be  convenient  to  estimate 
these  from  the  reduced  form  subject  to  restrictions.  In  time  series  the  identified  vector 
autoregressions  provide  an  example. 


2.8.2.  Identification  Strategies 

There  are  numerous  potential  ways  in  which  the  identification  of  key  model  parameters 
can  be  jeopardized.  Omitted  variables,  functional  form  misspecifications,  measure¬ 
ment  errors  in  explanatory  variables,  using  data  unrepresentative  of  the  population,  and 
ignoring  endogeneity  of  explanatory  variables  are  leading  examples.  Microeconomet¬ 
rics  contains  many  specific  examples  of  how  these  challenges  can  be  tackled.  Angrist 
and  Krueger  (2000)  provide  a  comprehensive  survey  of  popular  identification  strate¬ 
gies  in  labor  economics,  with  emphasis  on  the  POM  framework.  Most  of  the  issues  are 
developed  elsewhere  in  the  book,  but  a  brief  mention  is  made  here. 


Exogenization 

Data  are  sometimes  generated  by  natural  experiments  and  quasi-experiments.  The  idea 
here  is  simply  that  a  policy  variable  may  exogenously  change  for  some  subpopulation 
while  it  remains  the  same  for  other  subpopulations.  For  example,  minimum  wage  laws 
in  one  state  may  change  while  they  remain  unchanged  in  a  neighboring  state.  Such 
events  naturally  create  treatment  and  control  groups.  If  the  natural  experiment  ap¬ 
proximates  a  randomized  treatment  assignment,  then  exploiting  such  data  to  estimate 
structural  parameters  can  be  simpler  than  estimation  of  a  larger  simultaneous  equa¬ 
tions  model  with  endogenous  treatment  variables.  It  is  also  possible  that  the  treatment 
variable  in  a  natural  experiment  can  be  regarded  as  exogenous,  but  the  treatment  itself 
is  not  randomly  assigned. 


Elimination  of  Nuisance  Parameters 

Identification may be threatened by the presence of a large number of nuisance
parameters. For example, in a cross-section regression model the conditional mean function
$\mathrm{E}[y_i \mid \mathbf{x}_i]$ may involve an individual-specific fixed effect $\alpha_i$, assumed to be correlated
with the regression error. This effect cannot be identified without many observations
on each individual (i.e., panel data). However, with just a short panel it could be
eliminated by a transformation of the model. Another example is the presence of
time-invariant unobserved exogenous variables that may be common to groups of individuals.
An example of a transformation that eliminates fixed effects is taking differences and
working with the differences-in-differences form of the model.
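
The effect of differencing can be seen in a two-period panel. The sketch below is illustrative and uses assumed values throughout: a true slope of 1.5 and an individual effect alpha that also enters the regressor in both periods, so that pooled least squares is contaminated. Taking first differences removes alpha without ever estimating it.

```python
import random

rng = random.Random(2)
n, beta = 5000, 1.5  # individuals and true slope (assumed)

alpha = [rng.gauss(0, 1) for _ in range(n)]  # unobserved fixed effects
x1 = [a + rng.gauss(0, 1) for a in alpha]    # period-1 regressor
x2 = [a + rng.gauss(0, 1) for a in alpha]    # period-2 regressor
y1 = [a + beta * x + rng.gauss(0, 1) for a, x in zip(alpha, x1)]
y2 = [a + beta * x + rng.gauss(0, 1) for a, x in zip(alpha, x2)]


def slope(x, y):
    """Least-squares slope from a regression of y on x and an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Pooled regression ignores alpha and is biased because alpha enters x.
b_pooled = slope(x1 + x2, y1 + y2)

# First differences: alpha drops out of y2 - y1 and x2 - x1.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
b_fd = slope(dx, dy)
```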


Controlling  for  Confounders 

When  variables  are  omitted  from  a  regression,  and  when  omitted  factors  are  correlated 
with  the  included  variables,  a  confounding  bias  results.  For  example,  in  a  regression 
with  earnings  as  a  dependent  variable  and  schooling  as  an  explanatory  variable,  indi¬ 
vidual  ability  may  be  regarded  as  an  omitted  variable  because  only  imperfect  proxies 
for  it  are  typically  available.  This  means  that  potentially  the  coefficient  of  the  school¬ 
ing  variable  may  not  be  identified.  One  possible  strategy  is  to  introduce  control  vari¬ 
ables  in  the  model;  the  general  approach  is  called  the  control  function  approach. 
These  variables  are  an  attempt  to  approximate  the  influence  of  the  omitted  variables. 
For  example,  various  types  of  scholastic  achievement  scores  may  serve  as  controls  for 
ability. 
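
The schooling example can be mimicked in a small simulation. Everything in the sketch below is assumed for illustration, including the return to schooling of 0.08 and the noisy test score that proxies for ability. It compares the schooling coefficient from a regression that omits ability with the coefficient obtained when the score is added as a control. Because the proxy is imperfect, one would expect the bias to shrink rather than disappear.

```python
import random

rng = random.Random(4)
n = 20000

ability = [rng.gauss(0, 1) for _ in range(n)]            # unobserved
school = [12 + 2 * a + rng.gauss(0, 2) for a in ability]
score = [a + rng.gauss(0, 0.5) for a in ability]         # imperfect proxy
earn = [1 + 0.08 * s + 0.5 * a + rng.gauss(0, 0.5)
        for s, a in zip(school, ability)]


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


# Short regression: earnings on schooling only.
b_short = cov(school, earn) / cov(school, school)

# Schooling coefficient when the proxy is added as a control
# (closed-form least-squares solution for two regressors plus intercept).
v_s, v_q, c_sq = cov(school, school), cov(score, score), cov(school, score)
c_sy, c_qy = cov(school, earn), cov(score, earn)
b_ctrl = (c_sy * v_q - c_qy * c_sq) / (v_s * v_q - c_sq ** 2)
```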


Creating  Synthetic  Samples 

Within the POM framework a causal parameter may be unidentified because no
suitable comparison or control group can provide the benchmark for estimation. A
potential solution is to create a synthetic sample that includes a comparison group whose
members serve as proxies for controls. Such a sample is created by matching (discussed in Chapter 25).
If  treated  samples  can  be  augmented  by  well-matched  controls,  then  identification  of 
causal  parameters  can  be  achieved  in  the  sense  that  a  parameter  related  to  ATE  can  be 
estimated. 
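
One simple way to build such a comparison group is nearest-neighbor matching on an observed covariate. The sketch below is a hypothetical illustration, not the procedure of Chapter 25: it assumes a single covariate x that drives both the outcome and the probability of treatment, a constant effect of 2, and matching with replacement. Each treated unit is paired with the untreated unit whose x is closest, and the average of the paired differences estimates the effect of treatment on the treated.

```python
import random

rng = random.Random(5)
n = 2000
effect = 2.0  # constant treatment effect (assumed)

x = [rng.random() for _ in range(n)]
# Treatment is more likely for units with large x (nonrandom assignment).
D = [1 if rng.random() < 0.2 + 0.6 * xi else 0 for xi in x]
y = [5 * xi + effect * d + rng.gauss(0, 0.5) for xi, d in zip(x, D)]

treated = [(xi, yi) for xi, yi, d in zip(x, y, D) if d == 1]
controls = [(xi, yi) for xi, yi, d in zip(x, y, D) if d == 0]


def mean(v):
    return sum(v) / len(v)


# Naive comparison of group means, which ignores differences in x.
naive = mean([yi for _, yi in treated]) - mean([yi for _, yi in controls])

# Match each treated unit to the control with the closest x (with replacement).
gaps = []
for xt, yt in treated:
    xc, yc = min(controls, key=lambda c: abs(c[0] - xt))
    gaps.append(yt - yc)
att_hat = mean(gaps)
```

In this design the naive comparison overstates the effect because treated units have larger x on average, while the matched estimate should be close to 2.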


Instrumental  Variables 

If  identification  is  jeopardized  because  the  treatment  variable  is  endogenous,  then  a 
standard  solution  is  to  use  valid  instrumental  variables.  This  is  easier  said  than  done. 
The  choice  of  the  instrumental  variable  as  well  as  the  interpretation  of  the  results 
obtained  must  be  done  carefully  because  the  results  may  be  sensitive  to  the  choice  of 
instruments.  The  approach  is  analyzed  in  Sections  4.8,  4.9,  6.4,  6.5,  and  25.7,  as  well 
as  in  several  other  places  in  the  book  as  the  need  arises.  Again  a  natural  experiment 
may  provide  a  valid  instrument. 


Reweighting  Samples 

Sample-based  inferences  about  the  population  are  only  valid  if  the  sample  data  are 
representative  of  the  population.  The  problem  of  sample  selection  or  biased  sampling 
arises  when  the  sample  data  are  not  representative,  in  which  case  the  population  param¬ 
eters  are  not  identified.  This  problem  can  be  approached  as  one  that  requires  correction 
for  sample  selection  (Chapter  16)  or  one  that  requires  reweighting  of  the  sample  infor¬ 
mation  (Chapter  24). 
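
The reweighting idea can be sketched for a stratified sample. The example assumes a hypothetical population with two strata whose population shares are known, and a sample that deliberately oversamples the smaller stratum. Weighting each observation by the ratio of its stratum's population share to its sample share recovers the population mean, whereas the unweighted sample mean does not.

```python
import random

rng = random.Random(6)

# Hypothetical population: 90% in stratum A (mean 10), 10% in stratum B (mean 50).
pop_share = {"A": 0.9, "B": 0.1}
pop_mean = 0.9 * 10 + 0.1 * 50  # population mean of 14

# The sample draws 500 observations from each stratum, oversampling B.
sample = ([("A", rng.gauss(10, 2)) for _ in range(500)]
          + [("B", rng.gauss(50, 2)) for _ in range(500)])
sample_share = {"A": 0.5, "B": 0.5}

# Weight = population share / sample share for the observation's stratum.
weight = {s: pop_share[s] / sample_share[s] for s in pop_share}

unweighted = sum(v for _, v in sample) / len(sample)
weighted = (sum(weight[s] * v for s, v in sample)
            / sum(weight[s] for s, _ in sample))
```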


2.9.  Bibliographic  Notes

2.1  The  2001  Nobel  lectures  by  Heckman  and  McFadden  are  excellent  sources  for  both  his¬ 
torical  and  current  information  about  the  developments  in  microeconometrics.  Heckman's 
lecture  is  remarkable  for  its  comprehensive  scope  and  offers  numerous  insights  into  many 
aspects  of  microeconometrics.  His  discussion  of  heterogeneity  has  many  points  of  contact 
with  several  topics  covered  in  this  book. 

2.2  Marschak  (1953)  gives  a  classic  statement  of  the  primacy  of  structural  modeling  for  policy 
evaluation.  He  makes  an  early  mention  of  the  idea  of  parameter  invariance. 

2.3  Engle,  Hendry,  and  Richard  (1983)  provide  definitions  of  weak  and  strong  exogeneity  in 
terms  of  the  distribution  of  observable  variables.  They  make  links  with  previous  literature 
on  exogeneity  concepts. 

2.4  and  2.5  The  term  "identification"  was  used  by  Koopmans  (1949).  Point  identification  in 
linear  parametric  models  is  covered  in  most  textbooks  including  those  by  Sargan  (1988) 
who  gives  a  comprehensive  and  succinct  treatment,  Davidson  and  MacKinnon  (2004),  and
Greene  (2003).  Gourieroux  and  Monfort  (1989,  chapter  3.4)  provide  a  different  perspective 
using  Fisher  and  Kullback  information  measures.  Bounds  identification  in  several  leading 
cases  is  developed  in  Manski  (1995). 

2.6  Heckman  (2000)  provides  a  historical  overview  and  modern  interpretations  of  causality  in 
the  traditional  econometric  model.  Causality  concepts  within  the  POM  framework  are  care¬ 
fully  and  incisively  analyzed  by  Holland  (1986),  who  also  relates  them  to  other  definitions. 
A  sample  of  the  statisticians’  viewpoints  of  causality  from  a  historical  perspective  can  be 
found  in  Freedman  (1999).  Pearl  (2000)  gives  an  insightful  schematic  exposition  of  the  idea 
of  “treating  causation  as  a  summary  of  behavior  under  interventions,”  as  well  as  numerous 
problems  of  inferring  causality  in  a  nonexperimental  situation. 

2.7  Angrist  and  Krueger  (1999)  survey  solutions  to  identification  pitfalls  with  examples  from 
labor  economics. 
Microeconomic  Data  Structures 


3.1.  Introduction 

This  chapter  surveys  issues  concerning  the  potential  usefulness  and  limitations  of  dif¬ 
ferent  types  of  microeconomic  data.  By  far  the  most  common  data  structure  used  in 
microeconometrics  is  survey  or  census  data.  These  data  are  usually  called  observa¬ 
tional  data  to  distinguish  them  from  experimental  data. 

This chapter discusses the potential limitations of the aforementioned data structures.
The inherent limitations of observational data may be further compounded by
the  manner  in  which  the  data  are  collected,  that  is,  by  the  sample  frame  (the  way  the 
sample  is  generated),  sample  design  (simple  random  sample  versus  stratified  random 
sample),  and  sample  scope  (cross-section  versus  longitudinal  data).  Hence  we  also 
discuss  sampling  issues  in  connection  with  the  use  of  observational  data.  Some  of  this 
terminology  is  new  at  this  stage  but  will  be  explained  later  in  this  chapter. 

Microeconometrics  goes  beyond  the  analysis  of  survey  data  under  the  assumptions 
of  simple  random  sampling.  This  chapter  considers  extensions.  Section  3.2  outlines 
the  structure  of  multistage  sample  surveys  and  some  common  forms  of  departure  from 
random  sampling;  a  more  detailed  analysis  of  their  statistical  implications  is  provided 
in  later  chapters.  It  also  considers  some  commonly  occurring  complications  that  result 
in  the  data  not  being  necessarily  representative  of  the  population.  Given  the  deficien¬ 
cies  of  observational  data  in  estimating  causal  parameters,  there  has  been  an  increased 
attempt  at  exploiting  experimental  and  quasi-experimental  data  and  frameworks.  Sec¬ 
tion  3.3  examines  the  potential  of  data  from  social  experiments.  Section  3.4  considers 
the  modeling  opportunities  arising  from  a  special  type  of  observational  data,  generated 
under  quasi-experimental  conditions,  that  naturally  provide  treated  and  untreated  sub¬ 
jects  and  hence  are  called  natural  experiments.  Section  3.5  covers  practical  issues  of 
microdata  management. 
3.2.  Observational  Data 

The major sources of microeconomic observational data are surveys of households and
firms and government administrative data. Census data can also be used to generate samples.
Many  other  samples  are  often  generated  at  points  of  contact  between  transacting  par¬ 
ties.  For  example,  marketing  data  may  be  generated  at  the  point  of  sale  and/or  surveys 
among  (actual  or  potential)  purchasers.  The  Internet  (e.g.,  online  auctions)  is  also  a 
source  of  data. 

There  is  a  huge  literature  on  sample  surveys  from  the  viewpoint  of  both  survey 
statisticians  and  users  of  survey  data.  The  first  discusses  how  to  sample  from  the  pop¬ 
ulation  and  the  results  from  different  sampling  designs,  and  the  second  deals  with  the 
issues  of  estimation  and  inference  that  arise  when  survey  data  are  collected  using  dif¬ 
ferent  sampling  designs.  A  key  issue  is  how  well  the  sample  represents  the  population. 
This  chapter  deals  with  both  strands  of  the  literature  in  an  introductory  fashion.  Many 
additional  details  are  given  in  Chapter  24. 


3.2.1.  Nature  of  Survey  Data 

The  term  observational  data  usually  refers  to  survey  data  collected  by  sampling  the 
relevant  population  of  subjects  without  any  attempt  to  control  the  characteristics  of 
the sampled data. Let $t$ denote the time subscript and let $w$ denote a set of variables
of interest. In the present context $t$ can be a point in time or a time interval. Let
$S_t$ denote a sample from the population probability distribution $F(w_t|\theta_t)$; that is,
$S_t$ is a draw from $F(w_t|\theta_t)$, where $\theta_t$ is a parameter vector. The population should be thought
of  as  a  set  of  points  with  characteristics  of  interest,  and  for  simplicity  we  assume 
that  the  form  of  the  probability  distribution  F  is  known.  A  simple  random  sam¬ 
pling  scheme  allows  every  element  of  the  population  to  have  an  equal  probability  of 
being  included  in  the  sample.  More  complex  sampling  schemes  will  be  considered 
later. 

The  abstract  concept  of  a  stationary  population  provides  a  useful  benchmark.  If 
the  moments  of  the  characteristics  of  the  population  are  constant,  then  we  can  write 
$\theta_t = \theta$  for  all  t.  This  is  a  strong  assumption  because  it  implies  that  the  moments  of 
the  characteristics  of  the  population  are  time-invariant.  For  example,  the  age-sex  dis¬ 
tribution  should  be  constant.  More  realistically,  some  population  characteristics  would 
not  be  constant.  To  handle  such  a  possibility,  (the  parameters  of)  each  population  may 
be regarded as a draw from a superpopulation with constant characteristics. Specifically,
we think of each $\theta_t$ as a draw from a probability distribution with constant
(hyper)parameter $\theta$. The terms superpopulation and hyperparameters occur frequently
in the literature on hierarchical models discussed in Chapter 24. Additional complications
arise if $\theta_t$ has an evolutionary component, for example through dependence on
t, or if successive values are interdependent. Using hierarchical models, discussed in
Chapters 13 and 26, provides one approach for modeling the relation between hyperparameters
and subpopulation characteristics.
3.2.2.  Simple  Random  Samples 

As  a  benchmark  for  subsequent  discussion,  consider  simple  random  sampling  in  which 
the probability of sampling unit i from a population of size N, with N large, is 1/N for
all i. Partition w as [y : x]. Suppose our interest is in modeling y, a possibly vector-valued
outcome variable, conditional on the exogenous covariate vector x, whose joint
distribution is denoted $f_J(y, x)$. This can always be factored as the product of the
conditional distribution $f_C(y|x, \theta)$ and the marginal distribution $f_M(x)$:

$$f_J(y, x) = f_C(y|x, \theta)\, f_M(x). \qquad (3.1)$$

Simple  random  sampling  involves  drawing  the  (y,  x)  combinations  uniformly  from 
the  entire  population. 


3.2.3.  Multistage  Surveys 

One alternative is stratified multistage cluster sampling, also referred to as a complex
survey method. Large-scale surveys like the Current Population Survey (CPS)
and the Panel Study of Income Dynamics (PSID) take this approach. Section 24.2
provides additional detail on the structure of the CPS.

The  complex  survey  design  has  advantages.  It  is  more  cost  effective  because  it 
reduces  geographical  dispersion,  and  it  becomes  possible  to  sample  certain  subpop¬ 
ulations  more  intensively.  For  example,  “oversampling”  of  small  subpopulations  ex¬ 
hibiting  some  relevant  characteristic  becomes  feasible  whereas  a  random  sample  of  the 
population  would  produce  too  few  observations  to  support  reliable  results.  A  disadvan¬ 
tage  is  that  stratified  sampling  will  reduce  interindividual  variation,  which  is  essential 
for  greater  precision. 

The  sample  survey  literature  focuses  on  multistage  surveys  that  sequentially  parti¬ 
tion  the  population  into  the  following  categories: 

1.  Strata:  Nonoverlapping  subpopulations  that  exhaust  the  population. 

2.  Primary  sampling  units  (PSUs):  Nonoverlapping  subsets  of  the  strata. 

3.  Secondary  sampling  units  (SSUs):  Sub-units  of  the  PSU,  which  may  in  turn  be  parti¬ 
tioned,  and  so  on. 

4.  Ultimate  sampling  unit  (USU):  The  final  unit  chosen  for  interview,  which  could  be  a 
household  or  a  collection  of  households  (a  segment). 

As  an  example,  the  strata  may  be  the  various  states  or  provinces  in  a  country,  the 
PSU  may  be  regions  within  the  state  or  province,  and  the  USU  may  be  a  small  cluster 
of  households  in  the  same  neighborhood. 

Usually  all  strata  are  surveyed  so  that,  for  example,  all  states  will  be  included  in 
the  sample  with  certainty.  But  not  all  of  the  PSUs  and  their  subdivisions  are  surveyed, 
and  they  may  be  sampled  at  different  rates.  In  two-stage  sampling  the  surveyed  PSUs 
are  drawn  at  random  and  the  USU  is  then  drawn  at  random  from  the  selected  PSUs.  In 
multistage  sampling  intermediate  sampling  units  such  as  SSUs  also  appear. 
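
The two-stage scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strata, PSU names, household identifiers, and the numbers of units drawn are all hypothetical, and PSUs and households are drawn with equal probability at each stage. The sketch also records each household's inclusion probability, which is the product of the selection probabilities at the two stages.

```python
import random

def two_stage_sample(strata, n_psu, n_usu, rng):
    """Draw n_psu PSUs per stratum, then n_usu households per selected PSU.

    strata maps stratum name -> {PSU name -> list of household ids}.
    Returns (stratum, psu, household, inclusion_probability) tuples.
    """
    drawn = []
    for stratum, psus in strata.items():            # every stratum is surveyed
        for psu in rng.sample(sorted(psus), n_psu): # stage 1: PSUs at random
            households = psus[psu]
            # Pr[PSU selected] * Pr[household selected | PSU selected]
            p = (n_psu / len(psus)) * (n_usu / len(households))
            for h in rng.sample(households, n_usu): # stage 2: households at random
                drawn.append((stratum, psu, h, p))
    return drawn

# Hypothetical population: two states with different numbers and sizes of PSUs.
strata = {
    "state_A": {f"A{r}": [f"A{r}-h{i}" for i in range(50)] for r in range(10)},
    "state_B": {f"B{r}": [f"B{r}-h{i}" for i in range(200)] for r in range(4)},
}
sample = two_stage_sample(strata, n_psu=2, n_usu=5, rng=random.Random(5))
```

In this hypothetical design a household in state_A is sampled with probability (2/10)(5/50) = 0.02, whereas one in state_B is sampled with probability (2/4)(5/200) = 0.0125, so households in different strata have different chances of appearing in the sample.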
A  consequence  of  these  sampling  methods  is  that  different  households  will  have 
different  probabilities  of  being  sampled.  The  sample  is  then  unrepresentative  of  the 
population.  Many  surveys  provide  sampling  weights  that  are  intended  to  be  inversely 
proportional  to  the  probability  of  being  sampled,  in  which  case  these  weights  can  be 
used  to  obtain  unbiased  estimators  of  population  characteristics. 
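
A small simulation may help to show how such weights operate. It is a sketch under assumptions that are not in the text: a hypothetical population with a large low-income stratum and a small high-income stratum, made-up sampling rates that oversample the small stratum, and a weighted mean in ratio form (weights divided by their sum), which is approximately rather than exactly unbiased.

```python
import random

def weighted_mean(values, weights):
    """Weighted mean in ratio form: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

rng = random.Random(0)

# Hypothetical population: (stratum, income) pairs.
population = (
    [("low", rng.gauss(30.0, 5.0)) for _ in range(90_000)]
    + [("high", rng.gauss(100.0, 10.0)) for _ in range(10_000)]
)
true_mean = sum(y for _, y in population) / len(population)

# Assumed sampling rates: the small stratum is sampled ten times more intensively.
rate = {"low": 0.01, "high": 0.10}
sample = [(s, y) for s, y in population if rng.random() < rate[s]]

values = [y for _, y in sample]
weights = [1.0 / rate[s] for s, _ in sample]   # inverse of the sampling probability

unweighted = sum(values) / len(values)         # ignores the sample design
weighted = weighted_mean(values, weights)      # corrects for unequal sampling rates
```

Here the unweighted sample mean is pulled well above the population mean because high-income households are overrepresented, whereas the weighted mean is close to it.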

Survey  data  may  be  clustered  due  to,  for  example,  sampling  of  many  households 
in  the  same  small  neighborhood.  Observations  in  the  same  cluster  are  likely  to  be  de¬ 
pendent  or  correlated  because  they  may  depend  on  some  observable  or  unobservable 
factor  that  could  affect  all  observations  in  a  stratum.  For  example,  a  suburb  may  be 
dominated  by  high-income  households  or  by  households  that  are  relatively  homoge¬ 
neous  in  some  dimension  of  their  preferences.  Data  from  these  households  will  tend 
to  be  correlated,  at  least  unconditionally,  though  it  is  possible  that  such  correlation 
is  negligible  after  conditioning  on  observable  characteristics  of  the  households.  Sta¬ 
tistical  inference  ignoring  correlation  between  sampled  observations  yields  erroneous 
estimates  of  variances  that  are  smaller  than  those  from  the  correct  formula.  These  is¬ 
sues  are  covered  in  greater  depth  in  Section  24.5.  Two-stage  and  multistage  samples 
potentially  further  complicate  the  computation  of  standard  errors. 
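
The understatement of variances can be illustrated with a simulated clustered sample. The design below is an assumption made for illustration only: 200 equal-sized clusters of 20 households, with each outcome equal to a common cluster-level factor plus an independent household-level term. With equal cluster sizes the overall mean equals the mean of the cluster means, so a standard error that allows for within-cluster correlation can be computed from the cluster means alone.

```python
import math
import random
import statistics

rng = random.Random(1)
G, m = 200, 20   # assumed number of clusters and households per cluster

data = []
for _ in range(G):
    common = rng.gauss(0.0, 1.0)   # factor shared by all households in the cluster
    data.append([common + rng.gauss(0.0, 1.0) for _ in range(m)])

all_y = [y for cluster in data for y in cluster]

# Standard error of the sample mean that treats all observations as independent.
naive_se = statistics.stdev(all_y) / math.sqrt(len(all_y))

# Standard error based on the G independent cluster means.
cluster_means = [statistics.fmean(cluster) for cluster in data]
cluster_se = statistics.stdev(cluster_means) / math.sqrt(G)
```

In this design the within-cluster correlation is 0.5, and the naive standard error is several times smaller than the one computed from cluster means.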

In  summary,  (1)  stratification  with  different  sampling  rates  within  strata  means  that 
the  sample  is  unrepresentative  of  the  population;  (2)  sampling  weights  inversely  pro¬ 
portional  to  the  probability  of  being  sampled  can  be  used  to  obtain  unbiased  estimation 
of  population  characteristics;  and  (3)  clustering  may  lead  to  correlation  of  observations 
and  understatement  of  the  true  standard  errors  of  estimators  unless  appropriate  adjust¬ 
ments  are  made. 


3.2.4.  Biased  Samples 

If  a  random  sample  is  drawn  then  the  probability  distribution  for  the  data  is  the  same 
as  the  population  distribution.  Certain  departures  from  random  sampling  cause  a  di¬ 
vergence  between  the  two;  this  is  referred  to  as  biased  sampling.  The  data  distribution 
differs  from  the  population  distribution  in  a  manner  that  depends  on  the  nature  of  the 
deviation  from  random  sampling.  Deviation  from  random  sampling  occurs  because  it 
is  sometimes  more  convenient  or  cost  effective  to  obtain  the  data  from  a  subpopulation 
even  though  it  is  not  representative  of  the  entire  population.  We  now  consider  several 
examples  of  such  departures,  beginning  with  a  case  in  which  there  is  no  departure  from 
randomness. 


Exogenous  Sampling 

Exogenous  sampling  from  survey  data  occurs  if  the  analyst  segments  the  available 
sample  into  subsamples  based  only  on  a  set  of  exogenous  variables  x,  but  not  on  the 
response  variable.  For  example,  in  a  study  of  hospitalizations  in  Germany,  Geil  et  al. 
(1997)  segmented  the  data  into  two  categories,  those  with  and  without  chronic  condi¬ 
tions.  Classification  by  income  categories  is  also  common.  Perhaps  it  is  more  accurate 
to  depict  this  type  of  sampling  as  exogenous  subsampling  because  it  is  done  by  ref¬ 
erence  to  an  existing  sample  that  has  already  been  collected.  Segmenting  an  existing 
sample by gender, health, or socioeconomic status is very common. Under the assumptions
of exogenous sampling the probability distribution of the exogenous variables
is independent of y and contains no information about the population parameters of
interest, $\theta$. Therefore, one may ignore the marginal distribution of the exogenous variables
and simply base estimation on the conditional distribution $f(y|x, \theta)$. Of course,
the  assumption  may  be  wrong  and  the  observed  distribution  of  the  outcome  variable 
may  depend  on  the  selected  segmenting  variable,  which  may  be  correlated  with  the 
outcome,  thus  causing  departure  from  exogenous  sampling. 


Response-Based  Sampling 

Response-based  sampling  occurs  if  the  probability  of  an  individual  being  included 
in  the  sample  depends  on  the  responses  or  choices  made  by  that  individual.  In  this 
case  sample  selection  proceeds  according  to  rules  defined  in  terms  of  the  endogenous 
variable  under  study. 

Three  examples  are  as  follows:  (1)  In  a  study  of  the  effect  of  negative  income  tax  or 
Aid  to  Families  with  Dependent  Children  (AFDC)  on  labor  supply  only  those  below 
the  poverty  line  are  surveyed.  (2)  In  a  study  of  determinants  of  public  transport  modal 
choice,  only  users  of  public  transport  (a  subpopulation)  are  surveyed.  (3)  In  a  study  of 
the  determinants  of  number  of  visits  to  a  recreational  site,  only  those  with  at  least  one 
visit  are  included. 

Lower  survey  costs  provide  an  important  motivation  for  using  choice-based  samples 
in  preference  to  simple  random  samples.  It  would  require  a  very  large  random  sample 
to  generate  enough  observations  (information)  about  a  relatively  infrequent  outcome 
or  choice,  and  hence  it  is  cheaper  to  collect  a  sample  from  those  who  have  actually 
made  the  choice. 

The practical significance of this is that consistent estimation of population parameters
$\theta$ can no longer be carried out using the conditional population density $f(y|x)$
alone. The effect of the sampling scheme must also be taken into account. This topic
is discussed further in Section 24.4.
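
The third example above (only those with at least one visit are included) can be mimicked by simulation. The sketch assumes, purely for illustration, that visits are Poisson distributed with mean 1.5 and that sampling simply excludes zeros. For a Poisson variable the mean conditional on at least one visit is $\lambda/(1 - e^{-\lambda})$, which exceeds $\lambda$.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson(lam) variate by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(6)
lam = 1.5   # assumed population mean number of visits

visits = [poisson(rng, lam) for _ in range(100_000)]
on_site = [v for v in visits if v >= 1]          # response-based sample

mean_population = sum(visits) / len(visits)      # close to lam
mean_on_site = sum(on_site) / len(on_site)       # close to the truncated mean
truncated_mean = lam / (1.0 - math.exp(-lam))    # E[y | y >= 1] for the Poisson
```

The sample of visitors overstates the population mean number of visits unless the truncation is taken into account.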


Length-Biased  Sampling 

Length-biased  sampling  illustrates  how  biases  may  result  from  sampling  one  popu¬ 
lation  to  make  inferences  about  a  different  population.  Strictly  speaking,  it  is  not  so 
much  an  example  of  departure  from  randomness  in  sampling  as  one  of  sampling  the 
“wrong”  population. 

Econometric  studies  of  transitions  model  the  time  spent  in  origin  state  j  by  indi¬ 
vidual  i  before  transiting  to  another  destination  state  s.  An  example  is  when  j  cor¬ 
responds  to  unemployment  and  s  to  employment.  The  data  used  in  such  studies  can 
come  from  one  of  several  possible  sources.  One  source  is  sampling  individuals  who 
are  unemployed  on  a  particular  date,  another  is  to  sample  those  who  are  in  the  labor 
force  regardless  of  their  current  state,  and  a  third  is  to  sample  individuals  who  are  ei¬ 
ther  entering  or  leaving  unemployment  during  a  specified  period  of  time.  Each  type 
of  sampling  scheme  is  based  on  a  different  concept  of  the  relevant  population.  In  the 
first  case  the  relevant  population  is  the  stock  of  unemployed  individuals,  in  the  second 
the  labor  force,  and  in  the  third  individuals  with  transitioning  employment  status.  This 
topic  is  discussed  further  in  Section  18.6. 

Suppose  that  the  purpose  of  the  survey  is  to  calculate  a  measure  of  the  average 
duration  of  unemployment.  This  is  the  average  length  of  time  a  randomly  chosen  indi¬ 
vidual  will  spend  in  unemployment  if  he  or  she  becomes  unemployed.  The  answer  to 
this  apparently  straightforward  question  may  vary  depending  on  how  the  sample  data 
are  obtained.  The  flow  distribution  of  completed  durations  is  in  general  quite  differ¬ 
ent  from  the  stock  distribution.  When  we  sample  the  stock,  the  probability  of  being  in 
the  sample  is  higher  for  individuals  with  longer  durations.  When  we  sample  the  flow 
out  of  the  state,  the  probability  does  not  depend  on  the  time  spent  in  the  state.  This 
is  the  well-known  example  of  length-biased  sampling  in  which  the  estimate  obtained 
by  sampling  the  stock  is  a  biased  estimate  of  the  average  length  of  an  unemployment 
spell  of  a  random  entrant  to  unemployment. 

The  following  simple  schematic  diagram  may  clarify  the  point: 

                        •  •
    o  •      -->                  -->      o  o  •
                        •  o

    Entry flow          Stock               Exit flow

Here  we  use  the  symbol  •  to  denote  slow  movers  and  the  symbol  o  to  denote  fast 
movers.  Suppose  the  two  types  are  equally  represented  in  the  flow,  but  the  slow  movers 
stay  in  the  stock  longer  than  the  fast  movers.  Then  the  stock  population  has  a  higher 
proportion  of  slow  movers.  Finally,  the  exit  population  has  a  higher  proportion  of  fast 
movers.  The  argument  will  generalize  to  other  types  of  heterogeneity. 
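
The difference between the entry flow and the stock can be checked with a short simulation. The numbers are assumptions chosen for illustration: fast movers have spells of 2 months, slow movers have spells of 10 months, the two types enter equally often at dates spread uniformly over 100 months, and the stock is surveyed at month 50.

```python
import random

rng = random.Random(2)
DURATION = {"fast": 2.0, "slow": 10.0}   # assumed spell lengths, in months
SURVEY_DATE = 50.0

# Entry flow: (type, start date, completed duration) for each spell.
spells = []
for _ in range(100_000):
    kind = rng.choice(["fast", "slow"])
    start = rng.uniform(0.0, 100.0)
    spells.append((kind, start, DURATION[kind]))

# Stock sample: spells that are in progress on the survey date.
stock = [s for s in spells if s[1] <= SURVEY_DATE < s[1] + s[2]]

def share_slow(sample):
    """Proportion of slow movers in a sample of spells."""
    return sum(1 for kind, _, _ in sample if kind == "slow") / len(sample)

def mean_duration(sample):
    """Average completed spell length in a sample of spells."""
    return sum(d for _, _, d in sample) / len(sample)
```

Under these assumptions `share_slow(spells)` is about one-half, whereas `share_slow(stock)` is much larger, and `mean_duration(stock)` exceeds `mean_duration(spells)`. The stock sample therefore overstates the average spell length of a random entrant.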

The  point  of  this  example  is  not  that  flow  sampling  is  a  better  thing  to  do  than  stock 
sampling.  Rather,  it  is  that,  depending  on  what  the  question  is,  stock  sampling  may  not 
yield  a  random  sample  of  the  relevant  population. 


3.2.5.  Bias  due  to  Sample  Selection 

Consider  the  following  problem.  A  researcher  is  interested  in  measuring  the  effect  of 
training, denoted D (treatment), on posttraining wages, denoted y (outcome), given the
worker's characteristics, denoted x. The variable D takes the value 1 if the worker has
received training and is 0 otherwise. Observations are available on (x, D) for all workers
but on y only for those who received training (D = 1). One would like to make
inferences about the average impact of training on the posttraining wage of a randomly
chosen worker with known characteristics who is currently untrained (D = 0).
The  problem  of  sample  selection  concerns  the  difficulty  of  making  such  an  inference. 

Manski  (1995),  who  views  this  as  a  problem  of  identification,  defines  the  selection 
problem  formally  as  follows: 

This  is  the  problem  of  identifying  conditional  probability  distributions  from  random 
sample  data  in  which  the  realizations  of  the  conditioning  variables  are  always  ob¬ 
served  but  realizations  of  the  outcomes  are  censored. 
Suppose y is the outcome to be predicted, and the conditioning variables are denoted
by x. The variable D is a censoring indicator that takes the value 1 if the outcome y is
observed and 0 otherwise. Because the variables (D, x) are always observed, but y is
observed only when D = 1, Manski views this as a censored sampling process. The
censored sampling process does not identify $\Pr[y|x]$, as can be seen from

$$\Pr[y|x] = \Pr[y|x, D=1]\Pr[D=1|x] + \Pr[y|x, D=0]\Pr[D=0|x]. \qquad (3.2)$$

The sampling process can identify three of the four terms on the right-hand side,
but provides no information about the term $\Pr[y|x, D=0]$. Because

$$E[y|x] = E[y|x, D=1]\cdot\Pr[D=1|x] + E[y|x, D=0]\cdot\Pr[D=0|x],$$

whenever the censoring probability $\Pr[D=0|x]$ is positive, the available empirical
evidence places no restrictions on $E[y|x]$. Consequently, the censored-sampling process
can identify $\Pr[y|x]$ only up to the unknown value of $\Pr[y|x, D=0]$. To learn
anything about $E[y|x]$, restrictions will need to be placed on $\Pr[y|x]$.
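
One simple restriction of this kind is that the outcome is known to lie in a bounded interval. The sketch below, which is an illustration rather than a method taken from this section, applies the decomposition of $E[y|x]$ within a single cell of x: the unidentified term $E[y|x, D=0]$ is replaced by the lowest and highest values it could take, which gives worst-case bounds of the type associated with the bounds identification literature cited in Section 2.9. The function name and the data are hypothetical.

```python
def worst_case_bounds(y_observed, n_censored, y_lo, y_hi):
    """Bounds on E[y] within a cell of x when censored outcomes lie in [y_lo, y_hi].

    y_observed : outcomes for units with D = 1
    n_censored : number of units with D = 0 (outcome not observed)
    """
    n_obs = len(y_observed)
    p_obs = n_obs / (n_obs + n_censored)    # estimate of Pr[D = 1 | x]
    mean_obs = sum(y_observed) / n_obs      # estimate of E[y | x, D = 1]
    lower = mean_obs * p_obs + y_lo * (1.0 - p_obs)
    upper = mean_obs * p_obs + y_hi * (1.0 - p_obs)
    return lower, upper

# Hypothetical binary outcome: observed for four units, censored for four others.
lower, upper = worst_case_bounds([1, 1, 0, 1], n_censored=4, y_lo=0.0, y_hi=1.0)
# lower = 0.375, upper = 0.875
```

The width of the interval equals the censoring probability times the range of the outcome, so the bounds are informative only when censoring is not too severe.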

The  alternative  approaches  for  solving  this  problem  are  discussed  in  Section  16.5. 


3.2.6.  Quality  of  Survey  Data 

The  quality  of  sample  data  depends  not  only  on  the  sample  design  and  the  survey 
instrument  but  also  on  the  survey  responses.  This  observation  applies  especially  to 
observational  data.  We  consider  several  ways  in  which  the  quality  of  the  sample  data 
may  be  compromised.  Some  of  the  problems  (e.g.,  attrition)  can  also  occur  with  other 
types  of  data.  This  topic  overlaps  with  that  of  biased  sampling. 


Problem  of  Survey  Nonresponse 

Surveys  are  normally  voluntary,  and  incentive  to  participate  may  vary  systematically 
according  to  household  characteristics  and  type  of  question  asked.  Individuals  may 
refuse  to  answer  some  questions.  If  there  is  a  systematic  relationship  between  refusal 
to  answer  a  question  and  the  characteristics  of  the  individual,  then  the  issue  of  the 
representativeness  of  a  survey  after  allowing  for  nonresponse  arises.  If  nonresponse 
is  ignored,  and  if  the  analysis  is  carried  out  using  the  data  from  respondents  only,  how 
will  the  estimation  of  parameters  of  interest  be  affected? 

Survey  nonresponse  is  a  special  case  of  the  selection  problem  mentioned  in  the 
preceding  section.  Both  involve  biased  samples.  To  illustrate  how  it  leads  to  distorted 
inference  consider  the  following  model: 
where $y_1$ is a continuous random variable of interest (e.g., expenditure) that depends
on x, and $y_2$ is a latent variable that measures the “propensity to participate” in a survey
and depends on z. The individual participates if $y_2 > 0$; otherwise the individual does
not. The variables x and z are assumed to be exogenous. The formulation allows $y_1$
and $y_2$ to be correlated.

Suppose we estimate $\beta$ from the data supplied by participants by least squares.
Is this estimator unbiased in the presence of nonparticipation? The answer is that if
nonparticipation is random and independent of $y_1$, the variable of interest, then there
is no bias, but otherwise there will be.

The  argument  is  as  follows: 

$$\hat{\beta} = [X'X]^{-1}X'y_1,$$

$$E[\hat{\beta} - \beta] = E\left[[X'X]^{-1}X'\,E[y_1 - X\beta \mid X, Z, y_2 > 0]\right],$$

where the first line gives the least-squares formula for the estimates of $\beta$ and the second
line gives its bias. If $y_1$ and $y_2$ are independent, conditional on X and Z, so that $\sigma_{12} = 0$,
then

$$E[y_1 - X\beta \mid X, Z, y_2 > 0] = E[y_1 - X\beta \mid X, Z] = 0,$$


and  there  is  no  bias. 
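
The argument can be checked numerically. The simulation below is a sketch under assumptions that go beyond the text: a single regressor x with slope $\beta = 2$, participation determined by $y_2 = x + u_2 > 0$ (so that z coincides with x), and standard normal errors $u_1$ and $u_2$ with correlation rho. Least squares is applied to participants only.

```python
import math
import random

def ols_slope(x, y):
    """Slope from a least-squares regression of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def participant_slope(rho, n=100_000, beta=2.0, seed=3):
    """Estimate beta from units with y2 > 0 when corr(u1, u2) = rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u2 = rng.gauss(0.0, 1.0)
        # u1 has unit variance and correlation rho with u2
        u1 = rho * u2 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y1 = 1.0 + beta * x + u1   # outcome of interest
        y2 = x + u2                # latent propensity to participate
        if y2 > 0:                 # only participants supply data
            xs.append(x)
            ys.append(y1)
    return ols_slope(xs, ys)

slope_independent = participant_slope(rho=0.0)   # errors uncorrelated
slope_correlated = participant_slope(rho=0.8)    # errors positively correlated
```

With uncorrelated errors the participant-only estimate is close to 2, whereas with rho = 0.8 it falls noticeably below 2, because among participants the expected value of $u_1$ declines with x.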


Missing  and  Mismeasured  Data 

Survey respondents dealing with an extensive questionnaire will not necessarily answer
every question and even if they do, the answers may be deliberately or fortuitously
false. Suppose that the sample survey attempts to obtain a vector of responses
denoted as $x_i = (x_{i1}, \ldots, x_{iK})$ from N individuals, $i = 1, \ldots, N$. Suppose now that
if an individual fails to provide information on any one or more elements of $x_i$, then
the entire vector is discarded. The first problem resulting from missing data is that the
sample size is reduced. The second, potentially more serious, problem is that missing
data can lead to biases similar to selection bias. If the data are missing
in  a  systematic  manner,  then  the  sample  that  is  left  to  analyze  may  not  be  represen¬ 
tative  of  the  population.  A  form  of  selection  bias  may  be  induced  by  any  systematic 
pattern  of  nonresponse.  For  example,  high-income  respondents  may  systematically  not 
respond  to  questions  about  income.  Conversely,  if  the  data  are  missing  completely  at 
random  then  discarding  incomplete  observations  will  reduce  precision  but  not  gen¬ 
erate  biases.  Chapter  27  discusses  the  missing-data  problem  and  solutions  in  greater 
depth. 
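
The contrast between the two cases can be seen in a small simulation. Everything in it is assumed for illustration: incomes are drawn from a lognormal distribution, 30 percent of answers are lost in the completely-at-random case, and in the systematic case respondents with above-median income answer the income question with probability 0.4 only.

```python
import random
import statistics

rng = random.Random(4)

# Hypothetical incomes for the full sample.
incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(100_000)]
true_mean = statistics.fmean(incomes)
cutoff = statistics.median(incomes)

# Missing completely at random: each answer is lost with probability 0.3.
mcar = [y for y in incomes if rng.random() > 0.3]

# Systematic nonresponse: above-median incomes are reported with probability 0.4.
systematic = [y for y in incomes if y <= cutoff or rng.random() < 0.4]

mcar_mean = statistics.fmean(mcar)
systematic_mean = statistics.fmean(systematic)
```

Both subsamples are smaller than the full sample. The mean of the first remains close to the full-sample mean, whereas the mean of the second is clearly too low.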

Measurement  errors  in  survey  responses  are  a  pervasive  problem.  They  can  arise 
from  a  variety  of  causes,  including  incorrect  responses  arising  from  carelessness,  de¬ 
liberate  misreporting,  faulty  recall  of  past  events,  incorrect  interpretation  of  questions, 
and  data-processing  errors.  A  deeper  source  of  measurement  error  is  due  to  the  mea¬ 
sured  variable  being  at  best  an  imperfect  proxy  for  the  relevant  theoretical  concept. 
The  consequences  of  such  measurement  errors  are  a  major  topic  and  are  discussed  in 
Chapter  26. 
Sample  Attrition 

In  panel  data  situations  the  survey  involves  repeated  observations  on  a  set  of  individu¬ 
als.  In  this  case  we  can  have 

•  full  response  in  all  periods  (full  participation), 

•  nonresponse  in  the  first  period  and  in  all  subsequent  periods  (nonparticipation),  or 

•  partial  response  in  the  sense  of  response  in  the  initial  periods  but  nonresponse  in  later 
periods  (incomplete  participation)  -  a  situation  referred  to  as  sample  attrition. 

Sample  attrition  leads  to  missing  data,  and  the  presence  of  any  nonrandom  pattern 
of  “missingness”  will  lead  to  the  sample  selection  type  problems  already  mentioned. 
This  can  be  interpreted  as  a  special  case  of  the  sample  selection  problem.  Sample 
attrition  is  discussed  briefly  in  Sections  21.8.5  and  23.5.2. 


3.2.7.  Types  of  Observational  Data 

Cross-section data are obtained by observing $w_t$ for the sample $S_t$ for some t. Although
it is usually impractical to sample all households at the same point of time,
cross-section data are still a snapshot of characteristics of each element of a subset of
the population that will be used to make inferences about the population. If the population
is stationary, then inferences made about $\theta_t$ using $S_t$ may be valid also for
$t' \neq t$. If there is significant dependence between past and current behavior, then longitudinal
data are required to identify the relationship of interest. For example, past
decisions  may  affect  current  outcomes;  inertia  or  habit  persistence  may  account  for 
current  purchases,  but  such  dependence  cannot  be  modeled  if  the  history  of  purchases 
is  not  available.  This  is  one  of  the  limitations  imposed  by  cross-section  data. 

Repeated  cross-section  data  are  obtained  by  a  sequence  of  independent  samples 
$S_t$  taken  from  $F(w_t|\theta_t)$,  $t = 1, \ldots, T$.  Because  the  sample  design  does  not  attempt  to 
retain  the  same  units  in  the  sample,  information  about  dynamic  dependence  in  behavior 
is  lost.  If  the  population  is  stationary  then  repeated  cross-section  data  are  obtained  by 
a  sampling  process  somewhat  akin  to  sampling  with  replacement  from  the  constant 
population.  If  the  population  is  nonstationary,  repeated  cross  sections  are  related  in  a 
manner  that  depends  on  how  the  population  is  changing  over  time.  In  such  a  case  the 
objective  is  to  make  inferences  about  the  underlying  constant  (hyper)parameters.  The 
analysis  of  repeated  cross  sections  is  discussed  in  Section  22.7. 

Panel  or  longitudinal  data  are  obtained  by  initially  selecting  a  sample  S  and 
then collecting observations for a sequence of time periods, $t = 1, \ldots, T$. This can
be achieved by interviewing subjects and collecting both present and past data at the
same time, or by tracking the subjects once they have been inducted into the survey.
This produces a sequence of data vectors $\{w_1, \ldots, w_T\}$ that are used to make inferences
about either the behavior of the population or that of the particular sample of
individuals.  The  appropriate  methodology  in  each  case  may  not  be  the  same.  If  the 
data  are  drawn  from  a  nonstationary  population,  the  appropriate  objective  should  be 
inference  on  (hyper)parameters  of  the  superpopulation. 
Some  limitations  of  these  types  of  data  are  immediately  obvious.  Cross-section 
samples  and  repeated  cross-sections  do  not  in  general  provide  suitable  data  for  mod¬ 
eling  intertemporal  dependence  in  outcomes.  Such  data  are  only  suitable  for  modeling 
static  relationships.  In  contrast,  longitudinal  data,  especially  if  they  span  a  sufficiently 
long  time  period,  are  suitable  for  modeling  both  static  and  dynamic  relationships. 

Longitudinal  data  are  not  free  from  problems.  The  first  issue  is  representativeness  of 
the  panel.  Problems  of  inference  regarding  population  behavior  using  longitudinal  data 
become  more  difficult  if  the  population  is  not  stationary.  For  analyzing  dynamics  of  be¬ 
havior,  retaining  original  households  in  the  panel  for  as  long  as  possible  is  an  attractive 
option.  In  practice,  longitudinal  data  sets  suffer  from  the  problem  of  “sample  attrition,” 
perhaps  due  to  “sample  fatigue.”  This  simply  means  that  survey  respondents  do  not 
continue  to  provide  responses  to  questionnaires.  This  creates  two  problems:  (1)  The 
panel  becomes  unbalanced  and  (2)  there  is  the  danger  that  the  retained  households  may 
not  be  “typical”  and  that  the  sample  becomes  unrepresentative  of  the  population.  When 
the  available  sample  data  are  not  a  random  draw  from  the  population,  results  based  on 
different  types  of  data  will  be  susceptible  to  biases  to  different  degrees.  The  problem 
of  “sample  fatigue”  arises  because  over  time  it  becomes  more  difficult  to  retain  in¬ 
dividuals  within  the  panel  or  they  may  be  “lost”  (censored)  for  some  other  reason, 
such  as  a  change  of  location.  These  issues  are  dealt  with  later  in  the  book.  Analysis 
of  longitudinal  data  may  nevertheless  provide  information  about  some  aspects  of  the 
behavior  of  the  sampled  units,  although  extrapolation  to  population  behavior  may  not 
be  straightforward. 


3.3.  Data  from  Social  Experiments 

Observational  and  experimental  data  are  distinct  because  an  experimental  environment 
can  in  principle  be  closely  monitored  and  controlled.  This  makes  it  possible  to  vary 
a  causal  variable  of  interest,  holding  other  covariates  at  controlled  settings.  In  con¬ 
trast,  observational  data  are  generated  in  an  uncontrolled  environment,  leaving  open 
the  possibility  that  the  presence  of  confounding  factors  will  make  it  more  difficult  to 
identify  the  causal  relationship  of  interest.  For  example,  when  one  attempts  to  study 
the  earnings-schooling  relationship  using  observational  data,  one  must  accept  that  the 
years  of  schooling  of  an  individual  is  itself  an  outcome  of  an  individual’s  decision¬ 
making  process,  and  hence  one  cannot  regard  the  level  of  schooling  as  if  it  had  been 
set  by  a  hypothetical  experimenter. 

In  social  sciences,  data  analogous  to  experimental  data  come  from  either  social 
experiments,  defined  and  described  in  greater  detail  in  the  following,  or  from  “labo¬ 
ratory”  experiments  on  small  groups  of  voluntary  participants  that  mimic  the  behavior 
of  economic  agents  in  the  real-life  counterpart  of  the  experiment.  Social  experiments 
are  relatively  uncommon,  and  yet  experimental  concepts,  methods,  and  data  serve  as  a 
benchmark  for  evaluating  econometric  studies  based  on  observational  data. 

This  section  provides  a  brief  account  of  the  methodology  of  social  experiments,  the 
nature  of  the  data  emanating  from  them,  and  some  problems  and  issues  of  econometric 
methodology that they generate.

The central feature of the experimental methodology involves a comparison between
the outcomes of the randomly selected experimental group that is subjected to a
“treatment” with those of a control (comparison) group. In a good experiment considerable
care is exercised in matching the control and experimental (“treated”) groups,
and  in  avoiding  potential  biases  in  outcomes.  Such  conditions  may  not  be  realized 
in  observational  environments,  thereby  leading  to  a  possible  lack  of  identification  of 
causal  parameters  of  interest.  Sometimes,  however,  experimental  conditions  may  be 
approximately replicated in observational data. Consider, for example, two contiguous
regions or states, one of which pursues a different minimum-wage policy from the
other,  creating  the  conditions  of  a  natural  experiment  in  which  observations  from  the 
“treated”  state  can  be  compared  with  those  from  the  “control”  state.  The  data  structure 
of  a  natural  experiment  has  also  attracted  attention  in  econometrics. 

A  social  experiment  involves  exogenous  variations  in  the  economic  environment 
facing  the  set  of  experimental  subjects,  which  is  partitioned  into  one  subset  that  re¬ 
ceives  the  experimental  treatment  and  another  that  serves  as  a  control  group.  In  con¬ 
trast  to  observational  studies  in  which  changes  in  exogenous  and  endogenous  factors 
are  often  confounded,  a  well-designed  social  experiment  aims  to  isolate  the  role  of 
treatment  variables.  In  some  experimental  designs  there  may  be  no  explicit  control 
group,  but  varying  levels  of  the  treatment  are  applied,  in  which  case  it  becomes  pos¬ 
sible  in  principle  to  estimate  the  entire  response  surface  of  experimental  outcomes. 

The  primary  object  of  a  social  experiment  is  to  estimate  the  impact  of  an  actual 
or  potential  social  program.  The  potential  outcome  model  of  Section  2.7  provides  a 
relevant  background  for  modeling  the  impact  of  social  experiments.  Several  alternative 
measures  of  impact  have  been  proposed  and  these  will  be  discussed  in  the  chapter  on 
program  evaluation  (Chapter  25). 

Burtless  (1995)  summarizes  the  case  for  social  experiments,  while  noting  some 
potential  limitations.  In  a  companion  article  Heckman  and  Smith  (1995)  focus  on 
limitations  of  actual  social  experiments  that  have  been  implemented.  The  remaining 
discussion  in  this  section  borrows  significantly  from  these  papers. 


3.3.1.  Leading  Features  of  Social  Experiments 

Social  experiments  are  motivated  by  policy  issues  about  how  subjects  would  react  to  a 
type  of  policy  that  has  never  been  tried  and  hence  one  for  which  no  observed  response 
data  exist.  The  idea  of  a  social  experiment  is  to  enlist  a  group  of  willing  participants, 
some  of  whom  are  randomly  assigned  to  a  treatment  group  and  the  rest  to  a  control 
group.  The  difference  between  the  responses  of  those  in  the  treatment  group,  subjected 
to  the  policy  change,  and  those  in  the  control  group,  who  are  not,  is  the  estimated 
effect  of  the  policy.  Schematically  the  standard  experimental  design  is  as  depicted  in 
Figure  3.1. 

The  term  “experimentals”  refers  to  the  group  receiving  treatments,  “controls”  to  the 
group  not  receiving  treatment,  and  “random  assignment”  to  the  process  of  assigning 
individuals  to  the  two  groups. 

Figure 3.1: Social experiment with random assignment.

Randomized trials were introduced in statistics by R. A. Fisher (1928) and his
co-workers. A typical agricultural experiment would consist of a trial in which a new
treatment such as fertilizer application would be applied to plants growing on randomly
chosen blocks of land and then the responses would be compared with those
of a control group of plants, similar to the experimentals in all relevant respects but
not given experimental treatment. If the effect of all other differences between the
experimental and control groups can be eliminated, the estimated difference between the
two sets of responses can be attributed to the treatment. In the simplest situation one
can concentrate on a comparison of the mean outcome of the treated group and of the
untreated group.
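
The comparison of mean outcomes can be illustrated with a small simulation. The following Python sketch is not drawn from any actual experiment: the data-generating process, the function name difference_in_means, and all parameter values are assumptions chosen for illustration. An unobserved "ability" term affects the outcome. Under random assignment it is balanced across the two groups on average, so the difference in means should lie close to the assumed true effect, whereas under self-selection on ability the same comparison is confounded.

```python
import random
from statistics import mean

def difference_in_means(randomized, n=20000, true_effect=2.0, seed=0):
    """Return mean(treated) - mean(controls) for simulated data.

    Hypothetical data-generating process: the outcome depends on treatment
    and on an unobserved 'ability' term that the analyst cannot control for.
    """
    rng = random.Random(seed)
    treated, controls = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)               # unobserved confounder
        if randomized:
            d = rng.random() < 0.5                  # coin flip, unrelated to ability
        else:
            d = ability + rng.gauss(0.0, 1.0) > 0   # self-selection on ability
        y = 1.0 + true_effect * d + ability + rng.gauss(0.0, 1.0)
        (treated if d else controls).append(y)
    return mean(treated) - mean(controls)

experimental = difference_in_means(randomized=True)
observational = difference_in_means(randomized=False)
print(f"randomized assignment: {experimental:.2f}")
print(f"self-selected sample:  {observational:.2f}")
```

In this stylized setup the only difference between the two calls is the assignment rule, which is the sense in which randomization, rather than the estimator, removes the confounding.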

Although the methodology of randomized experiments has long been established in
the agricultural and biomedical sciences, it is new in economics and the social sciences.
It  is  attractive  for  studying  responses  to  policy  changes  for  which  no  observational 
data  exist,  perhaps  because  the  policy  changes  of  interest  have  never  occurred.  Ran¬ 
domized  experiments  also  permit  a  greater  variation  in  policy  variables  and  parameters 
than  are  present  in  observational  data,  thereby  making  it  easier  to  identify  and  study 
responses  to  policy  changes.  In  many  cases  the  social  experiment  may  try  out  a  pol¬ 
icy  that  has  never  been  tried,  so  the  observational  data  remain  completely  silent  on  its 
potential  impact. 

Social  experiments  are  still  rather  rare  outside  the  United  States,  partly  because 
they  are  expensive  to  run.  In  the  United  States  a  number  of  such  experiments  have 
taken  place  since  the  early  1970s.  Table  3.1  summarizes  features  of  some  relatively 
well-known  examples;  for  a  more  extensive  coverage  see  Burtless  (1995). 

An  experiment  may  produce  either  cross-section  or  longitudinal  data,  although  cost 
considerations  will  usually  limit  the  time  dimension  well  below  what  is  typical  in  ob¬ 
servational  data.  When  an  experiment  lasts  several  years  and  has  multiple  stages  and/or 
geographical  locations,  as  in  the  case  of  RHIE,  interim  analyses  based  on  “incomplete” 
data  are  not  uncommon  (Newhouse  et  al.,  1993). 

Table 3.1. Features of Some Selected Social Experiments

Experiment                     Tested Treatments                   Target Population
-----------------------------  ----------------------------------  ----------------------------------
Rand Health Insurance          Health insurance plans with         Low- and moderate-level income
Experiment (RHIE), 1974-1982   varying copayment rate and          persons and families
                               differing levels of maximum
                               out-of-pocket expenses

Negative Income Tax (NIT),     NIT plans with alternative          Low- and moderate-level income
1968-1978                      income guarantees and tax rates     persons and families with
                                                                   nonaged head of household

Job Training Partnership Act   Job search assistance,              Out-of-school youths and
(JTPA), 1986-1994              on-the-job training, classroom      disadvantaged adults
                               training financed under JTPA


3.3.2.  Advantages  of  Social  Experiments

Burtless (1995) surveys the advantages of social experiments with great clarity.
The key advantage stems from randomized trials that remove any correlation between
the observed and unobserved characteristics of program participants. Hence the
contribution of the treatment to the outcome difference between the treated and control
groups can be estimated without confounding bias even if one cannot control for the
confounding variables. The presence of correlation between treatment and confounding
variables often plagues observational studies and complicates causal inference. By
contrast, an experimental study conducted under ideal circumstances can produce a
consistent estimate of the average difference in outcomes of the treated and nontreated
groups without much computational complexity.

If,  however,  an  outcome  depends  on  treatment  as  well  as  other  observable  fac¬ 
tors,  then  controlling  for  the  latter  will  in  general  improve  the  precision  of  the  impact 
estimate. 
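
The precision gain from controlling for observable factors can be seen in a small Monte Carlo sketch. Everything here is an illustrative assumption (the linear outcome equation, the coefficient 3.0 on the covariate, the sample size, and the helper names slope, residuals, and one_experiment). Treatment is randomly assigned, so both estimators are centered on the assumed true effect; the regression-adjusted estimator, computed by partialling the covariate out of both the outcome and the treatment indicator, simply varies less across replications.

```python
import random
from statistics import mean, stdev

def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    b = slope(x, y)
    mx, my = mean(x), mean(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]

def one_experiment(rng, n=200, true_effect=2.0):
    """One simulated randomized experiment; returns both impact estimates."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]                  # observed covariate
    d = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]   # random assignment
    y = [1.0 + true_effect * di + 3.0 * xi + rng.gauss(0.0, 1.0)
         for di, xi in zip(d, x)]
    unadjusted = slope(d, y)   # equals the difference in group means
    # Partial x out of both y and d, then regress residual on residual;
    # this reproduces the coefficient on d in a regression of y on d and x.
    adjusted = slope(residuals(x, d), residuals(x, y))
    return unadjusted, adjusted

rng = random.Random(1)
draws = [one_experiment(rng) for _ in range(300)]
sd_unadjusted = stdev([u for u, _ in draws])
sd_adjusted = stdev([a for _, a in draws])
mean_unadjusted = mean([u for u, _ in draws])
mean_adjusted = mean([a for _, a in draws])
print(f"unadjusted: mean {mean_unadjusted:.2f}, std. dev. {sd_unadjusted:.3f}")
print(f"adjusted:   mean {mean_adjusted:.2f}, std. dev. {sd_adjusted:.3f}")
```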

Even  if  observational  data  are  available,  the  generation  and  use  of  experimental  data 
has  great  appeal  because  it  offers  the  possibility  of  exogenizing  a  policy  variable,  and 
randomization  of  treatments  can  potentially  lead  to  great  simplification  of  statistical 
analysis.  Conclusions  based  on  observational  data  often  lack  generality  because  they 
are  based  on  a  nonrandom  sample  from  the  population  -  the  problem  of  selection  bias. 
An  example  is  the  aforementioned  RHIE  study  whose  major  focus  is  on  the  price  re¬ 
sponsiveness  of  the  demand  for  health  services.  Availability  of  health  insurance  affects 
the  user  price  of  health  services  and  thereby  its  use.  An  important  policy  issue  is  the  ex¬ 
tent  to  which  “overutilization”  of  health  services  would  result  from  subsidized  health 
insurance.  One  can,  of  course,  use  observational  data  to  model  the  relation  between 
the  demand  for  health  services  and  the  level  of  insurance.  However,  such  analyses  are 
subject  to  the  criticism  that  the  level  of  health  insurance  should  not  be  treated  as  ex¬ 
ogenous.  Theoretical  analyses  show  that  the  demand  for  health  insurance  and  health 
care  are  jointly  determined,  so  causation  is  not  unidirectional.  This  fact  can  potentially 
make  it  difficult  to  identify  the  role  of  health  insurance.  Treating  health  insurance  as 
exogenous  biases  the  estimate  of  price  responsiveness.  However,  in  an  experimental 
setup  the  participating  households  could  be  assigned  an  insurance  policy,  making  it  an 
exogenous  variable.  The  role  of  insurance  is  then  identifiable.  Once  the  key  variable 
of interest is exogenized, the direction of causation becomes clear and the impact of
the treatment can be studied unambiguously. Furthermore, if the experiment is free
from  some  of  the  problems  that  we  mention  in  the  following,  this  greatly  simplifies 
statistical  analysis  relative  to  what  is  often  necessary  in  survey  data. 


3.3.3.  Limitations  of  Social  Experiments 

The application to human subjects of a methodology that was initially developed for
and applied to nonhuman subjects has generated a lively debate in the
literature.  See  especially  Heckman  and  Smith  (1995),  who  argue  that  many  social  ex¬ 
periments  may  suffer  from  limitations  that  apply  to  observational  studies.  These  is¬ 
sues  concern  general  points  such  as  the  merits  of  experimental  versus  observational 
methodology,  as  well  as  specific  issues  concerning  the  biases  and  problems  inherent 
in  the  use  of  human  subjects.  Several  of  the  issues  are  covered  in  more  detail  in  later 
chapters  but  a  brief  overview  follows. 

Social  experiments  are  very  costly  to  run.  Sometimes,  perhaps  often,  they  do  not 
correspond  to  “clean”  randomized  trials.  Hence  the  results  from  such  experiments  are 
not  always  unambiguous  and  easily  interpretable,  or  free  from  biases.  If  the  treatment 
variable  has  many  alternative  settings  of  interest,  or  if  extrapolation  is  an  important 
objective,  then  a  very  large  sample  must  be  collected  to  ensure  sufficient  data  variation 
and  to  precisely  gauge  the  effect  of  treatment  variation.  In  that  case  the  cost  of  the 
experiment  will  also  increase.  If  the  cost  factor  prevents  a  large  enough  experiment,  its 
utility  relative  to  observational  studies  may  be  questionable;  see  the  papers  by  Rosen 
and  Stafford  in  Hausman  and  Wise  (1985). 

Unfortunately  the  design  of  some  social  experiments  is  flawed.  Hausman  and  Wise 
(1985)  argue  that  the  data  from  the  New  Jersey  negative  income  tax  experiment  was 
subject  to  endogenous  stratification,  which  they  describe  as  follows: 

. . .  [T]he  reason  for  an  experiment  is,  by  randomization,  to  eliminate  correlation 
between  the  treatment  variable  and  other  determinants  of  the  response  variable  that 
is  under  study.  In  each  of  the  income-maintenance  experiments,  however,  the  exper¬ 
imental  sample  was  selected  in  part  on  the  basis  of  the  dependent  variable,  and  the 
assignment  to  treatment  versus  control  group  was  based  in  part  on  the  dependent 
variable  as  well.  In  general,  the  group  eligible  for  selection  -  based  on  family  status, 
race,  age  of  family  head,  etc.  -  was  stratified  on  the  basis  of  income  (and  other  vari¬ 
ables)  and  persons  were  selected  from  within  the  strata.  (Hausman  and  Wise,  1985, 
pp.  190-191) 

The  authors  conclude  that,  in  the  presence  of  endogenous  stratification,  unbiased  es¬ 
timation  of  treatment  effects  is  not  straightforward.  Unfortunately,  a  fully  randomized 
trial  in  which  treatment  assignment  within  a  randomly  selected  experimental  group 
from  the  population  is  independent  of  income  would  be  much  more  costly  and  may 
not  be  feasible. 

There  are  several  other  issues  that  detract  from  the  ideal  simplicity  of  a  random¬ 
ized  experiment.  First,  if  experimental  sites  are  selected  randomly,  cooperation  of 
administrators  and  potential  participants  at  that  site  would  be  required.  If  this  is  not 
forthcoming, then alternative treatment sites where such cooperation is obtainable
will be substituted, thereby compromising the random assignment principle; see Hotz
(1992). 

A  second  problem  is  that  of  sample  selection,  which  is  relevant  because  participa¬ 
tion  is  voluntary.  For  ethical  reasons  there  are  many  experiments  that  simply  cannot 
be  done  (e.g.,  random  assignment  of  students  to  years  of  education).  Unlike  medical 
experiments  that  can  achieve  the  gold  standard  of  a  double-blind  protocol,  in  social 
experiments experimenters and subjects know whether they are in treatment or control
groups. Furthermore, those in control groups may obtain treatment (e.g., training)
from alternative sources. If the decision to participate is uncorrelated with either the
regressors $x$ or the error term $\varepsilon$, the analysis of the experimental data is simplified.

A  third  problem  is  sample  attrition  caused  by  subjects  dropping  out  of  the  experi¬ 
ment  after  it  has  started.  Even  if  the  initial  sample  was  random  the  effect  of  nonran¬ 
dom  attrition  may  well  lead  to  a  problem  similar  to  the  attrition  bias  in  panels.  Finally, 
there is the problem of the Hawthorne effect. The term originates in social psychology
research  conducted  jointly  by  the  Harvard  Graduate  School  of  Business  Administra¬ 
tion  and  the  management  of  the  Western  Electric  Company  at  the  latter’s  Hawthorne 
works  in  Chicago  from  1926  to  1932.  Human  subjects,  unlike  inanimate  objects,  may 
change  or  adapt  their  behavior  while  participating  in  the  experiment.  In  this  case  the 
variation  in  the  response  observed  under  experimental  conditions  cannot  be  attributed 
solely  to  treatment. 

Heckman  and  Smith  (1995)  mention  several  other  difficulties  in  implementing  a 
randomized  treatment.  Because  the  administration  of  a  social  experiment  involves  a 
bureaucracy, there is a potential for biases. Randomization bias occurs if the assignment
introduces a systematic difference between participants in the experiment and
participants in the program during its normal operation. Heckman and Smith document the possibilities
of  such  bias  in  actual  experiments.  Another  type  of  bias,  called  substitution  bias,  is 
introduced  when  the  controls  may  be  receiving  some  form  of  treatment  that  substitutes 
for  the  experimental  treatment.  Finally,  analysis  of  social  experiments  is  inevitably  of 
a  partial  equilibrium  nature.  One  cannot  reliably  extrapolate  the  treatment  effects  to 
the  entire  population  because  the  ceteris  paribus  assumption  will  not  hold  when  the 
entire  population  is  involved. 

Specifically,  the  key  issue  is  whether  one  can  extrapolate  the  results  from  the  exper¬ 
iment  to  the  population  at  large.  If  the  experiment  is  conducted  as  a  pilot  program  on  a 
small  scale,  but  the  intention  is  to  predict  the  impact  of  policies  that  are  more  broadly 
applied,  then  the  obvious  limitation  is  that  the  pilot  program  cannot  incorporate  the 
broader  impact  of  the  treatment.  A  broadly  applied  treatment  may  change  the  eco¬ 
nomic  environment  sufficiently  to  invalidate  the  predictions  from  a  partial  equilibrium 
setup.  So  the  treatment  will  not  be  like  the  actual  policy  that  it  mimics. 

In summary, social experiments, in principle, could yield data that are easier to analyze
and to understand in terms of cause and effect than observational data. Whether
this promise is realized depends on the experimental design. A poor experimental
design generates its own statistical complications, which affect the precision of
the conclusions. Social experiments differ fundamentally from those in biology and
agriculture because human subjects and treatment administrators tend to be both
active and forward-looking individuals with personal preferences, rather than
passive administrators of a standard protocol or willing recipients of randomly assigned
treatment.

Table 3.2. Features of Some Selected Natural Experiments

Experiment                            Treatments Studied                   Reference
------------------------------------  -----------------------------------  ------------------------------
Outcomes for identical twins with     Differences in returns to schooling  Ashenfelter and Krueger (1994)
different schooling levels            through correlation between
                                      schooling and wages

Transition to National Health         Labor market effects of NHI based    Gruber and Hanratty (1995)
Insurance in Canada as Saskatchewan   on comparison of provinces with
moves to NHI and other provinces      and without NHI
follow several years later

New Jersey increases minimum wage     Minimum wage effects on employment   Card and Krueger (1994)
while neighboring Pennsylvania
does not


3.4.  Data  from  Natural  Experiments 


Sometimes,  however,  a  researcher  may  have  available  data  from  a  “natural  experi¬ 
ment.”  A  natural  experiment  occurs  when  a  subset  of  the  population  is  subjected  to 
an  exogenous  variation  in  a  variable,  perhaps  as  a  result  of  a  policy  shift,  that  would 
ordinarily  be  subject  to  endogenous  variation.  Ideally,  the  source  of  the  variation  is 
well  understood. 

In  microeconometrics  there  are  broadly  two  ways  in  which  the  idea  of  a  natural 
experiment  is  exploited.  For  concreteness  consider  the  simple  regression  model 

$$y = \beta_1 + \beta_2 x + u, \tag{3.4}$$

where $x$ is an endogenous treatment variable correlated with $u$.

Suppose that there is an exogenous intervention that changes $x$. Examples of such
external intervention are administrative rules, unanticipated legislation, natural events
such as twin births, weather-related shocks, and geographical variation; see Table 3.2
for examples. Exogenous intervention creates an opportunity for evaluating its impact
by comparing the behavior of the impacted group both pre- and postintervention,
or with that of a nonimpacted group postintervention. That is, “natural” comparison
groups are generated by the event that facilitates estimation of $\beta_2$. Estimation is
simplified because $x$ can be treated as exogenous.

The second way in which a natural experiment can assist inference is by generating
natural instrumental variables. Suppose $z$ is a variable that is correlated with $x$, or
perhaps causally related to $x$, and uncorrelated with $u$. Then an instrumental variable
estimator of $\beta_2$, expressed in terms of sample covariances, is

$$\hat{\beta}_2 = \frac{\mathrm{Cov}[z, y]}{\mathrm{Cov}[z, x]} \tag{3.5}$$

(see Section 4.8.5). In an observational data setup an instrumental variable with the
right  properties  may  be  difficult  to  find,  but  it  could  arise  naturally  in  a  favorable 
natural  experiment.  Then  estimation  would  be  simplified.  We  consider  the  first  case 
in  the  next  section;  the  topic  of  naturally  generated  instruments  will  be  covered  in 
Chapter  25. 
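
As a rough numerical illustration of the covariance-ratio estimator in (3.5), the following Python sketch simulates data in which the regressor is correlated with the error of Equation (3.4) but the instrument is not. The data-generating process, the coefficient values (0.8, 0.6, and the assumed true slope of 2.0), and the helper name cov are all hypothetical choices made for this example only.

```python
import random
from statistics import mean

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(42)
n = 50000
slope_true = 2.0                                # assumed true coefficient on x

z = [rng.gauss(0.0, 1.0) for _ in range(n)]     # instrument, independent of u
u = [rng.gauss(0.0, 1.0) for _ in range(n)]     # regression error
# x depends on z (relevance) and on u (endogeneity).
x = [0.8 * zi + 0.6 * ui + rng.gauss(0.0, 1.0) for zi, ui in zip(z, u)]
y = [1.0 + slope_true * xi + ui for xi, ui in zip(x, u)]

ols = cov(x, y) / cov(x, x)   # inconsistent here because Cov[x, u] != 0
iv = cov(z, y) / cov(z, x)    # ratio of sample covariances, as in (3.5)
print(f"OLS slope: {ols:.3f}")
print(f"IV slope:  {iv:.3f}")
```

In this design the OLS slope is pulled away from the assumed true value by the correlation between x and u, while the covariance ratio should recover it up to sampling error.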


3.4.1.  Natural  Exogenous  Interventions 

Data generated by natural interventions are less expensive to collect than experimental data, and they also allow the researcher to evaluate the
role  of  some  specific  factor  in  isolation,  as  in  a  controlled  experiment,  because  “nature” 
holds  constant  variations  attributed  to  other  factors  that  are  not  of  direct  interest.  Such 
natural  experiments  are  attractive  because  they  generate  treatment  and  control  groups 
inexpensively  and  in  a  real-world  setting.  Whether  a  natural  experiment  can  support 
convincing  inference  depends,  in  part,  on  whether  the  supposed  natural  intervention 
is  genuinely  exogenous,  whether  its  impact  is  sufficiently  large  to  be  measurable,  and 
whether  there  are  good  treatment  and  control  groups.  Just  because  a  change  is  legis¬ 
lated,  for  example,  does  not  mean  that  it  is  an  exogenous  intervention.  However,  in 
appropriate  cases,  opportunistic  exploitation  of  such  data  sets  can  yield  valuable  em¬ 
pirical  insights. 

Investigations  based  on  natural  experiments  have  several  potential  limitations 
whose  importance  in  any  given  study  can  only  be  assessed  through  a  careful  con¬ 
sideration  of  the  relevant  theory,  facts,  and  institutional  setting.  Following  Campbell 
(1969)  and  Meyer  (1995),  these  are  grouped  into  limitations  that  affect  a  study’s  inter¬ 
nal  validity  (i.e.,  the  inferences  about  policy  impact  drawn  from  the  study)  and  those 
that  affect  a  study’s  external  validity  (i.e.,  the  generalization  of  the  conclusions  to  other 
members  of  the  population). 

Consider  an  investigation  of  a  policy  change  in  which  conclusions  are  drawn  from 
a  comparison  of  pre-  and  postintervention  data,  using  the  regression  method  briefly 
described  in  the  following  and  in  greater  detail  in  Chapter  25.  In  any  study  there  will 
be  omitted  variables  that  may  have  also  changed  in  the  time  interval  between  policy 
change  and  its  impact.  The  characteristics  of  sampled  individuals  such  as  age,  health 
status,  and  their  actual  or  anticipated  economic  environment  may  also  change.  These 
omitted factors will directly affect the measured impact of the policy change. Whether
the results can be generalized to other members of the population will depend on the
absence of bias due to nonrandom sampling, the absence of significant interaction effects
between the policy change and its setting, and the absence of historical
factors that would cause the impact to vary from one situation to another. Of course,
these  considerations  are  not  unique  to  data  from  natural  experiments;  rather,  the  point 
is  that  the  latter  are  not  necessarily  free  from  these  problems. 

3.4.2.  Differences  in  Differences 

One  simple  regression  method  is  based  on  a  comparison  of  outcomes  in  one  group 
before  and  after  a  policy  intervention.  For  example,  consider 

$$y_{it} = \alpha + \beta D_t + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 0, 1,$$

where $D_t = 1$ in period 1 (postintervention), $D_t = 0$ in period 0 (preintervention), and
$y_{it}$ measures the outcome. The regression estimated from the pooled data will yield an
estimate of the policy impact parameter $\beta$. This is easily shown to be equal to the average
difference in the pre- and postintervention outcome,

$$\hat{\beta} = N^{-1} \sum_{i} (y_{i1} - y_{i0}) = \bar{y}_1 - \bar{y}_0.$$

The one-group before and after design makes the strong assumption that the group
remains comparable over time. This is required for identifiability of $\beta$. If, for example,
we allowed $\alpha$ to vary between the two periods, $\beta$ would no longer be identified.
Changes in $\alpha$ are confounded with the policy impact.

One  way  to  improve  on  the  previous  design  is  to  include  an  additional  untreated 
comparison  group,  that  is,  one  not  impacted  by  policy,  and  for  which  the  data  are  avail¬ 
able  in  both  periods.  Using  Meyer’s  (1995)  notation,  the  relevant  regression  now  is 

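$$y_{it}^{j} = \alpha + \alpha_1 D_t + \alpha^1 D^j + \beta D_t^j + \varepsilon_{it}^{j}, \quad i = 1, \ldots, N, \quad t = 0, 1,$$

where $j$ is the group superscript, $D^j = 1$ if $j$ equals 1 and $D^j = 0$ otherwise, $D_t^j = 1$
if both $j$ and $t$ equal 1 and $D_t^j = 0$ otherwise, and $\varepsilon$ is a zero-mean constant-variance
error term. The equation does not include covariates but they can be added, and those
that do not vary are already subsumed under $\alpha$. This relation implies that, for the
treated group, we have preintervention

$$y_{i0}^{1} = \alpha + \alpha^1 D^1 + \varepsilon_{i0}^{1}$$

and postintervention

$$y_{i1}^{1} = \alpha + \alpha_1 + \alpha^1 D^1 + \beta + \varepsilon_{i1}^{1}.$$

The impact is therefore

$$y_{i1}^{1} - y_{i0}^{1} = \alpha_1 + \beta + \varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}. \tag{3.6}$$

The corresponding equations for the untreated group are

$$y_{i0}^{0} = \alpha + \varepsilon_{i0}^{0},$$

$$y_{i1}^{0} = \alpha + \alpha_1 + \varepsilon_{i1}^{0},$$

and hence the difference is

$$y_{i1}^{0} - y_{i0}^{0} = \alpha_1 + \varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}. \tag{3.7}$$

Both the first-difference equations include the period-1 specific effect $\alpha_1$, which can
be eliminated by taking the difference between Equations (3.6) and (3.7):

$$(y_{i1}^{1} - y_{i0}^{1}) - (y_{i1}^{0} - y_{i0}^{0}) = \beta + (\varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}) - (\varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}). \tag{3.8}$$

Assuming that $E[(\varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}) - (\varepsilon_{i1}^{0} - \varepsilon_{i0}^{0})]$ equals zero, we can obtain an unbiased
estimate of $\beta$ by the sample average of $(y_{i1}^{1} - y_{i0}^{1}) - (y_{i1}^{0} - y_{i0}^{0})$. This method uses
differences in differences. If time-varying covariates are present, they can be
included in the relevant equations and their differences will appear in the regression
equation (3.8).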
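
A minimal simulation may help fix ideas. In the Python sketch below, the parameter values and the names alpha, period_effect, group_effect, impact, and outcomes are assumptions made purely for illustration. Outcomes are drawn independently for each group and period, and the estimators are computed from the four cell means. The one-group before and after comparison picks up the common period effect along with the policy impact, whereas differencing against the comparison group removes it.

```python
import random
from statistics import mean

rng = random.Random(7)
N = 10000                                  # observations per group and period
alpha, period_effect, group_effect, impact = 1.0, 0.5, 1.5, 2.0

def outcomes(group, period):
    """Simulated outcomes for one group (1 = treated, 0 = comparison) in one period."""
    treated_post = 1.0 if (group == 1 and period == 1) else 0.0
    return [alpha + period_effect * period + group_effect * group
            + impact * treated_post + rng.gauss(0.0, 1.0) for _ in range(N)]

# Keys are (group, period) pairs.
y = {(j, t): outcomes(j, t) for j in (0, 1) for t in (0, 1)}

before_after = mean(y[1, 1]) - mean(y[1, 0])        # treated group only
comparison_change = mean(y[0, 1]) - mean(y[0, 0])   # captures the period effect
did = before_after - comparison_change              # differences in differences

print(f"before-after (treated group): {before_after:.2f}")
print(f"change in comparison group:   {comparison_change:.2f}")
print(f"differences in differences:   {did:.2f}")
```

The group-specific level (group_effect here) drops out of each within-group difference, and the common period effect drops out of the second difference, which is the logic of the derivation above.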

For  simplicity  our  analysis  ignored  the  possibility  that  there  remain  observable  dif¬ 
ferences  in  the  distribution  of  characteristics  between  the  treatment  and  control  groups. 
If  so,  then  such  differences  must  be  controlled  for.  The  standard  solution  is  to  include 
such  controlling  variables  in  the  regression. 

An  example  of  a  study  based  on  a  natural  experiment  is  that  of  Ashenfelter  and 
Krueger  (1994).  They  estimate  the  returns  to  schooling  by  contrasting  the  wage  rates 
of  identical  twins  with  different  schooling  levels.  In  this  case  running  a  regular  exper¬ 
iment  in  which  individuals  are  exogenously  assigned  different  levels  of  schooling  is 
simply  not  feasible.  Nonetheless,  some  experimental-type  controls  are  needed.  As  the 
authors  explain: 

Our  goal  is  to  ensure  that  the  correlation  we  observe  between  schooling  and  wage 
rates  is  not  due  to  a  correlation  between  schooling  and  a  worker’s  ability  or  other 
characteristics.  We  do  this  by  taking  advantage  of  the  fact  that  monozygotic  twins 
are  genetically  identical  and  have  similar  family  backgrounds. 

Data  on  twins  have  served  as  a  basis  for  a  number  of  other  econometric  studies 
(Rosenzweig  and  Wolpin,  1980;  Bronars  and  Grogger,  1994).  Since  the  twinning  prob¬ 
ability  in  the  population  is  not  high,  an  important  issue  is  generating  a  sufficiently 
large  representative  sample,  allowing  for  some  nonresponse.  One  source  of  such  data 
is  the  census.  Another  source  is  the  “twins  festivals”  that  are  held  in  the  United  States. 
Ashenfelter  and  Krueger  (1994,  p.  1158)  report  that  their  data  were  obtained  from  in¬ 
terviews  conducted  at  the  16th  Annual  Twins  Day  Festival,  Twinsburg,  Ohio,  August 
1991,  which  is  the  largest  gathering  of  twins,  triplets,  and  quadruplets  in  the  world. 

The  attraction  of  using  the  twins  data  is  that  the  presence  of  common  effects  from 
both  observable  and  unobservable  factors  can  be  eliminated  by  modeling  the  differ¬ 
ences  between  the  outcomes  of  the  twins.  For  example,  Ashenfelter  and  Krueger  esti¬ 
mate  a  regression  model  of  the  difference  in  the  log  of  wage  rates  between  the  first  and 
the  second  twin.  The  first  differencing  operation  eliminates  the  effects  of  age,  gender, 
ethnicity,  and  so  forth.  The  remaining  explanatory  variables  are  differences  between 
schooling  levels,  which  is  the  variable  of  main  interest,  and  variables  such  as  differ¬ 
ences  in  years  of  tenure  and  marital  status. 
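
The differencing argument can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions and is not Ashenfelter and Krueger's specification: the data-generating process, the parameter values (a return to schooling of 0.08 and an ability effect of 0.3), and the helper `ols_slope` are all hypothetical. Each pair of twins shares an unobserved ability term that raises both schooling and wages, so OLS in levels overstates the return, whereas OLS on within-pair differences removes the shared term.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)      # fixed seed so the sketch is reproducible
n_pairs = 5000
true_return = 0.08          # assumed return to one more year of schooling

s1, s2, lw1, lw2 = [], [], [], []
for _ in range(n_pairs):
    ability = rng.gauss(0, 1)   # unobserved effect common to both twins
    for s_list, lw_list in ((s1, lw1), (s2, lw2)):
        s = 12 + 2 * ability + rng.gauss(0, 1.5)
        lw = true_return * s + 0.3 * ability + rng.gauss(0, 0.3)
        s_list.append(s)
        lw_list.append(lw)

# Levels regression pools all individuals and ignores the common effect.
slope_levels = ols_slope(s1 + s2, lw1 + lw2)

# Differenced regression: twin 1 minus twin 2 removes the common effect.
d_s = [a - b for a, b in zip(s1, s2)]
d_lw = [a - b for a, b in zip(lw1, lw2)]
slope_diff = ols_slope(d_s, d_lw)

print(f"levels OLS slope:      {slope_levels:.3f}")
print(f"differenced OLS slope: {slope_diff:.3f}")
```

In this design the differenced slope should lie close to the assumed return, while the levels slope is pushed upward by the omitted ability term.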


3.4.3.  Identification  through  Natural  Experiments 

The  natural  experiments  school  has  had  a  useful  impact  on  econometric  practice.  By 
encouraging  the  opportunistic  exploitation  of  quasi-experimental  data,  and  by  using 
modeling  frameworks  such  as  the  POM  of  Chapter  2,  econometric  practice  bridges  the 
gap  between  observational  and  experimental  data.  The  notions  of  parameter  identifica¬ 
tion  rooted  in  the  SEM  framework  are  broadened  to  include  identification  of  measures 
that  are  interesting  from  a  policy  viewpoint.  The  main  advantage  of  using  data  from  a 
natural  experiment  is  that  a  policy  variable  of  interest  might  be  validly  treated  as  ex¬ 
ogenous.  However,  in  using  data  from  natural  experiments,  as  in  the  case  of  social 
experiments,  the  choice  of  control  groups  plays  a  critical  role  in  determining  the
reliability  of  the  conclusions.  Several  potential  problems  that  affect  a  social  experi¬ 
ment,  such  as  selectivity  and  attrition  bias,  will  also  remain  potential  problems  in  the 
case  of  natural  experiments.  Only  a  subset  of  interesting  policy  problems  may  lend 
themselves  to  analysis  within  the  natural  experiment  framework.  The  experiment  may 
apply  only  to  a  small  part  of  the  population,  and  the  conditions  under  which  it  occurs 
may  not  replicate  themselves  easily.  An  example  given  in  Section  22.6  illustrates  this 
point  in  the  context  of  difference  in  differences. 


3.5.  Practical  Considerations 

Although  there  has  been  an  explosion  in  the  number  and  type  of  microdata  sets  that 
are  available,  certain  well-established  databases  have  supported  numerous  studies.  We 
provide a very partial list of some very well known U.S. micro databases. For further
details, see the respective Web sites for these data sets or the data clearinghouses
mentioned  in  the  following.  Many  of  these  allow  you  to  download  the  data  directly. 


3.5.1.  Some  Sources  of  Microdata 

Panel Study of Income Dynamics (PSID): Based at the Survey Research Center at
the  University  of  Michigan,  PSID  is  a  national  survey  that  has  been  running  since 
1968.  Today  it  covers  over  40,000  individuals  and  collects  economic  and  demo¬ 
graphic  data.  These  data  have  been  used  to  support  a  wide  variety  of  microecono- 
metric  analyses.  Brown,  Duncan  and  Stafford  (1996)  summarize  recent  develop¬ 
ments  in  PSID  data. 

Current  Population  Survey  (CPS):  This  is  a  monthly  national  survey  of  about  50,000 
households  that  provides  information  on  labor  force  characteristics.  The  survey  has 
been  conducted  for  more  than  50  years.  Major  revisions  in  the  sample  have  fol¬ 
lowed  each  of  the  decennial  censuses.  For  additional  details  about  this  survey  see 
Section  24.2.  It  is  the  basis  of  many  federal  government  statistics  on  earnings  and 
unemployment.  It  is  also  an  important  source  of  microdata  that  have  supported  nu¬ 
merous  studies  especially  of  labor  markets.  The  survey  was  redesigned  in  1994 
(Polivka,  1996). 

National  Longitudinal  Survey  (NLS):  The  NLS  has  four  original  cohorts:  NLS  Older 
Men,  NLS  Young  Men,  NLS  Mature  Women,  and  NLS  Young  Women.  Each  of 
the  original  cohorts  is  a  national  yearly  survey  of  over  5,000  individuals  who  have 
been  repeatedly  interviewed  since  the  mid-1960s.  Surveys  collect  information  on 
each  respondent’s  work  experiences,  education,  training,  family  income,  household 
composition,  marital  status,  and  health.  Supplementary  data  on  age,  sex,  etc.  are 
available. 

National Longitudinal Surveys of Youth (NLSY): The NLSY is a national annual
survey of 12,686 young men and young women who were 14 to 22 years of age
when they were first surveyed in 1979. It contains three subsamples. The data
provide a unique opportunity to study the life-course experiences of a large sample
of young adults who are representative of American men and women born in
the late 1950s and early 1960s. A second NLSY began in 1997.

Survey  of  Income  and  Program  Participation  (SIPP):  SIPP  is  a  longitudinal  survey 
of  around  8,000  housing  units  per  month.  It  covers  income  sources,  participation  in 
entitlement  programs,  correlation  between  these  items,  and  individual  attachments 
to  the  job  market  over  time.  It  is  a  multipanel  survey  with  a  new  panel  being  intro¬ 
duced  at  the  beginning  of  each  calendar  year.  The  first  panel  of  SIPP  was  initiated 
in  October  1983.  Compared  with  CPS,  SIPP  has  fewer  employed  and  more  unem¬ 
ployed  persons. 

Health  and  Retirement  Study  (HRS):  The  HRS  is  a  longitudinal  national  study. 
The  baseline  consists  of  interviews  with  members  of  7,600  households  in  1992 
(respondents  aged  from  51  to  61)  with  follow-ups  every  two  years  for  12  years.  The 
data  contain  a  wealth  of  economic,  demographic,  and  health  information. 

World  Bank’s  Living  Standards  Measurement  Study  (LSMS):  The  World  Bank’s 
LSMS  household  surveys  collect  data  “on  many  dimensions  of  household  well- 
being  that  can  be  used  to  assess  household  welfare,  understand  household  behavior, 
and  evaluate  the  effects  of  various  government  policies  on  the  living  conditions  of 
the  population”  in  many  developing  countries.  Many  examples  of  the  use  of  these 
data  can  be  found  in  Deaton  (1997)  and  in  the  economic  development  literature. 
Grosh  and  Glewwe  (1998)  outline  the  nature  of  the  data  and  provide  references  to 
research  studies  that  have  used  them. 

Data  clearinghouses:  The  Interuniversity  Consortium  for  Political  and  Social  Re¬ 
search  (ICPSR)  provides  access  to  many  data  sets,  including  the  PSID,  CPS,  NLS, 
SIPP,  National  Medical  Expenditure  Survey  (NMES),  and  many  others.  The  U.S. 
Bureau  of  Labor  Statistics  handles  the  CPS  and  NLS  surveys.  The  U.S.  Bureau  of 
Census  handles  the  SIPP.  The  U.S.  National  Center  for  Health  Statistics  provides 
access  to  many  health  data  sets.  A  useful  gateway  to  European  data  archives  is 
the  Council  of  European  Social  Science  Data  Archives  (CESSDA),  which  provides 
links  to  several  European  national  data  archives. 

Journal  data  archives:  For  some  purposes,  such  as  replication  of  published  results 
for  classroom  work,  you  can  get  the  data  from  journal  archives.  Two  archives  in 
particular  have  well-established  procedures  for  data  uploads  and  downloads  using 
an  Internet  browser.  The  Journal  of  Business  and  Economic  Statistics  archives  data 
used  in  most  but  not  all  articles  published  in  that  journal.  The  Journal  of  Applied 
Econometrics  data  archive  is  also  organized  along  similar  lines  and  contains  data 
pertaining  to  most  articles  published  since  1994. 


3.5.2.  Handling  Microdata 

Microeconomic  data  sets  tend  to  be  quite  large.  Samples  of  several  hundreds  or  thou¬ 
sands  are  common  and  even  those  of  tens  of  thousands  are  not  unusual.  The  distribu¬ 
tions  of  outcomes  of  interest  are  often  nonnormal,  in  part  because  one  is  often  dealing 
with  discrete  data  such  as  binary  outcomes,  or  with  data  that  have  limited  variation
such  as  proportions  or  shares,  or  with  truncated  or  censored  continuous  outcomes. 
Handling  large  nonnormal  data  sets  poses  some  problems  of  summarizing  and  report¬ 
ing  the  important  features  of  data.  Often  it  is  useful  to  use  one  computing  environment 
(program)  for  data  extraction,  reduction,  and  preparation  and  a  different  one  for  model 
estimation. 


3.5.3.  Data  Preparation 

The  most  basic  feature  of  microeconometric  analysis  is  that  the  process  of  arriving  at 
the  sample  finally  used  in  the  econometric  investigation  is  likely  to  be  a  long  one.  It 
is  important  to  accurately  document  decisions  and  choices  made  by  the  investigator  in 
the  process  of  “cleaning  up”  the  data.  Let  us  consider  some  specific  examples. 

One  of  the  most  common  features  of  sample  survey  data  is  nonresponse  or  par¬ 
tial  response.  The  problems  of  nonresponse  have  already  been  discussed.  Partial  res¬ 
ponse  usually  means  that  some  parts  of  survey  questionnaires  were  not  answered.  If 
this  means  that  some  of  the  required  information  is  not  available,  the  observations  in 
question  are  deleted.  This  is  called  listwise  deletion.  If  this  problem  occurs  in  a  sig¬ 
nificant  number  of  cases,  it  should  be  properly  analyzed  and  reported  because  it  could 
lead  to  an  unrepresentative  sample  and  biases  in  estimation.  The  issue  is  analyzed  in 
Chapter  27.  For  example,  consider  a  question  in  a  household  survey  to  which  high- 
income  households  do  not  respond,  leading  to  a  sample  in  which  these  households  are 
underrepresented.  Hence  the  end  effect  is  no  different  from  one  in  which  there  is  a  full 
response  but  the  sample  is  not  representative. 
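
A toy example makes the consequence of listwise deletion concrete. The sketch below uses entirely hypothetical household records in which the two highest-income households do not answer the income question. The field names, the values, and the `true_income` lookup (which an investigator would not observe) are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical survey records; None marks an unanswered income question.
households = [
    {"id": 1, "income": 20_000, "size": 3},
    {"id": 2, "income": 35_000, "size": 2},
    {"id": 3, "income": 50_000, "size": 4},
    {"id": 4, "income": None, "size": 2},   # high-income nonrespondent
    {"id": 5, "income": None, "size": 5},   # high-income nonrespondent
    {"id": 6, "income": 45_000, "size": 1},
]
# Unobservable in practice; included only to show the size of the bias.
true_income = {1: 20_000, 2: 35_000, 3: 50_000, 4: 150_000, 5: 200_000, 6: 45_000}

# Listwise deletion: keep a record only if every required item is present.
required = ("income", "size")
complete = [h for h in households if all(h[k] is not None for k in required)]
dropped_ids = [h["id"] for h in households if h not in complete]

mean_complete = mean(h["income"] for h in complete)
mean_true = mean(true_income.values())

print(f"kept {len(complete)} of {len(households)} records; dropped ids {dropped_ids}")
print(f"mean income, complete cases: {mean_complete:,.0f}")
print(f"mean income, all households: {mean_true:,.0f}")
```

Recording how many observations are dropped, and which ones, is the kind of documentation recommended at the start of this section.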

A  second  problem  is  measurement  error  in  reported  data.  Microeconomic  data  are 
typically noisy. The extent, type, and seriousness of measurement error depend on the
type of survey (cross section or panel), the individual who responds to the survey, and
the  variable  about  which  information  is  sought.  For  example,  self-reported  income  data 
from  panel  surveys  are  strongly  suspected  to  have  serially  correlated  measurement  er¬ 
ror.  In  contrast,  reported  expenditure  magnitudes  are  usually  thought  to  have  a  smaller 
measurement  error.  Deaton  (1997)  surveys  some  of  the  sources  of  measurement  er¬ 
ror  with  special  reference  to  the  World  Bank’s  Living  Standards  Measurement  Survey, 
although  several  of  the  issues  raised  have  wider  relevance.  The  biases  from  measure¬ 
ment  error  depend  on  what  is  done  to  the  data  in  terms  of  transformations  (e.g.,  first 
differencing)  and  the  estimator  used.  Hence  to  make  informative  statements  about  the 
seriousness  of  biases  from  measurement  error,  one  must  analyze  well-defined  mod¬ 
els.  Later  chapters  will  give  examples  of  the  impact  of  measurement  error  in  specific 
contexts. 
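
To see how a transformation can change the bias, consider the following simulation sketch. It rests on strong simplifying assumptions that are ours and not taken from any particular survey: a two-period panel, a true regressor that is highly persistent over time, and classical measurement error that is serially uncorrelated (unlike the serially correlated errors suspected in reported income). All parameter values and the helper `ols_slope` are hypothetical.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)   # fixed seed for reproducibility
n = 20000
beta = 1.0               # assumed true slope

x1, x2, y1, y2 = [], [], [], []
for _ in range(n):
    xs1 = rng.gauss(0, 1)                 # true regressor, period 1
    xs2 = xs1 + rng.gauss(0, 0.5)         # persistent true regressor, period 2
    y1.append(beta * xs1 + rng.gauss(0, 0.3))
    y2.append(beta * xs2 + rng.gauss(0, 0.3))
    x1.append(xs1 + rng.gauss(0, 0.5))    # observed with measurement error
    x2.append(xs2 + rng.gauss(0, 0.5))

slope_levels = ols_slope(x1, y1)
slope_fd = ols_slope([b - a for a, b in zip(x1, x2)],
                     [b - a for a, b in zip(y1, y2)])

print(f"OLS slope in levels:            {slope_levels:.3f}")
print(f"OLS slope in first differences: {slope_fd:.3f}")
```

Under these assumptions differencing removes much of the signal in the regressor but none of the noise, so the first-differences slope is attenuated more strongly than the levels slope. With other error structures the ranking can differ, which is why a well-defined model is needed before making statements about the bias.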


3.5.4.  Checking  Data 

In  large  data  sets  it  is  easy  to  have  erroneous  data  resulting  from  keyboard  and  cod¬ 
ing  errors.  One  should  therefore  apply  some  elementary  checks  that  would  reveal  the 
existence  of  problems.  One  can  check  the  data  before  analyzing  it  by  examining  some 
descriptive  statistics.  The  following  techniques  are  useful.  First,  use  summary  statistics
(min,  max,  mean,  and  median)  to  make  sure  that  the  data  are  in  the  proper  interval  and 
on  the  proper  scale.  For  instance,  categorical  variables  should  be  between  zero  and 
one,  counts  should  be  greater  than  or  equal  to  zero.  Sometimes  missing  data  are  coded 
as -999, or some other integer, so take care not to treat these entries as data. Second,
one  should  know  whether  changes  are  fractional  or  on  a  percentage  scale.  Third,  use 
box  and  whisker  plots  to  identify  problematic  observations.  For  instance,  using  box 
and  whisker  plots  one  researcher  found  a  country  that  had  negative  population  growth 
(owing  to  a  war)  and  another  country  that  had  recorded  investment  as  more  than  GDP 
(because  foreign  aid  had  been  excluded  from  the  GDP  calculation).  Checking  observa¬ 
tions  before  proceeding  with  estimation  may  also  suggest  normalizing  transformations 
and/or  distributional  assumptions  with  features  appropriate  for  modeling  a  particular 
data set. Fourth, screening data may suggest appropriate data transforms. For example,
box  and  whisker  plots  and  histograms  could  suggest  which  variables  might  be  better 
modeled  via  a  log  or  power  transform.  Finally,  it  may  be  important  to  check  the  scales 
of  measurement.  For  some  purposes,  such  as  the  use  of  nonlinear  estimators,  it  may 
be  desirable  to  scale  variables  so  that  they  have  roughly  similar  scale.  Summary  statis¬ 
tics  can  be  used  to  check  that  the  means,  variances,  and  covariances  of  the  variables 
indicate  proper  scaling. 
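
The checks listed above can be automated. The sketch below is one possible arrangement, with hypothetical variable names and values: it removes entries coded with an assumed missing-value sentinel of -999, flags values outside a stated range, and applies the usual box-and-whisker rule (points more than 1.5 interquartile ranges beyond the quartiles) to detect a growth rate that appears to have been entered in percent rather than as a fraction.

```python
from statistics import quantiles

MISSING_CODES = {-999}   # assumed sentinel value; actual surveys differ

# Hypothetical raw columns.
data = {
    "employed": [0, 1, 1, 0, -999, 1, 2],    # binary indicator, should be 0 or 1
    "visits": [0, 3, 1, -1, 2, 0, 4],        # count, should be >= 0
    "growth": [0.02, 0.03, 0.025, 2.5, 0.01, 0.015, 0.03],   # fractional scale
}

def clean(values):
    """Drop sentinel-coded missing entries before computing any statistic."""
    return [v for v in values if v not in MISSING_CODES]

def out_of_range(values, lo=None, hi=None):
    """Return the values that fall below lo or above hi."""
    return [v for v in values
            if (lo is not None and v < lo) or (hi is not None and v > hi)]

def whisker_outliers(values, k=1.5):
    """Return points more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

employed = clean(data["employed"])
bad_employed = out_of_range(employed, lo=0, hi=1)
bad_visits = out_of_range(clean(data["visits"]), lo=0)
suspect_growth = whisker_outliers(clean(data["growth"]))

print("employed outside [0, 1]:", bad_employed)
print("negative counts:", bad_visits)
print("growth values beyond the whiskers:", suspect_growth)
```

Flagged values should be inspected rather than dropped automatically, since, as the examples in the text show, an extreme value may be genuine.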


3.5.5.  Presenting  Descriptive  Statistics 

Because  microdata  sets  are  usually  large,  it  is  essential  to  provide  the  reader  with  an 
initial  table  of  descriptive  statistics,  usually  mean,  standard  deviation,  minimum,  and 
maximum  for  every  variable.  In  some  cases  unexpectedly  large  or  small  values  may 
reveal  the  presence  of  a  gross  recording  error  or  erroneous  inclusion  of  an  incorrect 
data  point.  Two-way  scatter  diagrams  are  usually  not  helpful,  but  tabulation  of  cate¬ 
gorical  variables  (contingency  tables)  can  be.  For  discrete  variables  histograms  can  be 
useful  and  for  continuous  variables  density  plots  can  be  informative. 
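
A minimal sketch of such a table follows. The sample values, variable names, and column layout are hypothetical. The function `describe` returns the four statistics recommended above, and a frequency count stands in for the tabulation of a categorical variable.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample: two numeric variables and one categorical variable.
sample = {
    "lnwage": [2.1, 2.5, 3.0, 2.8, 2.6],
    "schooling": [12, 16, 18, 12, 14],
}
region = ["north", "south", "south", "west", "north"]

def describe(values):
    """Mean, sample standard deviation, minimum, and maximum of one variable."""
    return {"mean": mean(values), "sd": stdev(values),
            "min": min(values), "max": max(values)}

table = {name: describe(values) for name, values in sample.items()}

print(f"{'Variable':<10}{'Mean':>8}{'Std. Dev.':>11}{'Min':>8}{'Max':>8}")
for name, row in table.items():
    print(f"{name:<10}{row['mean']:>8.2f}{row['sd']:>11.2f}"
          f"{row['min']:>8.2f}{row['max']:>8.2f}")

# Frequency tabulation for the categorical variable.
region_counts = Counter(region)
for level, count in sorted(region_counts.items()):
    print(f"{level:<10}{count:>8d}")
```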

3.6.  Bibliographic  Notes 

3.2  Deaton  (1997)  provides  an  introduction  to  sample  surveys  especially  for  developing 
economies.  Several  specific  references  to  complex  surveys  are  provided  in  Chapter  24. 
Becketti  et  al.  (1988)  investigate  the  importance  of  the  issue  of  representativeness  of  the 
PSID. 

3.3  The  collective  volume  edited  by  Hausman  and  Wise  (1985)  contains  several  papers  on  indi¬ 
vidual  social  experiments  including  the  RHIE,  NIT,  and  Time-of-Use  pricing  experiments. 
Several  studies  question  the  usefulness  of  the  experimental  data  and  there  is  extensive  dis¬ 
cussion  of  the  flaws  in  experimental  designs  that  preclude  clear  conclusions.  Pros  and  cons 
of  social  experiments  versus  observational  data  are  discussed  in  an  excellent  pair  of  papers 
by  Burtless  (1995)  and  Heckman  and  Smith  (1995). 

3.4  A  special  issue  of  the  Journal  of  Business  and  Economic  Statistics  (1995)  carries  a  number 
of  articles  that  use  the  methodology  of  quasi-  or  natural  experiments.  The  collection  in¬ 
cludes  an  article  by  Meyer  who  surveys  the  issues  in  and  the  methodology  of  econometric 
studies  that  use  data  from  natural  experiments.  He  also  provides  a  valuable  set  of  guidelines
on  the  credible  use  of  natural  variation  in  making  inferences  about  the  impact  of  economic 
policies,  partly  based  on  the  work  of  Campbell  (1969).  Kim  and  Singal  (1993)  study  the 
impact of changes in market concentration on price using the data generated by airline
mergers.  Rosenzweig  and  Wolpin  (2000)  review  an  extensive  literature  based  on  natural 
experiments  such  as  identical  twins.  Isacsson  (1999)  uses  the  twins  approach  to  study  re¬ 
turns  to  schooling  using  Swedish  data.  Angrist  and  Lavy  (1999)  study  the  impact  of  class 
size  on  test  scores  using  data  from  schools  that  are  subject  to  “Maimonides’  Rule”  (briefly 
reviewed  in  Section  25.6),  which  states  that  class  size  should  not  exceed  40.  The  rule  gen¬ 
erates  an  instrument. 
Part  2  presents  the  core  estimation  methods  -  least  squares,  maximum  likelihood  and
method  of  moments  -  and  associated  methods  of  inference  for  nonlinear  regression 
models  that  are  central  in  microeconometrics.  The  material  also  includes  modern  top¬ 
ics  such  as  quantile  regression,  sequential  estimation,  empirical  likelihood,  semipara- 
metric  and  nonparametric  regression,  and  statistical  inference  based  on  the  bootstrap. 
In  general  the  discussion  is  at  a  level  intended  to  provide  enough  background  and 
detail  to  enable  the  practitioner  to  read  and  comprehend  articles  in  the  leading  econo¬ 
metrics  journals  and,  where  needed,  subsequent  chapters  of  this  book.  We  presume 
prior  familiarity  with  linear  regression  analysis. 

The  essential  estimation  theory  is  presented  in  three  chapters.  Chapter  4  begins  with 
the  linear  regression  model.  It  then  covers  at  an  introductory  level  quantile  regression, 
which  models  distributional  features  other  than  the  conditional  mean.  It  provides  a 
lengthy  expository  treatment  of  instrumental  variables  estimation,  a  major  method  of 
causal  inference.  Chapter  5  presents  the  most  commonly-used  estimation  methods  for 
nonlinear  models,  beginning  with  the  topic  of  m-estimation,  before  specialization  to 
maximum  likelihood  and  nonlinear  least  squares  regression.  Chapter  6  provides  a  com¬ 
prehensive  treatment  of  generalized  method  of  moments,  which  is  a  quite  general  esti¬ 
mation  framework  that  is  applicable  for  linear  and  nonlinear  models  in  single-equation 
and  multi-equation  settings.  The  chapter  emphasizes  the  special  case  of  instrumental 
variables  estimation. 

We  then  turn  to  model  testing.  Chapter  7  covers  both  the  classical  and  bootstrap 
approaches  to  hypothesis  testing,  while  Chapter  8  presents  relatively  more  modern 
methods  of  model  selection  and  specification  analysis.  Because  of  their  importance 
the computationally-intensive bootstrap methods are also the subject of a more detailed
chapter, Chapter 11 in Part 3. A distinctive feature of this book is that, as much
as  possible,  testing  procedures  are  presented  in  a  unified  manner  in  just  these  three 
chapters.  The  procedures  are  then  illustrated  in  specific  applications  throughout  the 
book. 

Chapter  9  is  a  stand-alone  chapter  that  presents  nonparametric  and  semiparametric 
estimation  methods  that  place  a  flexible  structure  on  the  econometric  model. 
Chapter 10 presents the computational methods used to compute the nonlinear estimators
presented in Chapters 5 and 6. This material becomes especially relevant to the
practitioner  if  an  estimator  is  not  automatically  computed  by  an  econometrics  package, 
or  if  numerical  difficulties  are  encountered  in  model  estimation. 
Linear  Models


4.1.  Introduction 

A  great  deal  of  empirical  microeconometrics  research  uses  linear  regression  and  its 
various  extensions.  Before  moving  to  nonlinear  models,  the  emphasis  of  this  book, 
we  provide  a  summary  of  some  important  results  for  the  single-equation  linear  regres¬ 
sion  model  with  cross-section  data.  Several  different  estimators  in  the  linear  regression 
model  are  presented. 

Ordinary  least-squares  (OLS)  estimation  is  especially  popular.  For  typical  microe- 
conometric  cross-section  data  the  model  error  terms  are  likely  to  be  heteroskedas- 
tic.  Then  statistical  inference  should  be  robust  to  heteroskedastic  errors  and  efficiency 
gains  are  possible  by  use  of  weighted  rather  than  ordinary  least  squares. 

The  OLS  estimator  minimizes  the  sum  of  squared  residuals.  One  alternative  is  to 
minimize  the  sum  of  the  absolute  value  of  residuals,  leading  to  the  least  absolute  de¬ 
viations  estimator.  This  estimator  is  also  presented,  along  with  extension  to  quantile 
regression. 

Various  model  misspecifications  can  lead  to  inconsistency  of  least-squares  estima¬ 
tors.  In  such  cases  inference  about  economically  interesting  parameters  may  require 
more  advanced  procedures  and  these  are  pursued  at  considerable  length  and  depth  else¬ 
where  in  the  book.  One  commonly  used  procedure  is  instrumental  variables  regression. 
The  current  chapter  provides  an  introductory  treatment  of  this  important  method  and 
additionally  addresses  the  complication  of  weak  instruments. 

Section  4.2  provides  a  definition  of  regression  and  presents  various  loss  functions 
that  lead  to  different  estimators  for  the  regression  function.  An  example  is  introduced 
in  Section  4.3.  Some  leading  estimation  procedures,  specifically  ordinary  least  squares, 
weighted least squares, and quantile regression, are presented in, respectively,
Sections 4.4, 4.5, and 4.6. Model misspecification is considered in Section 4.7.
Instrumental variables regression is presented in Sections 4.8 and 4.9. Sections 4.3-4.5, 4.7,
and  4.8  cover  standard  material  in  introductory  courses,  whereas  Sections  4.2,  4.6,  and 
4.9  introduce  more  advanced  material. 
4.2.  Regressions  and  Loss  Functions

In  modern  microeconometrics  the  term  regression  refers  to  a  bewildering  range  of 
procedures  for  studying  the  relationship  between  an  outcome  variable  y  and  a  set  of 
regressors  x.  It  is  helpful,  therefore,  to  state  at  the  beginning  the  motivation  and  justi¬ 
fication  for  some  of  the  leading  types  of  regressions. 

For  exposition  it  is  convenient  to  think  of  the  purpose  of  regression  to  be  condi¬ 
tional  prediction  of  y  given  x.  In  practice,  regression  models  are  also  used  for  other 
purposes,  most  notably  causal  inference.  Even  then  a  prediction  function  constitutes  a 
useful  data  summary  and  is  still  of  interest.  In  particular,  see  Section  4.2.3  for  the  dis¬ 
tinction  between  linear  prediction  and  causal  inference  based  on  a  linear  causal  mean. 


4.2.1.  Loss  Functions 

Let $\hat{y}$ denote the predictor defined as a function of $\mathbf{x}$. Let
$e \equiv y - \hat{y}$ denote the prediction error, and let

$$L(e) = L(y - \hat{y}) \tag{4.1}$$

denote the loss associated with the error $e$. As in decision analysis we assume that the
predictor forms the basis of some decision, and the prediction error leads to disutility
on the part of the decision maker that is captured by $L(e)$, whose precise functional
form is a choice of the decision maker. The loss function has the property that it is
increasing in $|e|$.

Treating $(y, \hat{y})$ as random, the decision maker minimizes the expected value of the
loss function, denoted $\mathrm{E}[L(e)]$. If the predictor depends on $\mathbf{x}$, a
$K$-dimensional vector, then expected loss is expressed as

$$\mathrm{E}[L((y - \hat{y})|\mathbf{x})]. \tag{4.2}$$

The  choice  of  the  loss  function  should  depend  in  a  substantive  way  on  the  losses 
associated  with  prediction  errors.  In  some  situations,  such  as  weather  forecasting,  there 
may  be  a  sound  basis  for  choosing  one  loss  function  over  another. 

In  econometrics,  there  is  often  no  clear  guide  and  the  convention  is  to  specify 
quadratic loss. Then (4.1) specializes to $L(e) = e^2$ and by (4.2) the optimal predictor
minimizes the expected loss $\mathrm{E}[L(e|\mathbf{x})] = \mathrm{E}[e^2|\mathbf{x}]$. It follows that in this case the
minimum  mean-squared  prediction  error  criterion  is  used  to  compare  predictors. 


4.2.2.  Optimal  Prediction 

The decision theory approach to choosing the optimal predictor is framed in terms of
minimizing expected loss,

$$\min_{\hat{y}} \mathrm{E}[L((y - \hat{y})|\mathbf{x})].$$

Thus  the  optimality  property  is  relative  to  the  loss  function  of  the  decision  maker. 
Table 4.1. Loss Functions and Corresponding Optimal Predictors

Type of Loss Function      Definition                                                              Optimal Predictor
Squared error loss         $L(e) = e^2$                                                            $\mathrm{E}[y|\mathbf{x}]$
Absolute error loss        $L(e) = |e|$                                                            $\mathrm{med}[y|\mathbf{x}]$
Asymmetric absolute loss   $L(e) = (1 - \alpha)|e|$ if $e < 0$;  $\alpha|e|$ if $e \geq 0$         $q_\alpha[y|\mathbf{x}]$
Step loss                  $L(e) = 0$ if $e < 0$;  $1$ if $e \geq 0$                               $\mathrm{mod}[y|\mathbf{x}]$

Four leading examples of loss function, and the associated optimal predictor function,
are given in Table 4.1. We provide a brief presentation for each in turn. A detailed
analysis is given in Manski (1988a).

The most well known loss function is the squared error loss (or mean-square loss)
function. Then the optimal predictor of $y$ is the conditional mean function,
$\mathrm{E}[y|\mathbf{x}]$. In the most general case no structure is placed on
$\mathrm{E}[y|\mathbf{x}]$ and estimation is by nonparametric regression (see Chapter 9).
More often a model for $\mathrm{E}[y|\mathbf{x}]$ is specified, with
$\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta})$, where $g(\cdot)$ is a specified
function and $\boldsymbol{\beta}$ is a finite-dimensional vector of parameters that needs to
be estimated. The optimal prediction is $\hat{y} = g(\mathbf{x}, \hat{\boldsymbol{\beta}})$,
where $\hat{\boldsymbol{\beta}}$ is chosen to minimize the in-sample loss

$$\sum_{i=1}^{N} L(e_i) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - g(\mathbf{x}_i, \boldsymbol{\beta}))^2.$$

The loss function is the sum of squared residuals, so estimation is by nonlinear least
squares (see Section 5.8). If the conditional mean function $g(\cdot)$ is restricted to be linear
in $\mathbf{x}$ and $\boldsymbol{\beta}$, so that $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$,
then the optimal predictor is $\hat{y} = \mathbf{x}'\hat{\boldsymbol{\beta}}$, where
$\hat{\boldsymbol{\beta}}$ is the ordinary least-squares estimator detailed in Section 4.4.
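
As an illustration of minimizing the in-sample quadratic loss, the following sketch fits a one-parameter nonlinear model by Gauss-Newton iterations. The exponential form chosen for g, the noise-free data, the starting value, and the function names are all assumptions made for this example. Section 5.8 and Chapter 10 give the general treatment.

```python
import math

def g(x, beta):
    """Assumed conditional mean function: exponential in a single regressor."""
    return math.exp(beta * x)

def ssr(beta, xs, ys):
    """In-sample quadratic loss, that is, the sum of squared residuals."""
    return sum((y - g(x, beta)) ** 2 for x, y in zip(xs, ys))

def nls_gauss_newton(xs, ys, beta=0.0, steps=50, tol=1e-12):
    """Minimize the sum of squared residuals by Gauss-Newton iterations."""
    for _ in range(steps):
        grads = [x * g(x, beta) for x in xs]              # dg/dbeta
        resid = [y - g(x, beta) for x, y in zip(xs, ys)]
        # Regress the residuals on the gradient to obtain the update.
        step = sum(d * r for d, r in zip(grads, resid)) / sum(d * d for d in grads)
        beta += step
        if abs(step) < tol:
            break
    return beta

xs = [i / 20 for i in range(1, 21)]        # regressor values 0.05, 0.10, ..., 1.00
ys = [math.exp(0.5 * x) for x in xs]       # noise-free outcomes with beta = 0.5
beta_hat = nls_gauss_newton(xs, ys)

print(f"estimated beta: {beta_hat:.6f}")
print(f"sum of squared residuals at the estimate: {ssr(beta_hat, xs, ys):.2e}")
```

Because the data are generated without noise, the minimized loss is essentially zero here. With noisy data the same iterations would apply but the estimate would differ from the value used to generate the data.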

If the loss criterion is absolute error loss, then the optimal predictor is the conditional
median, denoted $\mathrm{med}[y|\mathbf{x}]$. If the conditional median function is linear, so
that $\mathrm{med}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$, then the optimal predictor is
$\hat{y} = \mathbf{x}'\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the least
absolute deviations estimator that minimizes $\sum_i |y_i - \mathbf{x}_i'\boldsymbol{\beta}|$.
This estimator is presented in Section 4.6.

Both the squared error and absolute error loss functions are symmetric, so the same
penalty is imposed for prediction error of a given magnitude regardless of the direction
of the prediction error. Asymmetric absolute error loss instead places a penalty
of $(1 - \alpha)|e|$ on overprediction and a different penalty $\alpha|e|$ on underprediction. The
asymmetry parameter $\alpha$ is specified. It lies in the interval $(0, 1)$ with symmetry when
$\alpha = 0.5$ and increasing asymmetry as $\alpha$ approaches 0 or 1. The optimal predictor can
be shown to be the conditional quantile, denoted $q_\alpha[y|\mathbf{x}]$; a special case is the
conditional median when $\alpha = 0.5$. Conditional quantiles are defined in Section 4.6, which
presents quantile regression (Koenker and Bassett, 1978).
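
The correspondence between loss functions and optimal predictors can be checked numerically in the simplest case, where there are no regressors and the predictor is a constant. The sketch below uses a small hypothetical sample and a brute-force grid search. The sample, the grid, and the function names are assumptions for illustration. It searches for the constant that minimizes in-sample loss under squared error loss, absolute error loss, and asymmetric absolute loss with the asymmetry parameter set to 0.8.

```python
from statistics import mean, median

y = [1, 2, 3, 4, 5, 6, 7, 8, 30]              # small skewed sample (hypothetical)
grid = [i / 100 for i in range(100, 3001)]    # candidates 1.00, 1.01, ..., 30.00

def squared(e):
    """Squared error loss."""
    return e ** 2

def asymmetric_absolute(alpha):
    """Loss equal to (1 - alpha)|e| if e < 0 and alpha|e| if e >= 0."""
    return lambda e: (1 - alpha) * abs(e) if e < 0 else alpha * abs(e)

def best_constant(loss):
    """Grid search for the constant predictor with smallest in-sample loss."""
    return min(grid, key=lambda c: sum(loss(v - c) for v in y))

best_squared = best_constant(squared)
best_absolute = best_constant(asymmetric_absolute(0.5))   # proportional to |e|
best_q80 = best_constant(asymmetric_absolute(0.8))

print(f"squared error loss:    {best_squared:.2f}  (sample mean {mean(y):.2f})")
print(f"absolute error loss:   {best_absolute:.2f}  (sample median {median(y)})")
print(f"asymmetric, alpha=0.8: {best_q80:.2f}")
```

The outlying value 30 pulls the squared error minimizer toward it but leaves the absolute error minimizer at the sample median. Raising the asymmetry parameter to 0.8 moves the minimizer to a higher order statistic of the sample.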

The last loss function given in Table 4.1 is step loss, which bases the loss simply on
the sign of the prediction error regardless of the magnitude. The optimal predictor is the
conditional mode, denoted $\mathrm{mod}[y|\mathbf{x}]$. This provides motivation for mode regression
(Lee, 1989).

Maximum  likelihood  does  not  fall  as  easily  into  the  prediction  framework  of  this 
section.  It  can,  however,  be  given  an  expected  loss  interpretation  in  terms  of  predicting 
the density and minimizing Kullback-Leibler information (see Section 5.7).

The  results  just  stated  imply  that  the  econometrician  interested  in  estimating  a  pre¬ 
diction  function  from  the  data  (y,  x)  should  choose  the  prediction  function  according 
to  the  loss  function.  The  use  of  the  popular  linear  regression  implies,  at  least  implicitly, 
that  the  decision  maker  has  a  quadratic  loss  function  and  believes  that  the  conditional 
mean  function  is  linear.  However,  if  one  of  the  other  three  loss  functions  is  specified, 
then  the  optimal  predictor  will  be  based  on  one  of  the  three  other  types  of  regressions. 
In  practice  there  can  be  no  clear  reason  for  preferring  a  particular  loss  function. 

Regressions  are  often  used  as  data  summaries,  rather  than  for  prediction  per  se. 
Then  it  can  be  useful  to  consider  a  range  of  estimators,  as  alternative  estimators  may 
provide  useful  information  about  the  sensitivity  of  estimates.  Manski  (1988a,  1991) 
has  pointed  out  that  the  quadratic  and  absolute  error  loss  functions  are  both  convex.  If 
the conditional distribution of $y|\mathbf{x}$ is symmetric then the conditional mean and median
estimators are both consistent and can be expected to be quite close. Furthermore, if
one avoids assumptions about the distribution of $y|\mathbf{x}$, then differences in alternative
estimators  provide  a  way  of  learning  about  the  data  distribution. 

4.2.3.  Linear  Prediction 

The optimal predictor under squared error loss is the conditional mean $\mathrm{E}[y|\mathbf{x}]$. If this
conditional mean is linear in $\mathbf{x}$, so that $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$,
the parameter $\boldsymbol{\beta}$ has a structural or causal interpretation and consistent estimation
of $\boldsymbol{\beta}$ by OLS implies consistent estimation of
$\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$. This permits meaningful policy analysis of effects of changes
in regressors on the conditional mean.

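If instead the conditional mean is nonlinear in $\mathbf{x}$, so that
$\mathrm{E}[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}$, the structural
interpretation of OLS disappears. However, it is still possible to interpret $\boldsymbol{\beta}$ as the best
linear predictor under squared error loss. Differentiation of the expected loss
$\mathrm{E}[(y - \mathbf{x}'\boldsymbol{\beta})^2]$ with respect to $\boldsymbol{\beta}$ yields first-order conditions
$-2\mathrm{E}[\mathbf{x}(y - \mathbf{x}'\boldsymbol{\beta})] = \mathbf{0}$, so the optimal linear predictor is
$\boldsymbol{\beta} = (\mathrm{E}[\mathbf{x}\mathbf{x}'])^{-1}\mathrm{E}[\mathbf{x}y]$ with sample analogue the OLS estimator.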

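Usually we specialize to models with intercept. In a change of notation we define $\mathbf{x}$
to denote regressors excluding the intercept, and we replace $\mathbf{x}'\boldsymbol{\beta}$ by
$\alpha + \mathbf{x}'\boldsymbol{\gamma}$. The first-order conditions with respect to $\alpha$ and
$\boldsymbol{\gamma}$ are that $-2\mathrm{E}[u] = 0$ and $-2\mathrm{E}[\mathbf{x}u] = \mathbf{0}$, where
$u = y - (\alpha + \mathbf{x}'\boldsymbol{\gamma})$. These imply that $\mathrm{E}[u] = 0$ and
$\mathrm{Cov}[\mathbf{x}, u] = \mathbf{0}$. Solving yields

$$\begin{aligned}
\boldsymbol{\gamma} &= (\mathrm{V}[\mathbf{x}])^{-1}\mathrm{Cov}[\mathbf{x}, y], \\
\alpha &= \mathrm{E}[y] - \mathrm{E}[\mathbf{x}']\boldsymbol{\gamma};
\end{aligned} \tag{4.3}$$

see, for example, Goldberger (1991, p. 52).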
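
The step from the two first-order conditions to the solution can be spelled out. The following lines are a sketch of the intermediate algebra, using only the definition of the error term as the outcome minus the intercept and the linear combination of the regressors.

```latex
\begin{align*}
\mathrm{E}[u] = 0
  &\;\Rightarrow\; \alpha = \mathrm{E}[y] - \mathrm{E}[\mathbf{x}']\boldsymbol{\gamma}, \\
\text{so that}\quad
u &= (y - \mathrm{E}[y]) - (\mathbf{x} - \mathrm{E}[\mathbf{x}])'\boldsymbol{\gamma}, \\
\mathrm{E}[\mathbf{x}u] = \mathbf{0}
  &\;\Rightarrow\; \mathrm{E}[\mathbf{x}(y - \mathrm{E}[y])]
     = \mathrm{E}[\mathbf{x}(\mathbf{x} - \mathrm{E}[\mathbf{x}])']\boldsymbol{\gamma} \\
  &\;\Rightarrow\; \mathrm{Cov}[\mathbf{x}, y] = \mathrm{V}[\mathbf{x}]\boldsymbol{\gamma}
   \;\Rightarrow\; \boldsymbol{\gamma} = (\mathrm{V}[\mathbf{x}])^{-1}\mathrm{Cov}[\mathbf{x}, y].
\end{align*}
```

The third line uses the identities $\mathrm{E}[\mathbf{x}(y - \mathrm{E}[y])] = \mathrm{Cov}[\mathbf{x}, y]$ and $\mathrm{E}[\mathbf{x}(\mathbf{x} - \mathrm{E}[\mathbf{x}])'] = \mathrm{V}[\mathbf{x}]$.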

From the derivation of (4.3) it should be clear that for data $(y, \mathbf{x})$ we can always
write a linear regression model

$$y = \alpha + \mathbf{x}'\boldsymbol{\gamma} + u, \tag{4.4}$$

where the parameters $\alpha$ and $\boldsymbol{\gamma}$ are defined in (4.3) and the error term $u$
satisfies $\mathrm{E}[u] = 0$ and $\mathrm{Cov}[\mathbf{x}, u] = \mathbf{0}$.

A linear regression model can therefore always be given the nonstructural or
reduced form interpretation as the best linear prediction (or linear projection)
under squared error loss. However, for the conditional mean to be linear in $x$, so that
$E[y|x] = \alpha + x'\gamma$, requires the assumption that $E[u|x] = 0$, in addition to $E[u] = 0$ and
$\mathrm{Cov}[x, u] = 0$.

This distinction is of practical importance. For example, if $E[u|x] = 0$, so that
$E[y|x] = \alpha + x'\gamma$, then the probability limit of a least-squares (LS) estimator $\hat{\gamma}$ is $\gamma$
regardless of whether the LS estimator is weighted or unweighted, or whether the
sample is obtained by simple random sampling or by exogenous stratified sampling. If
instead $E[y|x] \neq \alpha + x'\gamma$ then these different LS estimators may have different
probability limits. This example is discussed further in Section 24.3.
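
The role of the conditional mean assumption can be sketched numerically. The code below works with population moments for a scalar regressor on a grid, so no sampling error is involved. The grid, the weights `w_strat` (a hypothetical scheme that oversamples $x > 1$), and the two conditional mean functions are illustrative assumptions and are not taken from Section 24.3.

```python
import numpy as np

def linear_projection(x, m, w):
    """Best linear predictor (alpha, gamma) of a conditional mean m(x)
    for a discrete distribution of x with probability weights w."""
    w = w / w.sum()
    X = np.column_stack([np.ones_like(x), x])
    A = X.T @ (w[:, None] * X)          # E[xx'] under the weights
    b = X.T @ (w * m)                   # E[x y] = E[x E[y|x]]
    return np.linalg.solve(A, b)

x = np.linspace(0.0, 2.0, 201)          # support of the scalar regressor
w_srs = np.ones_like(x)                 # equal weights, as under simple random sampling
w_strat = 1.0 + 2.0 * (x > 1.0)         # hypothetical exogenous scheme: high-x values oversampled

linear_mean = 1.0 + 0.5 * x             # E[y|x] linear in x
nonlinear_mean = x ** 2                 # E[y|x] nonlinear in x

lin_srs, lin_strat = (linear_projection(x, linear_mean, w) for w in (w_srs, w_strat))
non_srs, non_strat = (linear_projection(x, nonlinear_mean, w) for w in (w_srs, w_strat))
```

With the linear conditional mean both weighting schemes recover the same intercept and slope. With the quadratic conditional mean the best linear predictor depends on the weights, so the two limits differ.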

A  structural  interpretation  of  OLS  requires  that  the  conditional  mean  of  the  error 
term,  given  regressors,  equals  zero. 


4.3.  Example:  Returns  to  Schooling 

A  leading  linear  regression  application  from  labor  economics  concerns  measuring  the 
impact  of  education  on  wages  or  earnings. 

A  typical  returns  to  schooling  model  specifies 

$$\ln w_i = \alpha s_i + x_{2i}'\beta + u_i, \quad i = 1, \ldots, N, \tag{4.5}$$

where $w$ denotes hourly wage or annual earnings, $s$ denotes years of completed
schooling, and $x_2$ denotes control variables such as work experience, gender, and family
background. The subscript $i$ denotes the $i$th person in the sample. Since the dependent
variable is log wage, the model is a log-linear model and the coefficient $\alpha$ measures
the proportionate change in earnings associated with a one-year increase in education.

Estimation of this model is most often by ordinary least squares. The transformation
to $\ln w$ in practice ensures that errors are approximately homoskedastic, but it
is still best to obtain heteroskedastic-consistent standard errors as detailed in
Section 4.4. Estimation can also be by quantile regression (see Section 4.6), if interest
lies in distributional issues such as behavior in the lower quartile.

The regression (4.5) can be used immediately in a descriptive manner. For example,
if $\alpha = 0.10$ then a one-year increase in schooling is associated with 10% higher
earnings, controlling for all the factors included in $x_2$. It is important to add the last
qualifier as in this example the estimate of $\alpha$ usually becomes smaller as $x_2$ is expanded
to include additional controls likely to influence earnings.
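
A small numerical aside, under the assumption that $\alpha$ is read as a change in log earnings: the 10% figure is the usual first-order approximation, and the exact proportionate change implied by a log difference of $\alpha$ is $\exp(\alpha) - 1$. The snippet below compares the two for the illustrative value used in the text.

```python
import math

alpha = 0.10                              # illustrative coefficient on schooling
approx_change = alpha                     # first-order approximation to the proportionate change
exact_change = math.exp(alpha) - 1.0      # exact proportionate change implied by the log difference

print(f"approximate: {approx_change:.4f}, exact: {exact_change:.4f}")
```

For coefficients of this size the two numbers are close; the gap widens as $\alpha$ grows.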

Policy interest lies in determining the impact of an exogenous change in schooling
on earnings. However, schooling is not randomly assigned; rather, it is an outcome that
depends on choices made by the individual. Human capital theory treats schooling as
investment by individuals in themselves, and $\alpha$ is interpreted as a measure of return to
human capital. The regression (4.5) is then a regression of one endogenous variable,
$y$, on another, $s$, and so does not measure the causal impact of an exogenous change
in $s$. The conditional mean function here is not causally meaningful because one is
conditioning on a factor, schooling, that is endogenous. Indeed, unless we can argue
that $s$ is itself a function of variables at least one of which can vary independently of
$u$, it is unclear just what it means to regard $\alpha$ as a causal parameter.

Such concern about endogenous regressors with observational data on individuals
pervades microeconometric analysis. The standard assumptions of the linear regression
model given in Section 4.4 are that regressors are exogenous. The consequences
of endogenous regressors are considered in Section 4.7. One method to control for
endogenous regressors, instrumental variables, is detailed in Section 4.8. A recent
extensive review of ways to control for endogeneity in this wage-schooling example is
given in Angrist and Krueger (1999). These methods are summarized in Section 2.8
and presented throughout this book.

4.4.  Ordinary  Least  Squares 

The  simplest  example  of  regression  is  the  OLS  estimator  in  the  linear  regression  model. 

After first defining the model and estimator, a quite detailed presentation of the
asymptotic distribution of the OLS estimator is given. The exposition presumes
previous exposure to a more introductory treatment. The model assumptions made here
permit stochastic regressors and heteroskedastic errors and accommodate data that are
obtained by exogenous stratified sampling.

The key result of how to obtain heteroskedasticity-robust standard errors of the OLS
estimator is given in Section 4.4.5.

4.4.1.  Linear  Regression  Model 

In  a  standard  cross-section  regression  model  with  N  observations  on  a  scalar 
dependent  variable  and  several  regressors,  the  data  are  specified  as  (y,  X),  where  y 
denotes  observations  on  the  dependent  variable  and  X  denotes  a  matrix  of  explanatory 
variables. 

The  general  regression  model  with  additive  errors  is  written  in  vector  notation  as 

y  =  E[y|X]+u,  (4.6) 

where  E[y|X]  denotes  the  conditional  expectation  of  the  random  variable  y  given  X, 
and  u  denotes  a  vector  of  unobserved  random  errors  or  disturbances.  The  right-hand 
side  of  this  equation  decomposes  y  into  two  components,  one  that  is  deterministic 
given  the  regressors  and  one  that  is  attributed  to  random  variation  or  noise.  We  think 
of  E[y|X]  as  a  conditional  prediction  function  that  yields  the  average  value,  or  more 
formally  the  expected  value,  of  y  given  X. 

A linear regression model is obtained when $E[y|X]$ is specified to be a linear
function of $X$. Notation for this model has been presented in detail in Section 1.6. In vector
notation the $i$th observation is

$$y_i = x_i'\beta + u_i, \tag{4.7}$$

where $x_i$ is a $K \times 1$ regressor vector and $\beta$ is a $K \times 1$ parameter vector. At times
it is simpler to drop the subscript $i$ and write the model for a typical observation as
$y = x'\beta + u$. In matrix notation the $N$ observations are stacked by row to yield

$$y = X\beta + u, \tag{4.8}$$

where $y$ is an $N \times 1$ vector of dependent variables, $X$ is an $N \times K$ regression
matrix, and $u$ is an $N \times 1$ error vector.

Equations  (4.7)  and  (4.8)  are  equivalent  expressions  for  the  linear  regression  model 
and  will  be  used  interchangeably.  The  latter  is  more  concise  and  is  usually  the  most 
convenient  representation. 

In this setting $y$ is referred to as the dependent variable or endogenous variable
whose variation we wish to study in terms of variation in $x$ and $u$; $u$ is referred to as
the error term or disturbance term; and $x$ is referred to as regressors or predictors
or covariates. If Assumption 4 in Section 4.4.6 holds, then all components of $x$ are
exogenous variables or independent variables.


4.4.2.  OLS  Estimator 

The OLS estimator is defined to be the estimator that minimizes the sum of squared
errors

$$\sum_{i=1}^{N} u_i^2 = u'u = (y - X\beta)'(y - X\beta). \tag{4.9}$$

Setting the derivative with respect to $\beta$ equal to 0 and solving for $\beta$ yields the OLS
estimator,

$$\hat{\beta}_{\mathrm{OLS}} = (X'X)^{-1}X'y, \tag{4.10}$$

where it is assumed that the matrix inverse of $X'X$ exists; see Exercise 4.5 for a more
general result. If $X'X$ is of less than full rank, the inverse can be replaced by a generalized
inverse. Then OLS estimation still yields the optimal linear predictor of $y$ given $x$ if
squared error loss is used, but many different linear combinations of $x$ will yield this
optimal predictor.
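
The following sketch illustrates (4.10) and the rank-deficient case, under assumptions of our own: simulated data, illustrative parameter values, and the Moore-Penrose pseudoinverse from NumPy as one particular choice of generalized inverse (with an explicit cutoff `rcond=1e-10` for small singular values, also our choice).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -0.5])       # illustrative parameter values
y = X @ beta + rng.normal(size=N)

# Full-rank case: solve the normal equations X'X b = X'y, as in (4.10)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Rank-deficient case: append a column equal to the sum of two existing regressors
X_def = np.column_stack([X, X[:, 1] + X[:, 2]])
G = np.linalg.pinv(X_def.T @ X_def, rcond=1e-10)   # generalized inverse of X'X
b_ginv = G @ X_def.T @ y

fitted_full = X @ b_ols
fitted_def = X_def @ b_ginv
```

In this example `b_ginv` is only one of many coefficient vectors that produce the same fitted values, while the fitted values themselves coincide with those from the full-rank regression.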


4.4.3.  Identification 

The OLS estimator can always be computed, provided that $X'X$ is nonsingular. The
more interesting issue is what $\hat{\beta}_{\mathrm{OLS}}$ tells us about the data.

We focus on the ability of the OLS estimator to permit identification (see Section
2.5) of the conditional mean $E[y|X]$. For the linear model the parameter $\beta$ is identified
if

1. $E[y|X] = X\beta$ and

2. $X\beta^{(1)} = X\beta^{(2)}$ if and only if $\beta^{(1)} = \beta^{(2)}$.

The first condition that the conditional mean is correctly specified ensures that $\beta$ is
of intrinsic interest; the second assumption implies that $X'X$ is nonsingular, which is
the same condition needed to compute the unique OLS estimate (4.10).


4.4.4.  Distribution  of  the  OLS  Estimator 

We focus on the asymptotic properties of the OLS estimator. Consistency is
established and then the limit distribution is obtained by rescaling the OLS estimator.
Statistical inference then requires consistent estimation of the variance matrix of the
estimator. The analysis makes extensive use of asymptotic theory, which is
summarized in Appendix A.


Consistency 

The properties of an estimator depend on the process that actually generated the data,
the data generating process (dgp). We assume the dgp is $y = X\beta + u$, so that the
model (4.8) is correctly specified. In some places, notably Chapters 5 and 6 and
Appendix A, the subscript 0 is added to $\beta$, so the dgp is $y = X\beta_0 + u$. See Section 5.2.3
for discussion.

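Then

$$\begin{aligned}
\hat{\beta}_{\mathrm{OLS}} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + u) \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u,
\end{aligned}$$

and the OLS estimator can be expressed as

$$\hat{\beta}_{\mathrm{OLS}} = \beta + (X'X)^{-1}X'u. \tag{4.11}$$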

To prove consistency we rewrite (4.11) as

$$\hat{\beta}_{\mathrm{OLS}} = \beta + (N^{-1}X'X)^{-1}N^{-1}X'u. \tag{4.12}$$

The reason for renormalization in the right-hand side is that $N^{-1}X'X = N^{-1}\sum_i x_i x_i'$
is an average that converges in probability to a finite nonzero matrix if $x_i$ satisfies
assumptions that permit a law of large numbers to be applied to $x_i x_i'$ (see Section 4.4.8
for detail). Then

$$\mathrm{plim}\,\hat{\beta}_{\mathrm{OLS}} = \beta + (\mathrm{plim}\,N^{-1}X'X)^{-1}(\mathrm{plim}\,N^{-1}X'u),$$

using Slutsky's Theorem (Theorem A.3). The OLS estimator is consistent for $\beta$ (i.e.,
$\mathrm{plim}\,\hat{\beta}_{\mathrm{OLS}} = \beta$) if

$$\mathrm{plim}\,N^{-1}X'u = 0. \tag{4.13}$$

If a law of large numbers can be applied to the average $N^{-1}X'u = N^{-1}\sum_i x_i u_i$, then
a necessary condition for (4.13) to hold is that $E[x_i u_i] = 0$.
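
Consistency can be illustrated by simulation. The sketch below is one assumed design, not a proof: the regressor is standard normal, the error is heteroskedastic with $E[u_i|x_i] = 0$, and the sample sizes, seed, and parameter values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, 0.5])                 # illustrative intercept and slope

def ols_error(N):
    """Draw one sample of size N and return the largest absolute OLS estimation error."""
    x = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    u = np.abs(x) * rng.normal(size=N)      # heteroskedastic error with E[u|x] = 0
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return np.max(np.abs(b - beta))

errors = {N: ols_error(N) for N in (100, 10_000, 1_000_000)}
for N, err in errors.items():
    print(f"N = {N:>9,d}   max |b - beta| = {err:.5f}")
```

Each line reports a single draw, so the errors need not fall monotonically, but at the largest sample size the estimate should lie very close to $\beta$ because $E[x_i u_i] = 0$ holds in this design.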


Limit Distribution

Given consistency, the limit distribution of $\hat{\beta}_{\mathrm{OLS}}$ is degenerate with all the mass at $\beta$.
To obtain the limit distribution we multiply $\hat{\beta}_{\mathrm{OLS}} - \beta$ by $\sqrt{N}$, as this rescaling leads to a
random variable that under standard cross-section assumptions has nonzero yet finite
variance asymptotically. Then (4.11) becomes

$$\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta) = (N^{-1}X'X)^{-1}N^{-1/2}X'u. \tag{4.14}$$

The proof of consistency assumed that $\mathrm{plim}\,N^{-1}X'X$ exists and is finite and nonzero.
We assume that a central limit theorem can be applied to $N^{-1/2}X'u$ to yield a
multivariate normal limit distribution with finite, nonsingular covariance matrix. Applying
the product rule for limit normal distributions (Theorem A.17) implies that the product
in the right-hand side of (4.14) has a limit normal distribution. Details are provided in
Section 4.4.8.

This  leads  to  the  following  proposition,  which  permits  regressors  to  be  stochastic 
and  does  not  restrict  model  errors  to  be  homoskedastic  and  uncorrelated. 


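Proposition 4.1 (Distribution of OLS Estimator). Make the following assumptions:

(i) The dgp is model (4.8), that is, $y = X\beta + u$.

(ii) Data are independent over $i$ with $E[u|X] = 0$ and $E[uu'|X] = \Omega = \mathrm{Diag}[\sigma_i^2]$.

(iii) The matrix $X$ is of full rank so that $X\beta^{(1)} = X\beta^{(2)}$ iff $\beta^{(1)} = \beta^{(2)}$.

(iv) The $K \times K$ matrix

$$M_{xx} = \mathrm{plim}\,N^{-1}X'X = \mathrm{plim}\,\frac{1}{N}\sum_{i=1}^{N} x_i x_i' = \lim \frac{1}{N}\sum_{i=1}^{N} E[x_i x_i'] \tag{4.15}$$

exists and is finite nonsingular.

(v) The $K \times 1$ vector $N^{-1/2}X'u = N^{-1/2}\sum_{i=1}^{N} x_i u_i \xrightarrow{d} \mathcal{N}[0, M_{x\Omega x}]$, where

$$M_{x\Omega x} = \mathrm{plim}\,N^{-1}X'uu'X = \mathrm{plim}\,\frac{1}{N}\sum_{i=1}^{N} u_i^2 x_i x_i' = \lim \frac{1}{N}\sum_{i=1}^{N} E[u_i^2 x_i x_i']. \tag{4.16}$$

Then the OLS estimator $\hat{\beta}_{\mathrm{OLS}}$ defined in (4.10) is consistent for $\beta$ and

$$\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta) \xrightarrow{d} \mathcal{N}[0, M_{xx}^{-1}M_{x\Omega x}M_{xx}^{-1}]. \tag{4.17}$$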

Assumption (i) is used to obtain (4.11). Assumption (ii) ensures $E[y|X] = X\beta$ and
permits heteroskedastic errors with variance $\sigma_i^2$, more general than the homoskedastic
uncorrelated errors that restrict $\Omega = \sigma^2 I$. Assumption (iii) rules out perfect
collinearity among the regressors. Assumption (iv) leads to the rescaling of $X'X$ by $N^{-1}$ in
(4.12) and (4.14). Note that by a law of large numbers plim = lim E (see Appendix
Section A.3).

The essential condition for consistency is (4.13). Rather than directly assume this
we have used the stronger assumption (v) which is needed to obtain result (4.17).
Given that $N^{-1/2}X'u$ has a limit distribution with zero mean and finite variance,
multiplication by $N^{-1/2}$ yields a random variable that converges in probability to zero and
so (4.13) holds as desired. Assumption (v) is required, along with assumption (iv), to
obtain the limit normal result (4.17), which by Theorem A.17 then follows
immediately from (4.14). More primitive assumptions on $u_i$ and $x_i$ that ensure (iv) and (v) are
satisfied are given in Section 4.4.6, with formal proof in Section 4.4.8.
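
A Monte Carlo sketch can make the limit variance in (4.17) concrete. The design is an assumption chosen so that the moments are known in closed form: a scalar regressor $x_i \sim \mathcal{N}[0, 1]$ with no intercept, and $u_i = x_i e_i$ with $e_i \sim \mathcal{N}[0, 1]$ independent of $x_i$. Then $M_{xx} = E[x^2] = 1$ and $M_{x\Omega x} = E[u^2 x^2] = E[x^4] = 3$, so the limit variance is 3. The sample size, number of replications, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, N, reps = 2.0, 500, 4_000

# Assumed design: x ~ N(0,1), no intercept, u = x * e with e ~ N(0,1)
x = rng.normal(size=(reps, N))
u = x * rng.normal(size=(reps, N))
y = beta * x + u

b = (x * y).sum(axis=1) / (x * x).sum(axis=1)   # OLS estimate in each replication
z = np.sqrt(N) * (b - beta)                     # rescaled estimator, as in (4.14)

mc_variance = z.var()            # simulated variance of sqrt(N) * (b - beta)
sandwich_variance = 3.0          # M_xx^{-1} M_xOmegax M_xx^{-1} for this design
default_variance = 1.0           # sigma^2 M_xx^{-1} with sigma^2 = E[u^2] = 1
```

In this design the simulated variance should be close to the sandwich value of 3, whereas the homoskedastic formula $\sigma^2 M_{xx}^{-1}$ would give 1 and so would understate the sampling variability.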


Asymptotic  Distribution 

Proposition 4.1 gives the limit distribution of $\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta)$, a rescaling of $\hat{\beta}_{\mathrm{OLS}}$.
Many practitioners prefer to see asymptotic results written directly in terms of the
distribution of $\hat{\beta}_{\mathrm{OLS}}$, in which case the distribution is called an asymptotic distribution.
This asymptotic distribution is interpreted as being applicable in large samples,
meaning samples large enough for the limit distribution to be a good approximation but not
so large that $\hat{\beta}_{\mathrm{OLS}} \xrightarrow{p} \beta$ as then its asymptotic distribution would be degenerate. The
discussion mirrors that in Appendix A.6.4.

The asymptotic distribution is obtained from (4.17) by division by $\sqrt{N}$ and addition
of $\beta$. This yields the asymptotic distribution

$$\hat{\beta}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}[\beta, N^{-1}M_{xx}^{-1}M_{x\Omega x}M_{xx}^{-1}], \tag{4.18}$$

where the symbol $\stackrel{a}{\sim}$ means "is asymptotically distributed as." The variance matrix
in (4.18) is called the asymptotic variance matrix of $\hat{\beta}_{\mathrm{OLS}}$ and is denoted $V[\hat{\beta}_{\mathrm{OLS}}]$.
Even simpler notation drops the limits and expectations in the definitions of $M_{xx}$ and
$M_{x\Omega x}$ and the asymptotic distribution is denoted

$$\hat{\beta}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}[\beta, (X'X)^{-1}X'\Omega X(X'X)^{-1}], \tag{4.19}$$

and $V[\hat{\beta}_{\mathrm{OLS}}]$ is defined to be the variance matrix in (4.19).

We use both (4.18) and (4.19) to represent the asymptotic distribution in later
chapters. Their use is for convenience of presentation. Formal asymptotic results for
statistical inference are based on the limit distribution rather than the asymptotic distribution.

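For implementation, the matrices $M_{xx}$ and $M_{x\Omega x}$ in (4.17) or (4.18) are replaced by
consistent estimates $\hat{M}_{xx}$ and $\hat{M}_{x\Omega x}$. Then the estimated asymptotic variance matrix
of $\hat{\beta}_{\mathrm{OLS}}$ is

$$\hat{V}[\hat{\beta}_{\mathrm{OLS}}] = N^{-1}\hat{M}_{xx}^{-1}\hat{M}_{x\Omega x}\hat{M}_{xx}^{-1}. \tag{4.20}$$

This estimate is called a sandwich estimate, with $\hat{M}_{x\Omega x}$ sandwiched between $\hat{M}_{xx}^{-1}$
and $\hat{M}_{xx}^{-1}$.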


4.4.5.  Heteroskedasticity-Robust  Standard  Errors  for  OLS 

The obvious choice for $\hat{M}_{xx}$ in (4.20) is $N^{-1}X'X$. Estimation of $M_{x\Omega x}$ defined in (4.16)
depends on assumptions made about the error term.

In microeconometrics applications the model errors are often conditionally
heteroskedastic, with $V[u_i|x_i] = E[u_i^2|x_i] = \sigma_i^2$ varying over $i$. White (1980a) proposed
using $\hat{M}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2 x_i x_i'$. This estimate requires additional assumptions given in
Section 4.4.8.

Combining these estimates $\hat{M}_{xx}$ and $\hat{M}_{x\Omega x}$ and simplifying yields the estimated
asymptotic variance matrix

$$\begin{aligned}
\hat{V}[\hat{\beta}_{\mathrm{OLS}}] &= (X'X)^{-1}X'\hat{\Omega}X(X'X)^{-1} \\
&= \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1} \sum_{i=1}^{N} \hat{u}_i^2 x_i x_i' \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1},
\end{aligned} \tag{4.21}$$

where $\hat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$ and $\hat{u}_i = y_i - x_i'\hat{\beta}$ is the OLS residual. This estimate, due to
White (1980a), is called the heteroskedastic-consistent estimate of the asymptotic
variance matrix of the OLS estimator, and it leads to standard errors that are called
heteroskedasticity-robust standard errors, or even more simply robust standard
errors. It provides a consistent estimate of $V[\hat{\beta}_{\mathrm{OLS}}]$ even though $\hat{u}_i^2$ is not consistent
for $\sigma_i^2$.

In introductory courses the errors are restricted to be homoskedastic. Then $\Omega = \sigma^2 I$
so that $X'\Omega X = \sigma^2 X'X$ and hence $M_{x\Omega x} = \sigma^2 M_{xx}$. The limit distribution variance
matrix in (4.17) simplifies to $\sigma^2 M_{xx}^{-1}$, and many computer packages instead use what is
sometimes called the default OLS variance estimate

$$\hat{V}[\hat{\beta}_{\mathrm{OLS}}] = s^2(X'X)^{-1}, \tag{4.22}$$

where $s^2 = (N - K)^{-1}\sum_i \hat{u}_i^2$.

Inference based on (4.22) rather than (4.21) is invalid, unless errors are
homoskedastic and uncorrelated. In general the erroneous use of (4.22) when errors are
heteroskedastic, as is often the case for cross-section data, can lead to reported standard
errors that either overstate or understate the true standard errors.

In practice $\hat{M}_{x\Omega x}$ is calculated using division by $(N - K)$, rather than by $N$, to be
consistent with the similar division in forming $s^2$ in the homoskedastic case. Then
$\hat{V}[\hat{\beta}_{\mathrm{OLS}}]$ in (4.21) is multiplied by $N/(N - K)$. With heteroskedastic errors there is
no theoretical basis for this adjustment for degrees of freedom, but some simulation
studies provide support (see MacKinnon and White, 1985, and Long and Ervin, 2000).
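
The formulas (4.21) and (4.22), together with the degrees-of-freedom scaling, can be sketched in a few lines. This is a minimal illustration on simulated data rather than a packaged routine; the function name `ols_with_variances` and the data-generating process (error standard deviation proportional to $x$) are our own assumptions.

```python
import numpy as np

def ols_with_variances(X, y):
    """OLS estimates with the default variance (4.22) and White's estimate (4.21)."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (N - K)
    V_default = s2 * XtX_inv                       # (4.22)
    meat = X.T @ (resid[:, None] ** 2 * X)         # sum_i uhat_i^2 x_i x_i' = X' Omega_hat X
    V_white = XtX_inv @ meat @ XtX_inv             # (4.21)
    V_white_dof = V_white * N / (N - K)            # degrees-of-freedom scaling
    return b, V_default, V_white, V_white_dof

rng = np.random.default_rng(4)
N = 1_000
x = rng.uniform(1.0, 5.0, size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 0.5 * x + x * rng.normal(size=N)         # hypothetical dgp: error sd proportional to x

b, V_default, V_white, V_white_dof = ols_with_variances(X, y)
se_default = np.sqrt(np.diag(V_default))
se_robust = np.sqrt(np.diag(V_white_dof))
```

The sum form of the middle term avoids building the $N \times N$ matrix $\hat{\Omega}$ explicitly. In software these two versions of White's estimate are commonly labeled HC0 and HC1.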

Microeconometric  analysis  uses  robust  standard  errors  wherever  possible.  Here  the 
errors  are  robust  to  heteroskedasticity.  Guarding  against  other  misspecifications  may 
also  be  warranted.  In  particular,  when  data  are  clustered  the  standard  errors  should 
additionally  be  robust  to  clustering;  see  Sections  21.2.3  and  24.5. 


4.4.6.  Assumptions  for  Cross-Section  Regression 

Proposition 4.1 is a quite generic theorem that relies on assumptions about $N^{-1}X'X$
and $N^{-1/2}X'u$. In practice these assumptions are verified by application of laws of
large numbers and central limit theorems to averages of $x_i x_i'$ and $x_i u_i$. These in turn
require assumptions about how the observations $x_i$ and errors $u_i$ are generated, and
consequently how $y_i$ defined in (4.7) is generated. The assumptions are referred to
collectively as assumptions regarding the data-generating process (dgp). A simple
pedagogical example is given in Exercise 4.4.

Our objective at this stage is to make assumptions that are appropriate in many
applied settings where cross-section data are used. The assumptions are those in White
(1980a), and include three important departures from those in introductory treatments.
First, the regressors may be stochastic (Assumptions 1 and 3 that follow), so
assumptions on the error are made conditional on regressors. Second, the conditional variance
of the error may vary across observations (Assumption 5). Third, the errors are not
restricted to be normally distributed.

Here  are  the  assumptions: 

1. The data $(y_i, x_i)$ are independent and not identically distributed (inid) over $i$.

2. The model is correctly specified so that

$$y_i = x_i'\beta + u_i.$$

3. The regressor vector $x_i$ is possibly stochastic with finite second moment, additionally
$E[|x_{ij}x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$ for some $\delta > 0$, and the matrix $M_{xx}$ defined
in (4.15) exists and is a finite positive definite matrix of rank $K$. Also, $X$ has rank $K$ in
the sample being analyzed.

4. The errors have zero mean, conditional on regressors

$$E[u_i|x_i] = 0.$$

5. The errors are heteroskedastic, conditional on regressors, with

$$\sigma_i^2 = E[u_i^2|x_i], \qquad \Omega = E[uu'|X] = \mathrm{Diag}[\sigma_i^2], \tag{4.23}$$

where $\Omega$ is an $N \times N$ positive definite matrix. Also, for some $\delta > 0$, $E[|u_i^2|^{1+\delta}] < \infty$.

6. The matrix $M_{x\Omega x}$ defined in (4.16) exists and is a finite positive definite matrix of rank
$K$, where $M_{x\Omega x} = \mathrm{plim}\,N^{-1}\sum_i u_i^2 x_i x_i'$ given independence over $i$. Also, for some
$\delta > 0$, $E[|u_i^2 x_{ij}x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$.


4.4.7.  Remarks  on  Assumptions 

For  completeness  we  provide  a  detailed  discussion  of  each  assumption,  before  proving 
the  key  results  in  the  following  section. 


Stratified  Random  Sampling 

Assumption 1 is one that is often implicitly made for cross-section data. Here we make
it explicit. It restricts $(y_i, x_i)$ to be independent over $i$, but permits the distribution to
differ over $i$. Many microeconometrics data sets come from stratified random
sampling (see Section 3.2). Then the population is partitioned into strata and random draws
are made within strata, but some strata are oversampled with the consequence that the
sampled $(y_i, x_i)$ are inid rather than iid. If instead the data come from simple
random sampling then $(y_i, x_i)$ are iid, a stronger assumption that is a special case of inid.
Many introductory treatments assume that regressors are fixed in repeated samples.
Then $(y_i, x_i)$ are inid since only $y_i$ is random with a value that depends on the value of
$x_i$. The fixed regressors assumption is rarely appropriate for microeconometrics data,
which are usually observational data. It is used instead for experimental data, where $x$
is the treatment level.

These different assumptions on the distribution of $(y_i, x_i)$ affect the particular laws
of large numbers and central limit theorems used to obtain the asymptotic properties
of the OLS estimator. Note that even if $(y_i, x_i)$ are iid, $y_i$ given $x_i$ is not iid since, for
example, $E[y_i|x_i] = x_i'\beta$ varies with $x_i$.

Assumption 1 rules out most time-series data since they are dependent over
observations. It will also be violated if the sampling scheme involves clustering of
observations. The OLS estimator can still be consistent in these cases, provided Assumptions
2-4 hold, but usually it has a variance matrix different from that presented in this
chapter.


Correctly  Specified  Model 

Assumption 2 seems very obvious as it is an essential ingredient in the derivation of
the OLS estimator. It still needs to be made explicitly, however, since $\hat{\beta} = (X'X)^{-1}X'y$
is a function of $y$ and so its properties depend on $y$.

If Assumption 2 holds then it is being assumed that the regression model is linear in
$x$, rather than nonlinear, that there are no omitted variables in the regression, and that
there is no measurement error in the regressors, as the regressors $x$ used to calculate
$\hat{\beta}$ are the same regressors $x$ that are in the dgp. Also, the parameters $\beta$ are the same
across individuals, ruling out random parameter models.

If Assumption 2 fails then OLS can only be interpreted as an optimal linear
predictor; see Section 4.2.3.


Stochastic  Regressors 

Assumption  3  permits  regressors  to  be  stochastic  regressors,  as  is  usually  the  case 
when  survey  data  rather  than  experimental  data  are  used.  It  is  assumed  that  in  the  limit 
the  sample  second-moment  matrix  is  constant  and  nonsingular. 

If the regressors are iid, as is assumed under simple random sampling, then
$M_{xx} = E[xx']$ and Assumption 3 can be reduced to an assumption that the second
moment exists. If the regressors are stochastic but inid, as is the case for stratified
random sampling, then we need the stronger Assumption 3, which permits
application of the Markov LLN to obtain $\mathrm{plim}\,N^{-1}X'X$. If the regressors are fixed in repeated
samples, the common less-satisfactory assumption made in introductory courses, then
$M_{xx} = \lim N^{-1}X'X$ and Assumption 3 becomes the assumption that this limit exists.


Weakly  Exogenous  Regressors 

Assumption 4 of zero conditional mean errors is crucial because when combined
with Assumption 2 it implies that $E[y|X] = X\beta$, so that the conditional mean is indeed
$X\beta$.

The assumption that $E[u|x] = 0$ implies that $\mathrm{Cov}[x, u] = 0$, so that the error is
uncorrelated with regressors. This follows as $\mathrm{Cov}[x, u] = E[xu] - E[x]E[u]$ and $E[u|x] = 0$
implies $E[xu] = 0$ and $E[u] = 0$ by the law of iterated expectations. The weaker
assumption that $\mathrm{Cov}[x, u] = 0$ can be sufficient for consistency of OLS, whereas the
stronger assumption that $E[u|x] = 0$ is needed for unbiasedness of OLS.

The economic meaning of Assumption 4 is that the error term represents all the
excluded factors that are assumed to be uncorrelated with $X$ and these have, on
average, zero impact on $y$. This is a key assumption that was referred to in Section 2.3
as the weak exogeneity assumption. Essentially this means that the knowledge of the
data-generating process for $X$ variables does not contribute useful information for
estimating $\beta$. When the assumption fails, one or more of the $K$ regressor variables is
said to be jointly dependent with $y$, or simply endogenous. A general term for
correlation of regressors with errors is endogeneity or endogenous regressors, where
the term "endogenous" means caused by factors inside the system. As we will show
in Section 4.7, the violation of weak exogeneity may lead to inconsistent estimation.
There are many ways in which weak exogeneity can be violated, but one of the most
common involves a variable in $x$ that is a choice or a decision variable that is related
to $y$ in a larger model. Ignoring these other relationships, and treating $x_i$ as if it were
randomly assigned to observation $i$, and hence uncorrelated with $u_i$, will have
nontrivial consequences. Endogenous sampling is ruled out by Assumption 4. Instead,
if data are collected by stratified random sampling it must be exogenous stratified
sampling.
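
A small simulation can illustrate what failure of Assumption 4 does to OLS. The design below is hypothetical and is not the treatment of Section 4.7: an unobserved factor `v` enters both the regressor and the error, so that $\mathrm{Cov}[x, u] = 0.5$ and $V[x] = 2$, and the OLS slope then converges to $\beta + \mathrm{Cov}[x, u]/V[x] = 1.25$ rather than to $\beta = 1$.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
beta = 1.0                      # illustrative slope

z = rng.normal(size=N)          # exogenous source of variation in x
v = rng.normal(size=N)          # unobserved factor that drives both x and the error
e = rng.normal(size=N)

x = z + v
u_exog = e                      # error unrelated to x
u_endog = 0.5 * v + e           # error correlated with x: Cov[x, u] = 0.5

def slope(x, y):
    """OLS slope from a regression of y on an intercept and a scalar x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_exog = slope(x, beta * x + u_exog)
b_endog = slope(x, beta * x + u_endog)
```

Increasing `N` does not remove the gap between `b_endog` and $\beta$ in this design, which is what distinguishes inconsistency from ordinary sampling error.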


Conditionally  Heteroskedastic  Errors 

Independent regression errors uncorrelated with regressors are assumed, a
consequence of Assumptions 1, 2, and 4. Introductory courses usually further restrict
attention to errors that are homoskedastic with homogeneous or constant variances, in
which case $\sigma_i^2 = \sigma^2$ for all $i$. Then the errors are iid $(0, \sigma^2)$ and are called spherical
errors since $\Omega = \sigma^2 I$.

Assumption 5 is instead one of conditionally heteroskedastic regression errors,
where heteroskedastic means heterogeneous variances or different variances. The
assumption is stated in terms of the second moment $E[u^2|x]$, but this equals the
variance $V[u|x]$ since $E[u|x] = 0$ by Assumption 4. This more general assumption of
heteroskedastic errors is made because empirically this is often the case for cross-section
regression. Furthermore, relaxing the homoskedasticity assumption is not costly as it
is possible to obtain valid standard errors for the OLS estimator even if the functional
form for the heteroskedasticity is unknown.

The term conditionally heteroskedastic is used for the following reason. Even if
$(y_i, x_i)$ are iid, as is the case for simple random sampling, once we condition on $x_i$
the conditional mean and conditional variance can vary with $x_i$. Similarly, the errors
$u_i = y_i - x_i'\beta$ are iid under simple random sampling, and they are therefore
unconditionally homoskedastic. Once we condition on $x_i$, and consider the distribution of
$u_i$ conditional on $x_i$, the variance of this conditional distribution is permitted to vary
with $x_i$.
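
The distinction between unconditional and conditional variances can be seen in a short simulation. The design is an assumption made for illustration: $x_i$ is iid standard normal and $u_i = x_i e_i$ with $e_i$ iid standard normal, so that $V[u_i] = E[x_i^2] = 1$ for every $i$ while $V[u_i|x_i] = x_i^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100_000
x = rng.normal(size=N)
u = x * rng.normal(size=N)                  # V[u|x] = x^2, while V[u] = E[x^2] = 1

var_unconditional = u.var()                 # pooled over all observations
var_small_x = u[np.abs(x) < 0.5].var()      # errors for observations with x near zero
var_large_x = u[np.abs(x) > 1.5].var()      # errors for observations with large |x|
```

The pooled variance is close to one, yet the errors for observations with large $|x_i|$ are far more dispersed than those for observations with $x_i$ near zero.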


Limit Variance Matrix of $N^{-1/2}X'u$

Assumption 6 is needed to obtain the limit variance matrix of $N^{-1/2}X'u$. If regressors
are independent of the errors, a stronger assumption than that made in Assumption
4, then Assumption 5 that $E[|u_i^2|^{1+\delta}] < \infty$ and Assumption 3 that $E[|x_{ij}x_{ik}|^{1+\delta}] < \infty$
imply the Assumption 6 condition that $E[|u_i^2 x_{ij}x_{ik}|^{1+\delta}] < \infty$.

We have deliberately not made a seventh assumption, that the error $u$ is normally
distributed conditional on $X$. An assumption such as normality is needed to obtain the
exact small-sample distribution of the OLS estimator. However, we focus on
asymptotic methods throughout this book, because exact small-sample distributional results
are rarely available for the estimators used in microeconometrics, and then the
normality assumption is no longer needed.


4.4.8.  Derivations  for  the  OLS  Estimator 

Here we present both small-sample and limit distributions of the OLS estimator and
justify White's estimator of the variance matrix of the OLS estimator under
Assumptions 1-6.


Small-Sample Distribution

The parameter $\boldsymbol{\beta}$ is identified under Assumptions 1-4 since then
$\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}$ has rank $K$.

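In small samples the OLS estimator is unbiased under Assumptions 1-4 and its
variance matrix is easily obtained given Assumption 5. These results are obtained by using
the law of iterated expectations to first take expectation with respect to $\mathbf{u}$ conditional
on $\mathbf{X}$ and then take the unconditional expectation. Then from (4.11)

$$
\begin{aligned}
\mathrm{E}[\hat{\boldsymbol{\beta}}_{OLS}]
&= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X},\mathbf{u}}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\right] \\
&= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X}}\left[\mathrm{E}_{\mathbf{u}|\mathbf{X}}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\,|\,\mathbf{X}\right]\right] \\
&= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X}}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{E}_{\mathbf{u}|\mathbf{X}}[\mathbf{u}|\mathbf{X}]\right] \\
&= \boldsymbol{\beta},
\end{aligned}
\tag{4.24}
$$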

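using the law of iterated expectations (Theorem A.23) and given Assumptions 1 and
4, which together imply that $\mathrm{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$. Similarly, (4.11) yields

$$
\mathrm{V}[\hat{\boldsymbol{\beta}}_{OLS}] = \mathrm{E}_{\mathbf{X}}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right], \tag{4.25}
$$

given Assumption 5, where $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \boldsymbol{\Omega}$ and we use Theorem A.23, which tells us
that in general

$$
\mathrm{V}_{\mathbf{X},\mathbf{u}}[g(\mathbf{X},\mathbf{u})] = \mathrm{E}_{\mathbf{X}}\left[\mathrm{V}_{\mathbf{u}|\mathbf{X}}[g(\mathbf{X},\mathbf{u})]\right] + \mathrm{V}_{\mathbf{X}}\left[\mathrm{E}_{\mathbf{u}|\mathbf{X}}[g(\mathbf{X},\mathbf{u})]\right].
$$

This simplifies here as the second term is zero since
$\mathrm{E}_{\mathbf{u}|\mathbf{X}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}] = \mathbf{0}$.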

The OLS estimator is therefore unbiased if $\mathrm{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$. This valuable property
generally does not extend to nonlinear estimators. Most nonlinear estimators, such
as nonlinear least squares, are biased and even linear estimators such as instrumental
variables estimators can be biased. The OLS estimator is inefficient, as its variance
is not the smallest possible variance matrix among linear unbiased estimators, unless
$\boldsymbol{\Omega} = \sigma^2\mathbf{I}$. The inefficiency of OLS provides motivation for more efficient estimators
such as generalized least squares, though the efficiency loss of OLS is not necessarily
great. Under the additional assumption of normality of the errors conditional on $\mathbf{X}$, an
assumption not usually made in microeconometrics applications, the OLS estimator is
normally distributed conditional on $\mathbf{X}$.


Consistency 

The term $\mathrm{plim}\,(N^{-1}\mathbf{X}'\mathbf{X})^{-1} = \mathbf{M}_{\mathbf{xx}}^{-1}$ since
$\mathrm{plim}\, N^{-1}\mathbf{X}'\mathbf{X} = \mathbf{M}_{\mathbf{xx}}$ by Assumption 3.
Consistency then requires that condition (4.13) holds. This is established using a law
of large numbers applied to the average $N^{-1}\mathbf{X}'\mathbf{u} = N^{-1}\sum_i \mathbf{x}_i u_i$, which converges in
probability to zero if $\mathrm{E}[\mathbf{x}_i u_i] = \mathbf{0}$. Given Assumptions 1 and 2, the $\mathbf{x}_i u_i$ are inid and
Assumptions 1-5 permit use of the Markov LLN (Theorem A.9). If Assumption 1 is
simplified to $(y_i, \mathbf{x}_i)$ iid then $\mathbf{x}_i u_i$ are iid and Assumptions 1-4 permit simpler use of
the Kolmogorov LLN (Theorem A.8).


Limit  Distribution 

By Assumption 3, $\mathrm{plim}\,(N^{-1}\mathbf{X}'\mathbf{X})^{-1} = \mathbf{M}_{\mathbf{xx}}^{-1}$. The key is to obtain the limit
distribution of $N^{-1/2}\mathbf{X}'\mathbf{u} = N^{-1/2}\sum_i \mathbf{x}_i u_i$ by application of a central limit theorem. Given
Assumptions 1 and 2, the $\mathbf{x}_i u_i$ are inid and Assumptions 1-6 permit use of the
Liapounov CLT (Theorem A.15). If Assumption 1 is strengthened to $(y_i, \mathbf{x}_i)$ iid then $\mathbf{x}_i u_i$
are iid and Assumptions 1-5 permit simpler use of the Lindeberg-Levy CLT
(Theorem A.14).

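This yields

$$
\frac{1}{\sqrt{N}}\mathbf{X}'\mathbf{u} \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{M}_{\mathbf{x\Omega x}}], \tag{4.26}
$$

where $\mathbf{M}_{\mathbf{x\Omega x}} = \mathrm{plim}\, N^{-1}\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X} = \mathrm{plim}\, N^{-1}\sum_i u_i^2\mathbf{x}_i\mathbf{x}_i'$ given independence over $i$.
Application of a law of large numbers yields
$\mathbf{M}_{\mathbf{x\Omega x}} = \lim N^{-1}\sum_i \mathrm{E}_{\mathbf{x}_i}[\sigma_i^2\mathbf{x}_i\mathbf{x}_i']$, using
$\mathrm{E}_{u_i,\mathbf{x}_i}[u_i^2\mathbf{x}_i\mathbf{x}_i'] = \mathrm{E}_{\mathbf{x}_i}[\mathrm{E}[u_i^2|\mathbf{x}_i]\mathbf{x}_i\mathbf{x}_i']$ and
$\sigma_i^2 = \mathrm{E}[u_i^2|\mathbf{x}_i]$. It follows that
$\mathbf{M}_{\mathbf{x\Omega x}} = \lim N^{-1}\mathrm{E}[\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}]$, where $\boldsymbol{\Omega} = \mathrm{Diag}[\sigma_i^2]$ and the
expectation is with respect to only $\mathbf{X}$, rather than both $\mathbf{X}$ and $\mathbf{u}$.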

The presentation here assumes independence over $i$. More generally we can permit
correlated observations. Then
$\mathbf{M}_{\mathbf{x\Omega x}} = \mathrm{plim}\, N^{-1}\sum_i\sum_j u_i u_j\mathbf{x}_i\mathbf{x}_j'$ and $\boldsymbol{\Omega}$ has $ij$th entry
$\sigma_{ij} = \mathrm{Cov}[u_i, u_j]$. This complication is deferred to treatment of the nonlinear LS
estimator in Section 5.8.


Heteroskedasticity-Robust  Standard  Errors 

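We consider the key step of consistent estimation of $\mathbf{M}_{\mathbf{x\Omega x}}$. Beginning with the original
definition of $\mathbf{M}_{\mathbf{x\Omega x}} = \mathrm{plim}\, N^{-1}\sum_i u_i^2\mathbf{x}_i\mathbf{x}_i'$, we replace $u_i$ by
$\hat{u}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$, where asymptotically $\hat{u}_i \xrightarrow{p} u_i$ since
$\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$. This yields the consistent estimate

$$
\hat{\mathbf{M}}_{\mathbf{x\Omega x}} = \frac{1}{N}\sum_{i=1}^{N}\hat{u}_i^2\mathbf{x}_i\mathbf{x}_i' = N^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}, \tag{4.27}
$$

where $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$. The additional assumption that
$\mathrm{E}[|x_{ij}^2 x_{ik}x_{il}|^{1+\delta}] < \Delta$ for positive
constants $\delta$ and $\Delta$ and $j, k, l = 1, \ldots, K$ is needed, as
$\hat{u}_i^2\mathbf{x}_i\mathbf{x}_i' = (u_i - \mathbf{x}_i'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}))^2\mathbf{x}_i\mathbf{x}_i'$
involves up to the fourth power of $\mathbf{x}_i$ (see White (1980a)).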

Note that $\hat{\boldsymbol{\Omega}}$ does not converge to the $N \times N$ matrix $\boldsymbol{\Omega}$, a seemingly
impossible task without additional structure as there are $N$ variances $\sigma_i^2$ to be
estimated. But all that is needed is that $N^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}$ converges to the $K \times K$ matrix
$\mathrm{plim}\, N^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X} = \mathrm{plim}\, N^{-1}\sum_i \sigma_i^2\mathbf{x}_i\mathbf{x}_i'$. This is easier to achieve because the number
of regressors $K$ is fixed. To understand White's estimator, consider OLS estimation of
the intercept-only model $y_i = \beta + u_i$ with heteroskedastic error. Then in our notation
we can show that $\hat{\beta} = \bar{y}$, $M_{xx} = \lim N^{-1}\sum_i 1 = 1$, and
$M_{x\Omega x} = \lim N^{-1}\sum_i \mathrm{E}[u_i^2]$.

An obvious estimator for $M_{x\Omega x}$ is $\hat{M}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2$, where $\hat{u}_i = y_i - \hat{\beta}$. To obtain
the probability limit of this estimate, it is enough to consider $N^{-1}\sum_i u_i^2$, since
$\hat{u}_i - u_i \xrightarrow{p} 0$ given $\hat{\beta} \xrightarrow{p} \beta$. If a law of large numbers can be applied this average converges
to the limit of its expected value, so $\mathrm{plim}\, N^{-1}\sum_i u_i^2 = \lim N^{-1}\sum_i \mathrm{E}[u_i^2] = M_{x\Omega x}$ as
desired. Eicker (1967) gave the formal conditions for this example.
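The estimator just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `ols_with_white_se` and the four-observation sample are hypothetical, and the robust variance is the version without a degrees-of-freedom correction, $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ with $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$. In the intercept-only case it reduces to $\sum_i \hat{u}_i^2/N^2$.

```python
import numpy as np

def ols_with_white_se(X, y):
    """OLS coefficients with default and White (heteroskedasticity-robust) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Default variance s^2 (X'X)^{-1}, appropriate only under homoskedasticity.
    s2 = resid @ resid / (n - k)
    se_default = np.sqrt(np.diag(s2 * XtX_inv))
    # White's estimator: (X'X)^{-1} [sum_i uhat_i^2 x_i x_i'] (X'X)^{-1}.
    meat = (X * resid[:, None] ** 2).T @ X
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_default, se_robust

# Intercept-only model y_i = beta + u_i with a small hypothetical sample.
y = np.array([1.0, 2.0, 4.0, 9.0])
beta, se_default, se_robust = ols_with_white_se(np.ones((4, 1)), y)
```

Here `beta[0]` is the sample mean, and the robust variance equals the sum of squared residuals divided by $N^2$, matching $N^{-1}\hat{M}_{x\Omega x}$ with $M_{xx} = 1$.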


4.5.  Weighted  Least  Squares 

If robust standard errors need to be used, efficiency gains are usually possible. For
example, if heteroskedasticity is present then the feasible generalized least-squares
(GLS) estimator is more efficient than the OLS estimator.

In this section we present the feasible GLS estimator, an estimator that makes
stronger distributional assumptions about the variance of the error term. It is
nonetheless possible to obtain standard errors of the feasible GLS estimator that are robust to
misspecification of the error variance, just as in the OLS case.

Many studies in microeconometrics do not take advantage of the potential efficiency
gains of GLS, for reasons of convenience and because the efficiency gains may be felt
to be relatively small. Instead, it is common to use less efficient weighted least-squares
estimators, most notably OLS, with robust estimates of the standard errors.


4.5.1.  GLS  and  Feasible  GLS 

By  the  Gauss-Markov  theorem,  presented  in  introductory  texts,  the  OLS  estimator  is 
efficient  among  linear  unbiased  estimators  if  the  linear  regression  model  errors  are 
independent  and  homoskedastic. 

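Instead, we assume that the error variance matrix $\boldsymbol{\Omega} \neq \sigma^2\mathbf{I}$. If $\boldsymbol{\Omega}$ is known and
nonsingular, we can premultiply the linear regression model (4.8) by $\boldsymbol{\Omega}^{-1/2}$, where
$\boldsymbol{\Omega}^{1/2}\boldsymbol{\Omega}^{1/2} = \boldsymbol{\Omega}$, to yield

$$
\boldsymbol{\Omega}^{-1/2}\mathbf{y} = \boldsymbol{\Omega}^{-1/2}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\Omega}^{-1/2}\mathbf{u}.
$$

Some algebra yields
$\mathrm{V}[\boldsymbol{\Omega}^{-1/2}\mathbf{u}] = \mathrm{E}[(\boldsymbol{\Omega}^{-1/2}\mathbf{u})(\boldsymbol{\Omega}^{-1/2}\mathbf{u})'|\mathbf{X}] = \mathbf{I}$. The errors in this
transformed model are therefore zero mean, uncorrelated, and homoskedastic. So $\boldsymbol{\beta}$
can be efficiently estimated by OLS regression of $\boldsymbol{\Omega}^{-1/2}\mathbf{y}$ on $\boldsymbol{\Omega}^{-1/2}\mathbf{X}$.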

This  argument  yields  the  generalized  least-squares  estimator 

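$$
\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}. \tag{4.28}
$$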

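The GLS estimator cannot be directly implemented because in practice $\boldsymbol{\Omega}$ is not
known. Instead, we specify that $\boldsymbol{\Omega} = \boldsymbol{\Omega}(\boldsymbol{\gamma})$, where $\boldsymbol{\gamma}$ is a finite-dimensional parameter
vector, obtain a consistent estimate $\hat{\boldsymbol{\gamma}}$ of $\boldsymbol{\gamma}$, and form $\hat{\boldsymbol{\Omega}} = \boldsymbol{\Omega}(\hat{\boldsymbol{\gamma}})$. For example, if errors
are heteroskedastic then specify $\mathrm{V}[u|\mathbf{x}] = \exp(\mathbf{z}'\boldsymbol{\gamma})$, where $\mathbf{z}$ is a subset of $\mathbf{x}$ and the
exponential function is used to ensure a positive variance. Then $\boldsymbol{\gamma}$ can be consistently
estimated by nonlinear least-squares regression (see Section 5.8) of the squared OLS
residual $\hat{u}^2 = (y - \mathbf{x}'\hat{\boldsymbol{\beta}}_{OLS})^2$ on $\exp(\mathbf{z}'\boldsymbol{\gamma})$. This estimate $\hat{\boldsymbol{\Omega}}$ can be used in place of $\boldsymbol{\Omega}$
in (4.28). Note that we cannot replace $\boldsymbol{\Omega}$ in (4.28) by $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$ as this yields an
inconsistent estimator (see Section 5.8.6).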

The  feasible  generalized  least-squares  (FGLS)  estimator  is 

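$$
\hat{\boldsymbol{\beta}}_{FGLS} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}. \tag{4.29}
$$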


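If Assumptions 1-6 hold and $\boldsymbol{\Omega}(\boldsymbol{\gamma})$ is correctly specified, a strong assumption that is
relaxed in the following, and $\hat{\boldsymbol{\gamma}}$ is consistent for $\boldsymbol{\gamma}$, it can be shown that

$$
\sqrt{N}(\hat{\boldsymbol{\beta}}_{FGLS} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \left(\mathrm{plim}\, N^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X}\right)^{-1}\right]. \tag{4.30}
$$

The FGLS estimator has the same limiting variance matrix as the GLS estimator and
so is second-moment efficient. For implementation replace $\boldsymbol{\Omega}$ by $\hat{\boldsymbol{\Omega}}$ in (4.30).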

It can be shown that the GLS estimator minimizes $\mathbf{u}'\boldsymbol{\Omega}^{-1}\mathbf{u}$, see Exercise 4.5, which
simplifies to $\sum_i u_i^2/\sigma_i^2$ if errors are heteroskedastic but uncorrelated. The motivation
provided for GLS was efficient estimation of $\boldsymbol{\beta}$. In terms of the Section 4.2 discussion
of loss function and optimal prediction, with heteroskedastic errors the loss function is
$L(e) = e^2/\sigma^2$. Compared to OLS with $L(e) = e^2$, the GLS loss function places a
relatively smaller penalty on the prediction error for observations with large conditional
error variance.
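For a known diagonal error variance matrix, the equivalence between the closed form (4.28) and OLS on the transformed data can be checked numerically. The sketch below is illustrative only: the helper names, the five data points, and the choice $\sigma_i^2 = x_i^2$ are assumptions made for the example, not taken from the text.

```python
import numpy as np

def gls(X, y, omega):
    """GLS estimate (X' Omega^{-1} X)^{-1} X' Omega^{-1} y for a known Omega."""
    omega_inv = np.linalg.inv(omega)
    return np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

def gls_by_transformation(X, y, sigma2):
    """Same estimate for Omega = Diag[sigma2_i]: OLS after dividing each row by sigma_i."""
    w = 1.0 / np.sqrt(sigma2)
    Xs, ys = X * w[:, None], y * w
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Hypothetical data with an intercept and one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.5])
sigma2 = x ** 2                      # assumed known error variances

b_direct = gls(X, y, np.diag(sigma2))
b_transf = gls_by_transformation(X, y, sigma2)
# First-order condition of the GLS objective u' Omega^{-1} u.
foc = X.T @ ((y - X @ b_direct) / sigma2)
```

Both routes give the same coefficients, and `foc` is numerically zero, which is the first-order condition for minimizing $\sum_i u_i^2/\sigma_i^2$.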


4.5.2.  Weighted  Least  Squares 

The result in (4.30) assumes correct specification of the error variance matrix $\boldsymbol{\Omega}(\boldsymbol{\gamma})$.
If instead $\boldsymbol{\Omega}(\boldsymbol{\gamma})$ is misspecified then the FGLS estimator is still consistent, but (4.30)
gives the wrong variance. Fortunately, a robust estimate of the variance of the GLS
estimator can be found even if $\boldsymbol{\Omega}(\boldsymbol{\gamma})$ is misspecified.

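Specifically, define $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}(\boldsymbol{\gamma})$ to be a working variance matrix that does not
necessarily equal the true variance matrix $\boldsymbol{\Omega} = \mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}]$. Form an estimate
$\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}(\hat{\boldsymbol{\gamma}})$, where $\hat{\boldsymbol{\gamma}}$ is an estimate of $\boldsymbol{\gamma}$. Then use weighted least squares with
weighting matrix $\hat{\boldsymbol{\Sigma}}^{-1}$.


Table 4.2. Least-Squares Estimators and Their Asymptotic Variance

Estimator(a)   Definition                                                                                                                   Estimated Asymptotic Variance
OLS            $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$                                               $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$
FGLS           $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}$   $(\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}$
WLS            $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y}$   $(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}$

(a) Estimators are for linear regression model with error conditional variance matrix $\boldsymbol{\Omega}$. For FGLS it is
assumed that $\hat{\boldsymbol{\Omega}}$ is consistent for $\boldsymbol{\Omega}$. For OLS and WLS the heteroskedastic robust variance matrix of $\hat{\boldsymbol{\beta}}$
uses $\hat{\boldsymbol{\Omega}}$ equal to a diagonal matrix with squared residuals on the diagonals.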

This yields the weighted least-squares (WLS) estimator

$$
\hat{\boldsymbol{\beta}}_{WLS} = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y}. \tag{4.31}
$$

Statistical inference is then done without the assumption that $\boldsymbol{\Sigma} = \boldsymbol{\Omega}$, the true variance
matrix of the error term. In the statistics literature this approach is referred to as a
working matrix approach. We call it weighted least squares, but be aware that others
instead use weighted least squares to mean GLS or FGLS in the special case that $\boldsymbol{\Omega}^{-1}$
is diagonal. Here there is no presumption that the weighting matrix $\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Omega}^{-1}$.

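Similar algebra to that for OLS given in Section 4.4.5 yields the estimated
asymptotic variance matrix

$$
\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{WLS}] = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}, \tag{4.32}
$$

where $\hat{\boldsymbol{\Omega}}$ is such that

$$
\mathrm{plim}\, N^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X} = \mathrm{plim}\, N^{-1}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\Omega}\boldsymbol{\Sigma}^{-1}\mathbf{X}.
$$

In the heteroskedastic case $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^{*2}]$, where $\hat{u}_i^* = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{WLS}$.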

For heteroskedastic errors the basic approach is to choose a simple model for
heteroskedasticity such as error variance depending on only one or two key regressors. For
example, in a linear regression model of the level of wages as a function of schooling
and other variables, the heteroskedasticity might be modeled as a function of
schooling alone. Suppose this model yields $\hat{\boldsymbol{\Sigma}} = \mathrm{Diag}[\hat{\sigma}_i^2]$. Then OLS regression of $y_i/\hat{\sigma}_i$ on
$\mathbf{x}_i/\hat{\sigma}_i$ (with the no-constant option) yields $\hat{\boldsymbol{\beta}}_{WLS}$ and the White robust standard errors
from this regression can be shown to equal those based on (4.32).
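The claimed equality between (4.32) and White's variance from the transformed regression can be verified numerically. The following is a sketch under stated assumptions: the function names and the six observations are hypothetical, and the working variance is taken to be proportional to the regressor, which need not be the true error variance.

```python
import numpy as np

def wls_robust(X, y, sigma2_work):
    """WLS with working variances sigma2_work and the sandwich variance of (4.32)."""
    S_inv = np.diag(1.0 / sigma2_work)              # inverse working matrix
    bread = np.linalg.inv(X.T @ S_inv @ X)
    beta = bread @ X.T @ S_inv @ y
    resid = y - X @ beta                            # WLS residuals
    omega_hat = np.diag(resid ** 2)                 # Diag of squared WLS residuals
    V = bread @ X.T @ S_inv @ omega_hat @ S_inv @ X @ bread
    return beta, V

def white_on_transformed(X, y, sigma2_work):
    """OLS of y_i/sigma_i on x_i/sigma_i (no extra constant) with White's variance."""
    w = 1.0 / np.sqrt(sigma2_work)
    Xs, ys = X * w[:, None], y * w
    XtX_inv = np.linalg.inv(Xs.T @ Xs)
    beta = XtX_inv @ Xs.T @ ys
    e = ys - Xs @ beta
    V = XtX_inv @ (Xs * e[:, None] ** 2).T @ Xs @ XtX_inv
    return beta, V

# Hypothetical data; the working model sets the error variance proportional to x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.3, 2.8, 4.4, 4.6, 6.9, 6.2])
b_wls, V_wls = wls_robust(X, y, x)
b_ols, V_ols = white_on_transformed(X, y, x)
```

The two coefficient vectors and the two variance matrices agree, because the diagonal matrices involved commute.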

The  weighted  least-squares  or  working  matrix  approach  is  especially  convenient 
when  there  is  more  than  one  complication.  For  example,  in  the  random  effects  panel 
data  model  of  Chapter  21  the  errors  may  be  viewed  as  both  correlated  over  time  for  a 
given  individual  and  heteroskedastic.  One  may  use  the  random  effects  estimator,  which 
controls  only  for  the  first  complication,  but  then  compute  heteroskedastic-consistent 
standard  errors  for  this  estimator. 

The various least-squares estimators are summarized in Table 4.2.


Table 4.3. Least Squares: Example with Conditionally Heteroskedastic Errors(a)

              OLS        WLS        GLS
Constant      2.213      1.060      0.996
             (0.823)    (0.150)    (0.007)
             [0.820]    [0.051]    [0.006]
x             0.979      0.957      0.952
             (0.178)    (0.190)    (0.209)
             [0.275]    [0.232]    [0.208]
R^2           0.236      0.205      0.174

(a) Generated data for sample size of 100. OLS, WLS, and GLS
are all consistent but OLS and WLS are inefficient. Two different
standard errors are given: default standard errors assuming
homoskedastic errors in parentheses and heteroskedastic-robust
standard errors in square brackets. The data-generating process
is given in the text.


4.5.3.  Robust  Standard  Errors  for  LS  Example 

As an example of robust standard error estimation, consider estimation of the standard
error of least-squares estimates of the slope coefficient for a dgp with multiplicative
heteroskedasticity

$$
\begin{aligned}
y &= 1 + 1 \times x + u, \\
u &= x\varepsilon,
\end{aligned}
$$

where the scalar regressor $x \sim \mathcal{N}[0, 25]$ and $\varepsilon \sim \mathcal{N}[0, 4]$.

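The errors are conditionally heteroskedastic, since
$\mathrm{V}[u|x] = \mathrm{V}[x\varepsilon|x] = x^2\mathrm{V}[\varepsilon|x] = 4x^2$, which depends on the regressor $x$. This differs from the unconditional
variance, where
$\mathrm{V}[u] = \mathrm{V}[x\varepsilon] = \mathrm{E}[(x\varepsilon)^2] - (\mathrm{E}[x\varepsilon])^2 = \mathrm{E}[x^2]\mathrm{E}[\varepsilon^2] = \mathrm{V}[x]\mathrm{V}[\varepsilon] = 100$,
given $x$ and $\varepsilon$ independent and the particular dgp here.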

Standard errors for the OLS estimator should be calculated using the
heteroskedastic-consistent or robust variance estimate (4.21). Since OLS is not fully
efficient, WLS may provide efficiency gains. GLS will definitely provide efficiency
gains and in this simulated data example we have the advantage of knowing that
$\mathrm{V}[u|x] = 4x^2$. All estimation methods yield a consistent estimate of the intercept and
slope coefficients.

Various  least-squares  estimates  and  associated  standard  errors  from  a  generated  data 
sample  of  size  100  are  given  in  Table  4.3.  We  focus  on  the  slope  coefficient. 

The OLS slope coefficient estimate is 0.979. Two standard error estimates are
reported, with the correct heteroskedasticity-robust standard error of 0.275 using (4.21)
much larger here than the incorrect estimate of 0.177 that uses $s^2(\mathbf{X}'\mathbf{X})^{-1}$. Such a large
difference in standard error estimates could lead to quite different conclusions in
statistical inference. In general the bias in the standard errors could be in either
direction. For this example it can be shown theoretically that, in the limit, the robust
standard errors are $\sqrt{3}$ times larger than the incorrect one. Specifically, for this dgp
and for sample size $N$ the correct and incorrect standard errors of the OLS estimate of
the slope coefficient converge to, respectively, $\sqrt{12/N}$ and $\sqrt{4/N}$.
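The limiting values $\sqrt{12/N}$ and $\sqrt{4/N}$ can be checked by simulating the dgp with a large sample. This is a rough sketch, not the code behind Table 4.3: the function name, the seed, and the sample size of 200,000 are arbitrary choices made here, and the agreement with theory is only approximate in any finite sample.

```python
import numpy as np

def simulate_slope_se(n, seed=0):
    """Simulate y = 1 + x + x*eps and return the OLS slope with both standard errors."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 5.0, size=n)        # V[x] = 25
    eps = rng.normal(0.0, 2.0, size=n)      # V[eps] = 4
    y = 1.0 + 1.0 * x + x * eps             # so V[u|x] = 4 x^2
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    # Default standard error based on s^2 (X'X)^{-1}.
    se_default = np.sqrt((u @ u / (n - 2)) * XtX_inv[1, 1])
    # Heteroskedasticity-robust standard error (White).
    meat = (X * u[:, None] ** 2).T @ X
    se_robust = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])
    return beta[1], se_default, se_robust

n = 200_000
slope, se_default, se_robust = simulate_slope_se(n)
ratio = se_robust / se_default              # theory: sqrt(12/N) / sqrt(4/N) = sqrt(3)
```

With a sample this large the ratio should be close to $\sqrt{3} \approx 1.73$, whereas in the sample of 100 used in Table 4.3 it is 0.275/0.178, or about 1.5.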

As an example of the WLS estimator, assume that $u = \sqrt{|x|}\,\varepsilon$ rather than $u = x\varepsilon$,
so that it is assumed that $\mathrm{V}[u] = \sigma^2|x|$. The WLS estimator can be computed by OLS
regression after dividing $y$, the intercept, and $x$ by $\sqrt{|x|}$. Since this is the wrong model
for the heteroskedastic error the correct standard error for the slope coefficient is the
robust estimate of 0.232, computed using (4.32).

The GLS estimator for this dgp can be computed by OLS regression after dividing
$y$, the intercept, and $x$ by $|x|$, since the transformed error is then homoskedastic. The
usual and robust standard errors for the slope coefficient are similar (0.209 and 0.208).
This is expected as both are asymptotically correct because the GLS estimator here
uses the correct model for heteroskedasticity. It can be shown theoretically that for this
dgp the standard error of the GLS estimate of the slope coefficient converges to $\sqrt{4/N}$.

Both  OLS  and  WLS  are  less  efficient  than  GLS,  as  expected,  with  standard  errors 
for  the  slope  coefficient  of,  respectively,  0.275  >  0.232  >  0.208. 

The setup in this example is a standard one used in estimation theory for
cross-section data. Both $y$ and $x$ are stochastic random variables. The pairs $(y_i, x_i)$ are
independent over $i$ and identically distributed, as is the case under random sampling. The
conditional distribution of $y_i|x_i$ differs over $i$, however, since the conditional mean and
variance of $y_i$ depend on $x_i$.


4.6. Median and Quantile Regression

In  an  intercept-only  model,  summary  statistics  for  the  sample  distribution  include 
quantiles,  such  as  the  median,  lower  and  upper  quartiles,  and  percentiles,  in  addition 
to  the  sample  mean. 

In the regression context we might similarly be interested in conditional quantiles.
For example, interest may lie in how the percentiles of the earnings distribution for
lowly educated workers are much more compressed than those for highly educated
workers. In this simple example one can just do separate computations for lowly
educated workers and for highly educated workers. However, this approach becomes
infeasible if there are several regressors taking several values. Instead, quantile
regression methods are needed to estimate the quantiles of the conditional distribution of $y$
given $\mathbf{x}$.

From Table 4.1, quantile regression corresponds to use of asymmetric absolute loss,
whereas the special case of median regression uses absolute error loss. These methods
provide an alternative to OLS, which uses squared error loss.

Quantile regression methods have advantages beyond providing a richer
characterization of the data. Median regression is more robust to outliers than least-squares
regression. Moreover, quantile regression estimators can be consistent under weaker
stochastic assumptions than possible with least-squares estimation. Leading examples
are the maximum score estimator of Manski (1975) for binary outcome models (see
Section 14.6) and the censored least absolute deviations estimator of Powell (1984) for
censored models (see Section 16.6).

We begin with a brief explanation of population quantiles before turning to
estimation of sample quantiles.


4.6.1.  Population  Quantiles 

For a continuous random variable $y$, the population $q$th quantile is that value $\mu_q$ such
that $y$ is less than or equal to $\mu_q$ with probability $q$. Thus

$$
q = \Pr[y \leq \mu_q] = F_y(\mu_q),
$$

where $F_y$ is the cumulative distribution function (cdf) of $y$. For example, if $\mu_{0.75} = 3$
then the probability that $y \leq 3$ equals 0.75. It follows that

$$
\mu_q = F_y^{-1}(q).
$$

Leading examples are the median, $q = 0.5$, the upper quartile, $q = 0.75$, and the lower
quartile, $q = 0.25$. For the standard normal distribution $\mu_{0.5} = 0.0$, $\mu_{0.95} = 1.645$, and
$\mu_{0.975} = 1.960$. The $100q$th percentile is the $q$th quantile.
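The standard normal quantiles quoted above can be reproduced with the inverse cdf in Python's standard library. This is a small illustrative check; the variable names are chosen here for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# mu_q = F^{-1}(q) for the quantiles mentioned in the text.
quantiles = {q: std_normal.inv_cdf(q) for q in (0.5, 0.95, 0.975)}

# Round trip: applying the cdf to mu_q should return q, since q = F(mu_q).
round_trip = {q: std_normal.cdf(mu_q) for q, mu_q in quantiles.items()}
```

Rounded to three decimals, the three quantiles are 0.000, 1.645, and 1.960.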

For the regression model, the population $q$th quantile of $y$ conditional on $\mathbf{x}$ is
that function $\mu_q(\mathbf{x})$ such that $y$ conditional on $\mathbf{x}$ is less than or equal to $\mu_q(\mathbf{x})$ with
probability $q$, where the probability is evaluated using the conditional distribution of
$y$ given $\mathbf{x}$. It follows that

$$
\mu_q(\mathbf{x}) = F_{y|\mathbf{x}}^{-1}(q), \tag{4.33}
$$

where $F_{y|\mathbf{x}}$ is the conditional cdf of $y$ given $\mathbf{x}$ and we have suppressed the role of the
parameters of this distribution.

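It is insightful to derive the quantile function $\mu_q(\mathbf{x})$ if the dgp is assumed to be the
linear model with multiplicative heteroskedasticity

$$
\begin{aligned}
y &= \mathbf{x}'\boldsymbol{\beta} + u, \\
u &= \mathbf{x}'\boldsymbol{\alpha} \times \varepsilon, \\
\varepsilon &\sim \text{iid}\ [0, \sigma^2],
\end{aligned}
$$

where it is assumed that $\mathbf{x}'\boldsymbol{\alpha} > 0$. Then the population $q$th quantile of $y$ conditional
on $\mathbf{x}$ is that function $\mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha})$ such that

$$
\begin{aligned}
q &= \Pr[y \leq \mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha})] \\
&= \Pr[u \leq \mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha}) - \mathbf{x}'\boldsymbol{\beta}] \\
&= \Pr[\varepsilon \leq [\mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha}) - \mathbf{x}'\boldsymbol{\beta}]/\mathbf{x}'\boldsymbol{\alpha}] \\
&= F_{\varepsilon}([\mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha}) - \mathbf{x}'\boldsymbol{\beta}]/\mathbf{x}'\boldsymbol{\alpha}),
\end{aligned}
$$

where we use $u = y - \mathbf{x}'\boldsymbol{\beta}$ and $\varepsilon = u/\mathbf{x}'\boldsymbol{\alpha}$, and $F_{\varepsilon}$ is the cdf of $\varepsilon$. It follows that
$[\mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha}) - \mathbf{x}'\boldsymbol{\beta}]/\mathbf{x}'\boldsymbol{\alpha} = F_{\varepsilon}^{-1}(q)$ so that

$$
\begin{aligned}
\mu_q(\mathbf{x}, \boldsymbol{\beta}, \boldsymbol{\alpha}) &= \mathbf{x}'\boldsymbol{\beta} + \mathbf{x}'\boldsymbol{\alpha} \times F_{\varepsilon}^{-1}(q) \\
&= \mathbf{x}'\left(\boldsymbol{\beta} + \boldsymbol{\alpha} \times F_{\varepsilon}^{-1}(q)\right).
\end{aligned}
$$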




Thus for the linear model with multiplicative heteroskedasticity of the form
$u = \mathbf{x}'\boldsymbol{\alpha} \times \varepsilon$ the conditional quantiles are linear in $\mathbf{x}$. In the special case of homoskedasticity, $\mathbf{x}'\boldsymbol{\alpha}$
equals a constant and all conditional quantiles have the same slope and differ only in
their intercept, which becomes larger as $q$ increases.
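The linear quantile function derived above is easy to evaluate once a distribution for the scaled error is chosen. The sketch below assumes, purely for illustration, that the scaled error is normal with mean zero (the text only requires it to be iid); the function name and the parameter values are hypothetical.

```python
from statistics import NormalDist

def conditional_quantile(x, beta, alpha, q, sigma=1.0):
    """mu_q(x) = x'beta + x'alpha * F_eps^{-1}(q), assuming eps ~ N(0, sigma^2)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    xa = sum(xi * ai for xi, ai in zip(x, alpha))
    if xa <= 0:
        raise ValueError("the derivation assumes x'alpha > 0")
    return xb + xa * NormalDist(0.0, sigma).inv_cdf(q)

# Hypothetical values: intercept plus one regressor evaluated at x = 2.
x = (1.0, 2.0)
beta = (1.0, 1.0)        # x'beta = 3
alpha = (0.5, 0.5)       # x'alpha = 1.5
median = conditional_quantile(x, beta, alpha, 0.5)
upper = conditional_quantile(x, beta, alpha, 0.9)

# Check: y | x ~ N(x'beta, (x'alpha * sigma)^2), so Pr[y <= mu_q | x] should equal q.
prob_below_upper = NormalDist(3.0, 1.5).cdf(upper)
```

With these values the slope of the $q$th quantile line is $0.5 + 0.5 \times F_{\varepsilon}^{-1}(q)$ away from the mean slope contribution, so the quantile lines fan out as $x$ grows, which is the pattern heteroskedasticity produces.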

In more general examples the quantile function may be nonlinear in $\mathbf{x}$, owing to
other forms of heteroskedasticity such as $u = h(\mathbf{x}, \boldsymbol{\alpha}) \times \varepsilon$, where $h(\cdot)$ is nonlinear in
$\mathbf{x}$, or because the regression function itself is of nonlinear form $g(\mathbf{x}, \boldsymbol{\beta})$. It is standard
to still estimate quantile functions that are linear and interpret them as the best
linear predictor under the quantile regression loss function given in (4.34) in the next
section.


4.6.2.  Sample  Quantiles 

For univariate random variable $y$ the usual way to obtain the sample quantile estimate
is to first order the sample. Then $\hat{\mu}_q$ equals the $\lceil Nq \rceil$th smallest value, where $N$ is the
sample size and $\lceil Nq \rceil$ denotes $Nq$ rounded up to the nearest integer. For example, if
$N = 97$, the lower quartile is the 25th observation since $\lceil 97 \times 0.25 \rceil = \lceil 24.25 \rceil = 25$.
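The ordering rule translates directly into code. This is a minimal sketch; the function name is hypothetical, and the small tolerance subtracted before rounding up is an added safeguard against floating-point products such as `100 * 0.07` landing just above an integer.

```python
import math

def sample_quantile(values, q):
    """Return the ceil(N*q)-th smallest value (1-based rank) of the sample."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(values)
    # Subtract a tiny tolerance so that an exact integer N*q is not rounded up by float error.
    rank = math.ceil(len(ordered) * q - 1e-12)
    return ordered[rank - 1]

sample = list(range(1, 98))                 # N = 97 observations: 1, 2, ..., 97
lower_quartile = sample_quantile(sample, 0.25)
median = sample_quantile(sample, 0.5)
```

Because the sample is 1, 2, ..., 97, the returned value equals its rank, so the lower quartile is the 25th observation, as in the text.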

Koenker and Bassett (1978) observed that the sample $q$th quantile can
equivalently be expressed as the solution to the optimization problem of minimizing with
respect to $\beta$

$$
\sum_{i: y_i \geq \beta}^{N} q|y_i - \beta| + \sum_{i: y_i < \beta}^{N} (1 - q)|y_i - \beta|.
$$

This result is not obvious. To gain some understanding, consider the median, where
$q = 0.5$. Then the median is the value of $\beta$ that minimizes $\sum_i |y_i - \beta|$. Suppose in a sample
of 99 observations that the 50th smallest observation, the median, equals 10 and
the 51st smallest observation equals 12. If we let $\beta$ equal 12 rather than 10, then
$\sum_i |y_i - \beta|$ will increase by 2 for the first 50 ordered observations and decrease by
2 for the remaining 49 observations, leading to an overall net increase of
$50 \times 2 - 49 \times 2 = 2$. So the 51st smallest observation is a worse choice than the 50th.
Similarly the 49th smallest observation can be shown to be a worse choice than the 50th
observation.
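The arithmetic in this example can be replayed on a constructed sample. The sketch below is illustrative: the function name and the particular 99 values are assumptions, chosen only so that the 50th smallest observation is 10 and the 51st is 12.

```python
def check_loss(y, beta, q):
    """Koenker-Bassett objective: q*|y_i - beta| if y_i >= beta, else (1 - q)*|y_i - beta|."""
    return sum(q * (yi - beta) if yi >= beta else (1 - q) * (beta - yi) for yi in y)

# 99 ordered observations: 49 values below 10, then 10 and 12, then 48 values above 12.
sample = ([float(v) for v in range(-39, 10)]
          + [10.0, 12.0]
          + [float(v) for v in range(13, 61)])

# At q = 0.5 the objective is half the sum of absolute deviations, hence the factor 2.
increase = 2 * (check_loss(sample, 12.0, 0.5) - check_loss(sample, 10.0, 0.5))

# Searching over the observed values recovers the 50th smallest observation.
best = min(sample, key=lambda b: check_loss(sample, b, 0.5))
```

Moving from 10 to 12 raises the sum of absolute deviations by exactly 2, and the minimizing observation is the median.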

This objective function is then readily expanded to the linear regression case, so
that the $q$th quantile regression estimator $\hat{\boldsymbol{\beta}}_q$ minimizes over $\boldsymbol{\beta}_q$

$$
Q_N(\boldsymbol{\beta}_q) = \sum_{i: y_i \geq \mathbf{x}_i'\boldsymbol{\beta}}^{N} q|y_i - \mathbf{x}_i'\boldsymbol{\beta}_q| + \sum_{i: y_i < \mathbf{x}_i'\boldsymbol{\beta}}^{N} (1 - q)|y_i - \mathbf{x}_i'\boldsymbol{\beta}_q|, \tag{4.34}
$$

where we use $\boldsymbol{\beta}_q$ rather than $\boldsymbol{\beta}$ to make clear that different choices of $q$ estimate
different values of $\boldsymbol{\beta}$. Note that this is the asymmetric absolute loss function given in
Table 4.1, where $\hat{y}$ is restricted to be linear in $\mathbf{x}$ so that $e = y - \mathbf{x}'\boldsymbol{\beta}_q$. The special case
$q = 0.5$ is called the median regression estimator or the least absolute deviations
estimator.


4.6.3. Properties of Quantile Regression Estimators

The objective function (4.34) is not differentiable and so the gradient optimization
methods presented in Chapter 10 are not applicable. Fortunately, linear programming
methods can be used and these provide relatively fast computation of $\hat{\boldsymbol{\beta}}_q$.
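A consequence of the linear programming formulation is that a minimizer can always be found among fits that pass exactly through $K$ observations. For an intercept and one regressor ($K = 2$) and a tiny sample, this allows an exact brute-force search over all pairs of points. The sketch below uses that enumeration in place of a linear programming solver, purely for illustration; the function names and data are hypothetical, and the cubic cost makes the approach unsuitable for realistic sample sizes.

```python
from itertools import combinations

def check_objective(x, y, b1, b2, q):
    """Objective (4.34) for the line b1 + b2*x."""
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b1 + b2 * xi)
        total += q * e if e >= 0 else (q - 1) * e
    return total

def quantile_line(x, y, q):
    """Minimize (4.34) for one regressor plus intercept by checking every line
    through two observations with distinct x values (O(N^3) work)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b2 = (y[j] - y[i]) / (x[j] - x[i])
        b1 = y[i] - b2 * x[i]
        value = check_objective(x, y, b1, b2, q)
        if best is None or value < best[0]:
            best = (value, b1, b2)
    return best[1], best[2]

# Four points lie exactly on y = 1 + 2x; the fifth is a gross outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]
b1_med, b2_med = quantile_line(x, y, 0.5)
```

The median regression line here is $y = 1 + 2x$, unaffected by the outlier, which illustrates the robustness of median regression relative to least squares.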

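Since there is no explicit solution for $\hat{\boldsymbol{\beta}}_q$ the asymptotic distribution of $\hat{\boldsymbol{\beta}}_q$ cannot
be obtained using the approach of Section 4.4 for OLS. The methods of Chapter 5 also
require adaptation, as the objective function is nondifferentiable. It can be shown that

$$
\sqrt{N}(\hat{\boldsymbol{\beta}}_q - \boldsymbol{\beta}_q) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}], \tag{4.35}
$$

(see, for example, Buchinsky, 1998, p. 85), where

$$
\begin{aligned}
\mathbf{A} &= \mathrm{plim}\, \frac{1}{N}\sum_{i=1}^{N} f_{u_q}(0|\mathbf{x}_i)\mathbf{x}_i\mathbf{x}_i', \\
\mathbf{B} &= \mathrm{plim}\, \frac{1}{N}\sum_{i=1}^{N} q(1 - q)\mathbf{x}_i\mathbf{x}_i',
\end{aligned}
\tag{4.36}
$$

and $f_{u_q}(0|\mathbf{x})$ is the conditional density of the error term $u_q = y - \mathbf{x}'\boldsymbol{\beta}_q$ evaluated
at $u_q = 0$. Estimation of the variance of $\hat{\boldsymbol{\beta}}_q$ is complicated by the need to estimate
$f_{u_q}(0|\mathbf{x})$. It is easier to instead obtain standard errors for $\hat{\boldsymbol{\beta}}_q$ using the bootstrap pairs
procedure of Chapter 11.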
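In the intercept-only case the formulas above reduce to $A = f(\mu_q)$ and $B = q(1 - q)$, so the asymptotic standard error of a sample quantile is $\sqrt{q(1-q)}/(f(\mu_q)\sqrt{N})$, and the bootstrap pairs procedure reduces to resampling the $y_i$ with replacement. The sketch below compares the two for the median of simulated standard normal data. The seeds, sample size, and number of replications are arbitrary choices made here, and the analytic value uses the true density, which would not be known in practice.

```python
import math
import random
from statistics import NormalDist, stdev

def sample_quantile(values, q):
    """ceil(N*q)-th smallest value, with a small tolerance against float error."""
    ordered = sorted(values)
    return ordered[math.ceil(len(ordered) * q - 1e-12) - 1]

def bootstrap_se(values, q, reps=400, seed=12345):
    """Standard deviation of the sample quantile across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(values)
    draws = [sample_quantile(rng.choices(values, k=n), q) for _ in range(reps)]
    return stdev(draws)

# Simulated standard normal data (illustrative only).
rng = random.Random(2024)
n, q = 400, 0.5
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
se_boot = bootstrap_se(y, q)

# Intercept-only case of (4.35)-(4.36): variance q(1 - q) / f(mu_q)^2, divided by N.
f_at_quantile = NormalDist().pdf(NormalDist().inv_cdf(q))
se_asym = math.sqrt(q * (1 - q)) / (f_at_quantile * math.sqrt(n))
```

The analytic value is about 0.063 for this design; the bootstrap estimate should be of the same order, although it varies with the seed and the number of replications.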


4.6.4.  Quantile  Regression  Example 

In this section we perform conditional quantile estimation and compare it with the
usual conditional mean estimation using OLS regression. The application involves
Engel curve estimation for household annual medical expenditure. More specifically, we
consider the regression relationship between the log of medical expenditure and the
log of total household expenditure. This regression yields an estimate of the (constant)
elasticity of medical expenditure with respect to total expenditure.

The data are from the World Bank's 1997 Vietnam Living Standards Survey. The
sample consists of 5,006 households that have positive level of medical expenditures,
after dropping 16.6% of the sample that has zero expenditures to permit taking the
natural logarithm. Zero values can be handled using the censored quantile regression
methods of Powell (1986a), presented in Section 16.9.2. For simplicity we simply
dropped observations with zero expenditures. The largest component of medical
expenditure, especially at low levels of income, consists of medications purchased from
pharmacies. Although several household characteristic variables are available, for
simplicity we only consider one regressor, the log of total household expenditure, to serve
as a proxy for household income.

The linear least-squares regression yields an elasticity estimate of 0.57. This
estimate would be usually interpreted to mean that medicines are a "necessity" and hence
their demand is income inelastic. This estimate is not very surprising, but before
accepting it at face value we should acknowledge that there may be considerable
heterogeneity in the elasticity across different income groups.


Figure 4.1: Slope Estimates as Quantile Varies. Quantile regression estimates of slope
coefficient for q = 0.05, 0.10, ..., 0.90, 0.95 and associated 95% confidence bands plotted
against q from regression of the natural logarithm of medical expenditure on the natural
logarithm of total expenditure.


Quantile regression is a useful tool for studying such heterogeneity, as emphasized
by Koenker and Hallock (2001). We minimize the quantity (4.34), where $y$ is log of
medical expenditure and $\mathbf{x}'\boldsymbol{\beta} = \beta_1 + \beta_2 x$, where $x$ is log of total household
expenditure. This is done for the nineteen quantile values $q = \{0.05, 0.10, \ldots, 0.95\}$, where
$q = 0.5$ is the median. In each case the standard errors were estimated using the
bootstrap method with 50 resamples. The results of this exercise are condensed into
Figures 4.1 and 4.2.

Figure 4.1 plots the slope coefficient $\hat{\beta}_{2,q}$ for the different values of $q$, along with
the associated 95% confidence interval. This shows how the quantile estimates of the
elasticity vary with quantile value $q$. The elasticity estimate increases systematically
with the level of household income, rising from 0.15 for $q = 0.05$ to a maximum of
0.80 for $q = 0.85$. The least-squares slope estimate of 0.57 is also presented as a
horizontal line that does not vary with quantile. The elasticity estimates at lower and higher
quantiles are clearly statistically significantly different from each other and from the
OLS estimate, which has standard error 0.032. It seems that the aggregate elasticity
estimate will vary according to changes in the underlying income distribution. This graph
supports the observation of Mosteller and Tukey (1977, p. 236), quoted by Koenker
and Hallock (2001), that by focusing only on the conditional mean function the
least-squares regression gives an incomplete summary of the joint distribution of dependent
and explanatory variables.

Figure  4.2  superimposes  three  estimated  quantile  regression  lines  'yg  =  ftl  + 
ft>2,qx  for  q  =0.1, 0.2, . . . ,  0.9  and  the  OLS  regression  line.  The  OLS  regression  line, 
not  graphed,  is  similar  to  the  median  (q  =  0.5)  regression  line.  There  is  a  fanning  out 
of  the  quantile  regression  lines  in  Figure  4.2.  This  is  not  surprising  given  the  increase 
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Regression  Lines  as  Quantile  Varies 
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Figure  4.2:  Quantile  regression  estimated  lines  for  q  =  0.1,  q=  0.5  and  q=  0.9  from  re¬ 
gression  of  natural  logarithm  of  medical  expenditure  on  natural  logarithm  of  total  expenditure. 
Data  for  5006  Vietnamese  households  with  positive  medical  expenditures  in  1997. 


in  estimated  slopes  as  q  increases  as  evident  in  Figure  4.1.  Koenker  and  Bassett  (1982) 
developed  quantile  regression  as  a  means  to  test  for  heteroskedastic  errors  when  the 
dgp  is  the  linear  model.  For  such  a  case  a  fanning  out  of  the  quantile  regression  lines 
is interpreted as evidence of heteroskedasticity. Another interpretation is that the conditional
mean is nonlinear in x with increasing slope and this leads to quantile slope
coefficients  that  increase  with  quantile  q. 

More  detailed  illustrations  of  quantile  regression  are  given  in  Buchinsky  (1994)  and 
Koenker  and  Hallock  (2001). 


4.7.  Model  Misspecification 

The  term  “model  misspecification”  in  its  broadest  sense  means  that  one  or  more  of  the 
assumptions  made  on  the  data  generating  process  are  incorrect.  Misspecifications  may 
occur  individually  or  in  combination,  but  analysis  is  simpler  if  only  the  consequences 
of  a  single  misspecification  are  considered. 

In the following discussion we emphasize misspecifications that lead to inconsistency
of the least-squares estimator and loss of identifiability of parameters of interest.
The least-squares estimator may nonetheless continue to have a meaningful interpretation,
only one different from that intended under the assumption of a correctly
specified model. Specifically, the estimator may converge asymptotically to a parameter
that differs from the true population value, a concept defined in Section 4.7.5 as
the pseudo-true value.

The  issues  raised  here  for  consistency  of  OLS  are  relevant  to  other  estimators  in 
other  models.  Consistency  can  then  require  stronger  assumptions  than  those  needed 
for consistency of OLS, so that inconsistency resulting from model misspecification is
more  likely. 


4.7.1.  Inconsistency  of  OLS 

The most serious consequence of a model misspecification is inconsistent estimation
of the regression parameters $\beta$. From Section 4.4, the two key conditions needed to
demonstrate consistency of the OLS estimator are (1) the dgp is $y = X\beta + u$ and (2)
the dgp is such that $\operatorname{plim} N^{-1}X'u = 0$. Then

$$
\begin{aligned}
\hat{\beta}_{OLS} &= \beta + (N^{-1}X'X)^{-1} N^{-1}X'u \\
&\xrightarrow{p} \beta,
\end{aligned}
\tag{4.37}
$$

where the first equality follows if $y = X\beta + u$ (see (4.12)) and the second line uses
$\operatorname{plim} N^{-1}X'u = 0$.

The  OLS  estimator  is  likely  to  be  inconsistent  if  model  misspecification  leads  to 
either specification of the wrong model for y, so that condition 1 is violated, or correlation
of regressors with the error, so that condition 2 is violated.


4.7.2.  Functional  Form  Misspecification 

A linear specification of the conditional mean function is merely an approximation in
$\mathbb{R}^K$ to the true unknown conditional mean function in parameter space of indeterminate
dimension.  Even  if  the  correct  regressors  are  chosen,  it  is  possible  that  the  conditional 
mean  is  incorrectly  specified. 

Suppose the dgp is one with a nonlinear regression function

$$y = g(x) + v,$$

where the dependence of g(x) on unknown parameters is suppressed, and assume
E[v|x] = 0. The linear regression model

$$y = x'\beta + u$$

is  erroneously  specified.  The  question  is  whether  the  OLS  estimator  can  be  given  any 
meaningful  interpretation,  even  though  the  dgp  is  in  fact  nonlinear. 

The usual way to interpret regression coefficients is through the true micro relationship,
which here is

$$E[y_i | x_i] = g(x_i).$$

In this case $\hat{\beta}_{OLS}$ does not measure the micro response of $E[y_i | x_i]$ to a change in $x_i$, as
it does not converge to $\partial g(x_i)/\partial x_i$. So the usual interpretation of $\hat{\beta}_{OLS}$ is not possible.

White (1980b) showed that the OLS estimator converges to that value of $\beta$ that
minimizes the mean-squared prediction error

$$E_x[(g(x) - x'\beta)^2].$$

Hence prediction from OLS is the best linear predictor of the nonlinear regression
function if the mean-squared error is used as the loss function. This useful property
has already been noted in Section 4.2.3, but it adds little in interpretation of $\hat{\beta}_{OLS}$.
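
The best linear predictor property can be checked numerically. The following is a minimal simulation sketch, not part of the original analysis. The dgp g(x) = x² with x uniform on (0, 1), the noise scale, the seed, and the sample size are all assumptions chosen so that the population minimizer of the mean-squared prediction error for a line with intercept is known in closed form (intercept −1/6, slope 1).

```python
import random

random.seed(1)
N = 200_000

# Assumed nonlinear dgp: y = g(x) + v with g(x) = x**2 and x ~ Uniform(0, 1).
xs = [random.random() for _ in range(N)]
ys = [x * x + random.gauss(0.0, 0.1) for x in xs]

# OLS of y on an intercept and x, written out for the two-parameter case.
x_bar = sum(xs) / N
y_bar = sum(ys) / N
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Population best linear predictor of x**2 on (1, x): intercept -1/6, slope 1.
print(f"OLS intercept {intercept:.3f}, slope {slope:.3f}")
```

The fitted line is close to the best linear approximation, yet the micro response dg/dx = 2x ranges from 0 to 2 over the support of x, so no single slope coefficient can describe it.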

In summary, if the true regression function is nonlinear, OLS is not useful for individual
prediction. OLS can still be useful for prediction of aggregate changes, giving
the sample average change in E[y|x] due to change in x (see Stoker, 1982). However,
microeconometric  analyses  usually  seek  models  that  are  meaningful  at  the  individual 
level. 

Much  of  this  book  presents  alternatives  to  the  linear  model  that  are  more  likely  to 
be  correctly  specified.  For  example,  Chapter  14  on  binary  outcomes  presents  model 
specifications  that  ensure  that  predicted  probabilities  are  restricted  to  lie  between  0 
and  1.  Also,  models  and  methods  that  rely  on  minimal  distributional  assumptions  are 
preferred  because  there  is  then  less  scope  for  misspecification. 


4.7.3.  Endogeneity 

Endogeneity  is  formally  defined  in  Section  2.3.  A  broad  definition  is  that  a  regressor 
is endogenous when it is correlated with the error term. If any one regressor is endogenous
then in general OLS estimates of all regression parameters are inconsistent
(unless  the  exogenous  regressor  is  uncorrelated  with  the  endogenous  regressor). 

Leading  examples  of  endogeneity,  dealt  with  extensively  in  this  book  in  both  linear 
and nonlinear model settings, include simultaneous equations bias (Section 2.4), omitted
variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measurement
error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section
observational  data  are  used,  and  economists  are  very  concerned  with  this  complication. 

A  quite  general  approach  to  control  for  endogeneity  is  the  instrumental  variables 
method,  presented  in  Sections  4.8  and  4.9  and  in  Sections  6.4  and  6.5.  This  method 
cannot  always  be  applied,  however,  as  necessary  instruments  may  not  be  available. 

Other methods to control for endogeneity, reviewed in Section 2.8, include control
for confounding variables, differences in differences if repeated cross-section or
panel  data  are  available  (see  Chapter  21),  fixed  effects  if  panel  data  are  available  and 
endogeneity  arises  owing  to  a  time-invariant  omitted  variable  (see  Section  21.6),  and 
regression-discontinuity  design  (see  Section  25.6). 


4.7.4.  Omitted  Variables 

Omission  of  a  variable  in  a  linear  regression  equation  is  often  the  first  example  of 
inconsistency  of  OLS  presented  in  introductory  courses.  Such  omission  may  be  the 
consequence  of  an  erroneous  exclusion  of  a  variable  for  which  data  are  available  or  of 
exclusion  of  a  variable  that  is  not  directly  observed.  For  example,  omission  of  ability  in 
a  regression  of  earnings  (or  more  usually  its  natural  logarithm)  on  schooling  is  usually 
due  to  unavailability  of  a  comprehensive  measure  of  ability. 

Let the true dgp be

$$y = x'\beta + z\alpha + v, \tag{4.38}$$

where x and z are regressors, with z a scalar regressor for simplicity, and v is an error
term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of
y on x and z will yield consistent parameter estimates of $\beta$ and $\alpha$.

Suppose instead that y is regressed on x alone, with z omitted owing to unavailability.
Then the term $z\alpha$ is moved into the error term. The estimated model is

$$y = x'\beta + (z\alpha + v), \tag{4.39}$$

where the error term is now $(z\alpha + v)$. As before v is uncorrelated with x, but if z is
correlated with x the error term $(z\alpha + v)$ will be correlated with the regressors x. The
OLS estimator will be inconsistent for $\beta$ if z is correlated with x.

There is enough structure in this example to determine the direction of the inconsistency.
Stacking all observations in an obvious manner gives the dgp $y = X\beta + z\alpha + v$.
Substituting this into $\hat{\beta}_{OLS} = (X'X)^{-1}X'y$ yields

$$\hat{\beta}_{OLS} = \beta + (N^{-1}X'X)^{-1}(N^{-1}X'z)\alpha + (N^{-1}X'X)^{-1}(N^{-1}X'v).$$

Under the usual assumption that X is uncorrelated with v, the final term has probability
limit zero. X is correlated with z, however, and

$$\operatorname{plim} \hat{\beta}_{OLS} = \beta + \delta\alpha, \tag{4.40}$$

where

$$\delta = \operatorname{plim} [(N^{-1}X'X)^{-1}(N^{-1}X'z)]$$

is the probability limit of the OLS estimator in regression of the omitted regressor (z)
on the included regressors (X).

This  inconsistency  is  called  omitted  variables  bias,  where  common  terminology 
states  that  various  misspecifications  lead  to  bias  even  though  formally  they  lead  to 
inconsistency. The inconsistency exists as long as $\delta \neq 0$, that is, as long as the omitted
variable  is  correlated  with  the  included  regressors.  In  general  the  inconsistency  could 
be  positive  or  negative  and  could  even  lead  to  a  sign  reversal  of  the  OLS  coefficient. 

For the returns to schooling example, the correlation between schooling and ability
is expected to be positive, so $\delta > 0$, and the return to ability is expected to be positive,
so $\alpha > 0$. It follows that $\delta\alpha > 0$, so the omitted variables bias is positive in this example.
OLS of earnings on schooling alone will overstate the effect of education on
earnings.
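
As a numerical check of (4.40), the following simulation sketch regresses y on x alone when the dgp also contains z. All numerical values are assumptions made for illustration (β = 1, α = 2, a population δ of 0.5, standard normal errors, the seed, and N); the helper name `slope` is ours. With these values the short regression should converge to β + δα = 2 rather than to β = 1.

```python
import random

random.seed(0)
N = 200_000
beta, alpha = 1.0, 2.0   # assumed true coefficients in y = x*beta + z*alpha + v

xs, zs, ys = [], [], []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    z = 0.5 * x + random.gauss(0.0, 1.0)   # omitted variable; population delta = 0.5
    v = random.gauss(0.0, 1.0)             # uncorrelated with x and z
    xs.append(x)
    zs.append(z)
    ys.append(beta * x + alpha * z + v)


def slope(a, b):
    """No-intercept OLS slope from regressing b on a: (a'a)^{-1} a'b."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)


b_short = slope(xs, ys)      # y on x alone, with z omitted
delta_hat = slope(xs, zs)    # omitted regressor z on included regressor x

print(f"short-regression slope: {b_short:.3f}")
print(f"beta + delta_hat*alpha: {beta + delta_hat * alpha:.3f}")
```

The two printed numbers differ only by the sampling noise in the term involving v, which vanishes as N grows.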

A related form of misspecification is inclusion of irrelevant regressors. For example,
the regression may be of y on x and z, even though the dgp is more simply
$y = x'\beta + v$. In this case it is straightforward to show that OLS is consistent, but there
is  a  loss  of  efficiency. 

Controlling  for  omitted  variables  bias  is  necessary  if  parameter  estimates  are  to  be 
given  a  causal  interpretation.  Since  too  many  regressors  cause  little  harm,  but  too  few 
regressors  can  lead  to  inconsistency,  microeconometric  models  estimated  from  large 
data  sets  tend  to  include  many  regressors.  If  omitted  variables  are  still  present  then  one 
of the methods given at the end of Section 4.7.3 is needed.


4.7.5. Pseudo-True Value

In the omitted variables example the least-squares estimator is subject to confounding
in the sense that it does not estimate $\beta$, but instead estimates a function of $\beta$, $\delta$, and $\alpha$.

The OLS estimate cannot be used as an estimate of $\beta$, which, for example, measures
the effect of an exogenous change in a regressor x such as schooling holding all other
regressors including ability constant.

From (4.40), however, $\hat{\beta}_{OLS}$ is a consistent estimator of the function $(\beta + \delta\alpha)$ and
has a meaningful interpretation. The probability limit of $\hat{\beta}_{OLS}$, $\beta^* = (\beta + \delta\alpha)$, is
referred to as the pseudo-true value corresponding to $\hat{\beta}_{OLS}$; see Section 5.7.1 for a
formal definition.

Furthermore, one can obtain the distribution of $\hat{\beta}_{OLS}$ even though it is inconsistent
for $\beta$. The estimated asymptotic variance of $\hat{\beta}_{OLS}$ measures dispersion around
$(\beta + \delta\alpha)$ and is given by the usual estimator, for example by $s^2(X'X)^{-1}$ if the error in
(4.38) is homoskedastic.


4.7.6.  Parameter  Heterogeneity 

The presentation to date has permitted regressors and error terms to vary across individuals
but has restricted the regression parameters $\beta$ to be the same across individuals.
Instead, suppose that the dgp is

$$y_i = x_i'\beta_i + u_i, \tag{4.41}$$

with subscript i on the parameters. This is an example of parameter heterogeneity,
where the marginal effect $\partial E[y_i | x_i]/\partial x_i = \beta_i$ is now permitted to differ across individuals.

The random coefficients model or random parameters model specifies $\beta_i$ to be
independently and identically distributed over i with distribution that does not depend
on the observables $x_i$. Let the common mean of $\beta_i$ be denoted $\beta$. The dgp can be
rewritten as

$$y_i = x_i'\beta + (u_i + x_i'(\beta_i - \beta)),$$

and enough assumptions have been made to ensure that the regressors $x_i$ are uncorrelated
with the error term $(u_i + x_i'(\beta_i - \beta))$. OLS regression of y on x will therefore
consistently estimate $\beta$, though note that the error is heteroskedastic even if $u_i$ is
homoskedastic.
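
Both claims can be illustrated with a short simulation sketch for a scalar regressor. The design is an assumption made for illustration: the individual slopes are drawn as N(1.5, 0.5²) independently of x, the error is standard normal, and the cutoffs 0.5 and 1.5 used to split the sample are arbitrary. Under this design the composite error has conditional variance 1 + 0.25x².

```python
import random

random.seed(2)
N = 200_000
beta_mean = 1.5   # assumed common mean of the individual slopes beta_i

xs, ys = [], []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    beta_i = random.gauss(beta_mean, 0.5)   # drawn independently of x
    u = random.gauss(0.0, 1.0)              # homoskedastic error
    xs.append(x)
    ys.append(x * beta_i + u)

# No-intercept OLS slope (x'x)^{-1} x'y.
b_ols = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Squared residuals estimate the variance of u_i + x_i*(beta_i - beta),
# which grows with x_i**2 even though u_i itself is homoskedastic.
resid_sq = [(y - b_ols * x) ** 2 for x, y in zip(xs, ys)]
small = [e for x, e in zip(xs, resid_sq) if abs(x) < 0.5]
large = [e for x, e in zip(xs, resid_sq) if abs(x) > 1.5]
var_small = sum(small) / len(small)
var_large = sum(large) / len(large)

print(f"OLS slope: {b_ols:.3f} (mean slope {beta_mean})")
print(f"mean squared residual, |x| < 0.5: {var_small:.2f}; |x| > 1.5: {var_large:.2f}")
```

The slope estimate is close to the mean of the individual slopes, while the residual variance is visibly larger for observations with large |x|, so heteroskedasticity-robust standard errors would be the natural choice here.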

For  panel  data  a  standard  model  is  the  random  effects  model  (see  Section  21.7)  that 
lets  the  intercept  vary  across  individuals  while  the  slope  coefficients  are  not  random. 

For  nonlinear  models  a  similar  result  need  not  hold,  and  random  parameter  models 
can  be  preferred  as  they  permit  a  richer  parameterization.  Random  parameter  models 
are  consistent  with  existence  of  heterogeneous  responses  of  individuals  to  changes  in 
x.  A  leading  example  is  random  parameters  logit  in  Section  15.7. 

More serious complications can arise when the regression parameters $\beta_i$ for an
individual are related to observed individual characteristics. Then OLS estimation can
lead to inconsistent parameter estimation. An example is the fixed effects model for
panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In
this example, but not in all such examples, alternative consistent estimators for a subset
of  the  regression  parameters  are  available. 


4.8.  Instrumental  Variables 

A  major  complication  that  is  emphasized  in  microeconometrics  is  the  possibility  of 
inconsistent  parameter  estimation  caused  by  endogenous  regressors.  Then  regression 
estimates  measure  only  the  magnitude  of  association,  rather  than  the  magnitude  and 
direction  of  causation,  both  of  which  are  needed  for  policy  analysis. 

The instrumental variables estimator provides a way to nonetheless obtain consistent
parameter estimates. This method, widely used in econometrics and rarely used
elsewhere,  is  conceptually  difficult  and  easily  misused. 

We  provide  a  lengthy  expository  treatment  that  defines  an  instrumental  variable  and 
explains  how  the  instrumental  variables  method  works  in  a  simple  setting. 


4.8.1.  Inconsistency  of  OLS 

Consider  the  scalar  regression  model  with  dependent  variable  y  and  single  regressor  x. 
The  goal  of  regression  analysis  is  to  estimate  the  conditional  mean  function  E[y|x].  A 
linear  conditional  mean  model,  without  intercept  for  notational  convenience,  specifies 

$$E[y|x] = \beta x. \tag{4.42}$$

This model without intercept subsumes the model with intercept if dependent and
regressor variables are deviations from their respective means. Interest lies in obtaining
a consistent estimate of $\beta$ as this gives the change in the conditional mean given an
exogenous change in x. For example, interest may lie in the effect on earnings caused
by an increase in schooling attributed to exogenous reasons, such as an increase in the
minimum age at which students leave school, that are not a choice of the individual.
The OLS regression model specifies

$$y = \beta x + u, \tag{4.43}$$

where u is an error term. Regression of y on x yields OLS estimate $\hat{\beta}$ of $\beta$.

Standard regression results make the assumption that the regressors are uncorrelated
with the errors in the model (4.43). Then the only effect of x on y is a direct effect via
the term $\beta x$. We have the following path analysis diagram:

    x  →  y
       ↗
    u

where there is no association between x and u. So x and u are independent causes
of y.

However, in some situations there may be an association between regressors and
errors. For example, consider regression of log-earnings (y) on years of schooling (x).
The error term u embodies all factors other than schooling that determine earnings,
such as ability. Suppose a person has a high level of u, as a result of high (unobserved)
ability. This increases earnings, since $y = \beta x + u$, but it may also lead to higher levels
of x, since schooling is likely to be higher for those with high ability. A more
appropriate path diagram is then the following:

    x  →  y
    ↑  ↗
    u

where now there is an association between x and u.

What are the consequences of this correlation between x and u? Now higher levels
of x have two effects on y. From (4.43) there is both a direct effect via $\beta x$ and an
indirect effect via u affecting x, which in turn affects y. The goal of regression is
to estimate only the first effect, yielding an estimate of $\beta$. The OLS estimate will
instead combine these two effects, giving $\hat{\beta} > \beta$ in this example where both effects
are positive. Using calculus, we have $y = \beta x + u(x)$ with total derivative

$$\frac{dy}{dx} = \beta + \frac{du}{dx}. \tag{4.44}$$

The data give information on dy/dx, so OLS estimates the total effect $\beta + du/dx$
rather than $\beta$ alone. The OLS estimator is therefore biased and inconsistent for $\beta$,
unless there is no association between x and u.

A more formal treatment of the linear regression model with K regressors leads to
the same conclusion. From Section 4.7.1 a necessary condition for consistency of OLS
is that $\operatorname{plim} N^{-1}X'u = 0$. Consistency requires that the regressors are asymptotically
uncorrelated with the errors. From (4.37) the magnitude of the inconsistency of OLS
is $(X'X)^{-1}X'u$, the OLS coefficient from regression of u on x. This is just the OLS
estimate of du/dx, confirming the intuitive result in (4.44).
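
The decomposition of the OLS estimate into β plus the coefficient from regression of u on x holds exactly in any sample, which a small simulation sketch makes concrete. The design is an assumption for illustration only: β = 1, u standard normal, and x = 0.5u + noise, so that the population value of du/dx is Cov[x, u]/V[x] = 0.5/1.25 = 0.4. The regression of u on x is infeasible in practice because u is unobserved; it is computed here only because the data are simulated.

```python
import random

random.seed(3)
N = 200_000
beta = 1.0   # assumed true causal slope

xs, us, ys = [], [], []
for _ in range(N):
    u = random.gauss(0.0, 1.0)             # e.g. unobserved ability
    x = 0.5 * u + random.gauss(0.0, 1.0)   # regressor rises with u
    xs.append(x)
    us.append(u)
    ys.append(beta * x + u)


def slope(a, b):
    """No-intercept OLS slope from regressing b on a."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)


b_ols = slope(xs, ys)    # estimates the total effect beta + du/dx
du_dx = slope(xs, us)    # infeasible with real data: u is unobserved

print(f"OLS slope: {b_ols:.3f}; beta + du/dx: {beta + du_dx:.3f}")
```

Under these assumptions OLS settles near 1.4 rather than the causal value 1, and increasing N does not help.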


4.8.2.  Instrumental  Variable 

The  inconsistency  of  OLS  is  due  to  endogeneity  of  x,  meaning  that  changes  in  x  are 
associated  not  only  with  changes  in  y  but  also  changes  in  the  error  u.  What  is  needed 
is  a  method  to  generate  only  exogenous  variation  in  x.  An  obvious  way  is  through  a 
randomized  experiment,  but  for  most  economics  applications  such  experiments  are  too 
expensive  or  even  infeasible. 


Definition  of  an  Instrument 

A crude experimental or treatment approach is still possible using observational data,
provided there exists an instrument z that has the property that changes in z are associated
with changes in x but do not lead to change in y (aside from the indirect route
via x). This leads to the following path diagram:

    z  →  x  →  y
          ↑  ↗
          u

which introduces a variable z that is causally associated with x but not u. It is still
the  case  that  z  and  y  will  be  correlated,  but  the  only  source  of  such  correlation  is  the 
indirect  path  of  z  being  correlated  with  x,  which  in  turn  determines  y.  The  more  direct 
path  of  z  being  a  regressor  in  the  model  for  y  is  ruled  out. 

More  formally,  a  variable  z  is  called  an  instrument  or  instrumental  variable  for 
the regressor x in the scalar regression model $y = \beta x + u$ if (1) z is uncorrelated with
the  error  u  and  (2)  z  is  correlated  with  the  regressor  x. 

The  first  assumption  excludes  the  instrument  z  from  being  a  regressor  in  the  model 
for  y,  since  if  instead  y  depended  on  both  x  and  z,  and  y  is  regressed  on  x  alone,  then 
z  is  being  absorbed  into  the  error  so  that  z  will  then  be  correlated  with  the  error.  The 
second  assumption  requires  that  there  is  some  association  between  the  instrument  and 
the  variable  being  instrumented. 


Examples  of  an  Instrument 

In  many  microeconometric  applications  it  is  difficult  to  find  legitimate  instruments. 
Here  we  provide  two  examples. 

Suppose  we  want  to  estimate  the  response  of  market  demand  to  exogenous  changes 
in market price. Quantity demanded clearly depends on price, but prices are not exogenously
given since they are determined in part by market demand. A suitable instrument
for price is a variable that is correlated with price but does not directly affect
quantity demanded. An obvious candidate is a variable that affects market supply, since
this also affects prices, but is not a direct determinant of demand. An example is a measure
of favorable growing conditions if an agricultural product is being modeled. The
choice  of  instrument  here  is  uncontroversial,  provided  favorable  growing  conditions 
do  not  directly  affect  demand,  and  is  helped  greatly  by  the  formal  economic  model  of 
supply  and  demand. 

Next  suppose  we  want  to  estimate  the  returns  to  exogenous  changes  in  schooling. 
Most observational data sets lack measures of individual ability, so regression of earnings
on schooling has error that includes unobserved ability and hence is correlated
with the regressor schooling. We need an instrument z that is correlated with schooling,
uncorrelated with ability, and more generally uncorrelated with the error term,
which  means  that  it  cannot  directly  determine  earnings. 

One  popular  candidate  for  z  is  proximity  to  a  college  or  university  (Card,  1995). 
This  clearly  satisfies  condition  2  because,  for  example,  people  whose  home  is  a  long 
way  from  a  community  college  or  state  university  are  less  likely  to  attend  college.  It 
most  likely  satisfies  1,  though  since  it  can  be  argued  that  people  who  live  a  long  way 
from  a  college  are  more  likely  to  be  in  low-wage  labor  markets  one  needs  to  estimate 
a  multiple  regression  for  y  that  includes  additional  regressors  such  as  indicators  for 
nonmetropolitan  area. 

A  second  candidate  for  the  instrument  is  month  of  birth  (Angrist  and  Krueger, 
1991).  This  clearly  satisfies  condition  1  as  there  is  no  reason  to  believe  that  month 
of birth has a direct effect on earnings if the regression includes age in years. Surprisingly
condition 2 may also be satisfied, as birth month determines age of first entry
into school in the USA, which in turn may affect years of schooling since laws often
specify  a  minimum  school-leaving  age.  Bound,  Jaeger,  and  Baker  (1995)  provide  a 
critique  of  this  instrument. 

The consequences of choosing poor instruments are considered in detail in Section 4.9.


4.8.3.  Instrumental  Variables  Estimator 

For regression with scalar regressor x and scalar instrument z, the instrumental variables
(IV) estimator is defined as

$$\hat{\beta}_{IV} = (z'x)^{-1}z'y, \tag{4.45}$$

where, in the scalar regressor case, z, x, and y are N × 1 vectors. This estimator provides
a consistent estimator for the slope coefficient $\beta$ in the linear model $y = \beta x + u$ if z
is correlated with x and uncorrelated with the error term.

There  are  several  ways  to  derive  (4.45).  We  provide  an  intuitive  derivation,  one  that 
differs  from  derivations  usually  presented  such  as  that  in  Section  6.2.5. 

Return to the earnings-schooling example. Suppose a one-unit change in the instrument
z is associated with 0.2 more years of schooling and with a \$500 increase
in annual earnings. This increase in earnings is a consequence of the indirect effect
that increase in z led to increase in schooling, which in turn increases income. Then it
follows that 0.2 years additional schooling is associated with a \$500 increase in earnings,
so that a one-year increase in schooling is associated with a \$500/0.2 = \$2,500
increase in earnings. The causal estimate of $\beta$ is therefore 2,500. In mathematical
notation we have estimated the changes dx/dz and dy/dz and calculated the causal
estimator as

$$\hat{\beta}_{IV} = \frac{dy/dz}{dx/dz}. \tag{4.46}$$

This approach to identification of the causal parameter $\beta$ is given in Heckman (2000,
p. 58); see also the example in Section 2.4.2.

All that remains is consistent estimation of dy/dz and dx/dz. The obvious way to
estimate dy/dz is by OLS regression of y on z with slope estimate $(z'z)^{-1}z'y$. Similarly,
estimate dx/dz by OLS regression of x on z with slope estimate $(z'z)^{-1}z'x$.
Then

$$\hat{\beta}_{IV} = \frac{(z'z)^{-1}z'y}{(z'z)^{-1}z'x} = (z'x)^{-1}z'y. \tag{4.47}$$
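
The ratio-of-slopes derivation can be verified on simulated data. The sketch below is illustrative only; the dgp is an assumption: β = 1, an instrument z drawn independently of u, and x = 0.8z + 0.5u + noise, so that x is endogenous and the OLS probability limit is 1 + 0.5/1.89 ≈ 1.26. Variables have mean zero by construction, so the no-intercept formulas apply.

```python
import random

random.seed(4)
N = 200_000
beta = 1.0   # assumed true causal slope

zs, xs, ys = [], [], []
for _ in range(N):
    z = random.gauss(0.0, 1.0)   # instrument: independent of u
    u = random.gauss(0.0, 1.0)
    x = 0.8 * z + 0.5 * u + random.gauss(0.0, 1.0)   # endogenous regressor
    zs.append(z)
    xs.append(x)
    ys.append(beta * x + u)


def dot(a, b):
    """Inner product a'b of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


dy_dz = dot(zs, ys) / dot(zs, zs)    # OLS slope of y on z
dx_dz = dot(zs, xs) / dot(zs, zs)    # OLS slope of x on z
b_iv_ratio = dy_dz / dx_dz           # ratio form, as in (4.46)-(4.47)
b_iv = dot(zs, ys) / dot(zs, xs)     # direct form (z'x)^{-1} z'y, as in (4.45)
b_ols = dot(xs, ys) / dot(xs, xs)    # inconsistent under this dgp

print(f"IV (ratio): {b_iv_ratio:.3f}, IV (direct): {b_iv:.3f}, OLS: {b_ols:.3f}")
```

The two IV computations agree up to rounding, and both are close to the assumed β, whereas OLS is not.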


4.8.4.  Wald  Estimator 

A leading simple example of IV is one where the instrument z is a binary instrument.
Denote the subsample averages of y and x by $\bar{y}_1$ and $\bar{x}_1$, respectively, when
z = 1 and by $\bar{y}_0$ and $\bar{x}_0$, respectively, when z = 0. Then $\Delta y/\Delta z = (\bar{y}_1 - \bar{y}_0)$ and
$\Delta x/\Delta z = (\bar{x}_1 - \bar{x}_0)$, and (4.46) yields

$$\hat{\beta}_{Wald} = \frac{(\bar{y}_1 - \bar{y}_0)}{(\bar{x}_1 - \bar{x}_0)}. \tag{4.48}$$

This estimator is called the Wald estimator, after Wald (1940), or the grouping estimator.

The Wald estimator can also be obtained from the formula (4.45). For the no-intercept
model variables are measured in deviations from means, so $z'y = \sum_i (z_i - \bar{z})(y_i - \bar{y})$.
For binary z this yields $z'y = N_1(\bar{y}_1 - \bar{y}) = N_1 N_0(\bar{y}_1 - \bar{y}_0)/N$, where $N_0$
and $N_1$ are the number of observations for which z = 0 and z = 1. This result uses
$\bar{y}_1 - \bar{y} = (N_0\bar{y}_1 + N_1\bar{y}_1)/N - (N_0\bar{y}_0 + N_1\bar{y}_1)/N = N_0(\bar{y}_1 - \bar{y}_0)/N$. Similarly,
$z'x = N_1 N_0(\bar{x}_1 - \bar{x}_0)/N$. Combining these results, we have that (4.45) yields (4.48).

For the earnings-schooling example it is being assumed that we can define two
groups  where  group  membership  does  not  directly  determine  earnings,  though  it  does 
affect  level  of  schooling  and  hence  indirectly  affects  earnings.  Then  the  IV  estimate  is 
the  difference  in  average  earnings  across  the  two  groups  divided  by  the  difference  in 
average  schooling  across  the  two  groups. 
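
The equivalence of the Wald estimator and the IV formula (4.45) in deviations from means can be confirmed numerically. The following sketch uses an assumed dgp (β = 2, a binary instrument equal to 1 with probability 0.5, and x = z + 0.5u + noise); none of these values come from the text.

```python
import random

random.seed(5)
N = 200_000
beta = 2.0   # assumed true causal slope

zs, xs, ys = [], [], []
for _ in range(N):
    z = 1 if random.random() < 0.5 else 0   # binary instrument, independent of u
    u = random.gauss(0.0, 1.0)
    x = 1.0 * z + 0.5 * u + random.gauss(0.0, 1.0)
    zs.append(z)
    xs.append(x)
    ys.append(beta * x + u)


def mean(v):
    return sum(v) / len(v)


# Wald (grouping) estimator: ratio of differences in group means, as in (4.48).
y1 = mean([y for z, y in zip(zs, ys) if z == 1])
y0 = mean([y for z, y in zip(zs, ys) if z == 0])
x1 = mean([x for z, x in zip(zs, xs) if z == 1])
x0 = mean([x for z, x in zip(zs, xs) if z == 0])
b_wald = (y1 - y0) / (x1 - x0)

# IV formula (4.45) with all variables in deviations from their means.
z_bar, x_bar, y_bar = mean(zs), mean(xs), mean(ys)
zy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
zx = sum((z - z_bar) * (x - x_bar) for z, x in zip(zs, xs))
b_iv = zy / zx

print(f"Wald: {b_wald:.4f}, IV: {b_iv:.4f}")
```

The two estimates coincide up to floating-point rounding, as the algebra implies, and both are close to the assumed β.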


4.8.5.  Sample  Covariance  and  Correlation  Analysis 


The IV estimator can also be interpreted in terms of covariances or correlations.
For sample covariances we have directly from (4.45) that

$$\hat{\beta}_{IV} = \frac{\mathrm{Cov}[z, y]}{\mathrm{Cov}[z, x]}, \tag{4.49}$$

where here Cov[ ] is being used to denote sample covariance.

For sample correlations, note that the OLS estimator for the model (4.43) can be
written as $\hat{\beta}_{OLS} = r_{xy}\sqrt{y'y}/\sqrt{x'x}$, where $r_{xy} = x'y/\sqrt{(x'x)(y'y)}$ is the sample correlation
between x and y. This leads to the interpretation of the OLS estimator as implying
that a one standard deviation change in x is associated with an $r_{xy}$ standard deviation
change in y. The problem is that the correlation $r_{xy}$ is contaminated by correlation
between x and u. An alternative approach is to measure the correlation between x and
y indirectly by the correlation between z and y divided by the correlation between z
and x. Then

$$\hat{\beta}_{IV} = \frac{r_{zy}\sqrt{y'y}}{r_{zx}\sqrt{x'x}}, \tag{4.50}$$

which can be shown to equal $\hat{\beta}_{IV}$ in (4.45).
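
The equality with (4.45) takes one line of algebra. The steps below are our own working, written under the assumption that $r_{zy} = z'y/\sqrt{(z'z)(y'y)}$ and $r_{zx} = z'x/\sqrt{(z'z)(x'x)}$ are defined analogously to $r_{xy}$, with all variables in deviations from means:

```latex
\frac{r_{zy}\sqrt{y'y}}{r_{zx}\sqrt{x'x}}
  = \frac{z'y}{\sqrt{(z'z)(y'y)}}
    \cdot \frac{\sqrt{(z'z)(x'x)}}{z'x}
    \cdot \frac{\sqrt{y'y}}{\sqrt{x'x}}
  = \frac{z'y}{z'x}
  = (z'x)^{-1} z'y .
```

The terms in $z'z$, $y'y$, and $x'x$ cancel, so the correlation form carries no information beyond the ratio of sample covariances in (4.49).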


4.8.6. IV Estimation for Multiple Regression

Now consider the multiple regression model with typical observation

$$y = x'\beta + u,$$

with K regressor variables, so that x and $\beta$ are K × 1 vectors.


Instruments

Assume the existence of an r × 1 vector of instruments z, with $r \geq K$, satisfying the
following: 

1.  z  is  uncorrelated  with  the  error  u . 

2.  z  is  correlated  with  the  regressor  vector  x. 

3.  z  is  strongly  correlated,  rather  than  weakly  correlated,  with  the  regressor  vector  x. 

The  first  two  properties  are  necessary  for  consistency  and  were  presented  earlier  in 
the  scalar  case.  The  third  property,  defined  in  Section  4.9.1,  is  a  strengthening  of  the 
second  to  ensure  good  finite-sample  performance  of  the  IV  estimator. 

In the multiple regression case z and x may share some common components.
Some components of x, called exogenous regressors, may be uncorrelated with u.
These components are clearly suitable instruments as they satisfy conditions 1 and
2. Other components of x, called endogenous regressors, may be correlated with u.
These components lead to inconsistency of OLS and are also clearly unsuitable instruments
as they do not satisfy condition 1. Partition x into $x = [x_1' \; x_2']'$, where $x_1$
contains endogenous regressors and $x_2$ contains exogenous regressors. Then a valid
instrument is $z = [z_1' \; x_2']'$, where $x_2$ can be an instrument for itself, but we need to find
at least as many instruments $z_1$ as there are endogenous variables $x_1$.

Identification 

Identification  in  a  simultaneous  equations  model  was  presented  in  Section  2.5.  Here  we 
have  a  single  equation.  The  order  condition  requires  that  the  number  of  instruments 
must at least equal the number of independent endogenous components, so that $r \geq K$.
The  model  is  said  to  be  just-identified  if  r  =  K  and  overidentified  if  r  >  K. 

In  many  multiple  regression  applications  there  is  only  one  endogenous  regressor. 
For  example,  the  earnings  on  schooling  regression  will  include  many  other  regressors 
such  as  age,  geographic  location,  and  family  background.  Interest  lies  in  the  coefficient 
on  schooling,  but  this  is  an  endogenous  variable  most  likely  correlated  with  the  error 
because  ability  is  unobserved.  Possible  candidates  for  the  necessary  single  instrument 
for  schooling  have  already  been  given  in  Section  4.8.2. 

If  an  instrument  fails  the  first  condition  the  instrument  is  an  invalid  instrument.  If 
an  instrument  fails  the  second  condition  the  instrument  is  an  irrelevant  instrument, 
and the model may be unidentified if too few instruments are relevant. The third condition
fails when very low correlation exists between the instrument and the endogenous
variable being instrumented. The model is said to be weakly identified and the
instrument  is  called  a  weak  instrument. 

Instrumental  Variables  Estimator 

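When the model is just-identified, so that $r = K$, the instrumental variables
estimator is the obvious matrix generalization of (4.45)

$$\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}, \tag{4.51}$$

where $\mathbf{Z}$ is an $N \times K$ matrix with $i$th row $\mathbf{z}_i'$. Substituting the regression model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ for $\mathbf{y}$ in (4.51) yields

$$\begin{aligned}
\hat{\boldsymbol{\beta}}_{IV} &= (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'[\mathbf{X}\boldsymbol{\beta} + \mathbf{u}] \\
&= \boldsymbol{\beta} + (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{u} \\
&= \boldsymbol{\beta} + (N^{-1}\mathbf{Z}'\mathbf{X})^{-1} N^{-1}\mathbf{Z}'\mathbf{u}.
\end{aligned}$$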

It follows immediately that the IV estimator is consistent if

$$\operatorname{plim} N^{-1}\mathbf{Z}'\mathbf{u} = \mathbf{0}$$

and

$$\operatorname{plim} N^{-1}\mathbf{Z}'\mathbf{X} \neq \mathbf{0}.$$

These are essentially conditions 1 and 2 that z is uncorrelated with u and correlated
with x. To ensure that the inverse of $N^{-1}\mathbf{Z}'\mathbf{X}$ exists it is assumed that $\mathbf{Z}'\mathbf{X}$ is of full
rank K, a stronger assumption than the order condition that $r = K$.

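With heteroskedastic errors the IV estimator is asymptotically normal with mean $\boldsymbol{\beta}$
and variance matrix consistently estimated by

$$\hat{V}[\hat{\boldsymbol{\beta}}_{IV}] = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\hat{\boldsymbol{\Omega}}\mathbf{Z}(\mathbf{X}'\mathbf{Z})^{-1}, \tag{4.52}$$

where $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$. This result is obtained in a manner similar to that for OLS given
in Section 4.4.4.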
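
The estimator (4.51) and the robust variance estimate (4.52) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `iv_estimate` and the simulated data are hypothetical and are not part of the text. The right-hand factor of the sandwich is computed as the transpose of the inverse of Z'X.

```python
import numpy as np

def iv_estimate(y, X, Z):
    """Just-identified IV estimate (Z'X)^{-1} Z'y with a robust variance estimate."""
    ZX = Z.T @ X
    beta = np.linalg.solve(ZX, Z.T @ y)
    u_hat = y - X @ beta                        # residuals use the original regressors
    meat = (Z * (u_hat ** 2)[:, None]).T @ Z    # Z' Diag[u_hat_i^2] Z, no N x N matrix formed
    ZX_inv = np.linalg.inv(ZX)
    V = ZX_inv @ meat @ ZX_inv.T                # sandwich: last factor is the transposed inverse
    return beta, V

# Hypothetical simulated data: one endogenous regressor x and one instrument z.
rng = np.random.default_rng(0)
N = 5000
z = rng.normal(size=N)
v = rng.normal(size=N)
u = 0.8 * v + 0.6 * rng.normal(size=N)          # Cor[u, v] = 0.8, so x is endogenous
x = z + v
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(N), x])            # regressors: intercept and x
Z = np.column_stack([np.ones(N), z])            # instruments: intercept and z
beta_iv, V_iv = iv_estimate(y, X, Z)
se_iv = np.sqrt(np.diag(V_iv))
print("IV estimates:", beta_iv, "robust standard errors:", se_iv)
```

Solving the linear system rather than inverting Z'X for the point estimate is a numerical convenience only; the estimate is the same.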

The  IV  estimator,  although  consistent,  leads  to  a  loss  of  efficiency  that  can  be  very 
large in practice. Intuitively IV will not work well if the instrument z has low
correlation with the regressor x (see Section 4.9.3).


4.8.7.  Two-Stage  Least  Squares 

The IV estimator in (4.51) requires that the number of instruments equals the number
of  regressors.  For  overidentified  models  the  IV  estimator  can  be  used,  by  discarding 
some  of  the  instruments  so  that  the  model  is  just-identified.  However,  an  asymptotic 
efficiency  loss  can  occur  when  discarding  these  instruments. 

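Instead, a common procedure is to use the two-stage least-squares (2SLS) estimator

$$\hat{\boldsymbol{\beta}}_{2SLS} = [\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}]^{-1}[\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}], \tag{4.53}$$

presented and motivated in Section 6.4.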

The  2SLS  estimator  is  an  IV  estimator.  In  a  just-identified  model  it  simplifies  to 
the  IV  estimator  given  in  (4.51)  with  instruments  Z.  In  an  overidentified  model  the 
2SLS estimator equals the IV estimator given in (4.51) if the instruments are $\hat{\mathbf{X}}$, where
$\hat{\mathbf{X}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$ is the predicted value of x from OLS regression of x on z.

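The 2SLS estimator gets its name from the result that it can be obtained by two
consecutive OLS regressions: OLS regression of x on z to get $\hat{\mathbf{x}}$ followed by OLS
of y on $\hat{\mathbf{x}}$, which gives $\hat{\boldsymbol{\beta}}_{2SLS}$. This interpretation does not necessarily generalize to
nonlinear regressions; see Section 6.5.6.

The 2SLS estimator is often expressed more compactly as

$$\hat{\boldsymbol{\beta}}_{2SLS} = [\mathbf{X}'\mathbf{P}_Z\mathbf{X}]^{-1}[\mathbf{X}'\mathbf{P}_Z\mathbf{y}], \tag{4.54}$$

where

$$\mathbf{P}_Z = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$$

is an idempotent projection matrix that satisfies $\mathbf{P}_Z = \mathbf{P}_Z'$, $\mathbf{P}_Z\mathbf{P}_Z' = \mathbf{P}_Z$, and $\mathbf{P}_Z\mathbf{Z} = \mathbf{Z}$.
The 2SLS estimator can be shown to be asymptotically normally distributed with
estimated asymptotic variance

$$\hat{V}[\hat{\boldsymbol{\beta}}_{2SLS}] = N[\mathbf{X}'\mathbf{P}_Z\mathbf{X}]^{-1}[\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\hat{\mathbf{S}}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}][\mathbf{X}'\mathbf{P}_Z\mathbf{X}]^{-1}, \tag{4.55}$$

where in the usual case of heteroskedastic errors $\hat{\mathbf{S}} = N^{-1}\sum_i \hat{u}_i^2\mathbf{z}_i\mathbf{z}_i'$ and
$\hat{u}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{2SLS}$. A commonly used small-sample adjustment is to divide by $N - K$ rather
than $N$ in the formula for $\hat{\mathbf{S}}$.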

In the special case that errors are homoskedastic, simplification occurs and
$\hat{V}[\hat{\boldsymbol{\beta}}_{2SLS}] = s^2[\mathbf{X}'\mathbf{P}_Z\mathbf{X}]^{-1}$. This latter result is given in many introductory treatments,
but the more general formula (4.55) is preferred as the modern approach is to treat
errors as potentially heteroskedastic.
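
A minimal sketch of 2SLS in NumPy follows, assuming simulated data with two instruments for one endogenous regressor. The function names `tsls` and `two_step_ols` are hypothetical. The sketch checks that the projection form and the two consecutive OLS regressions give the same coefficients, and it computes the robust variance from residuals based on X rather than on the first-stage fitted values.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS estimate [X'P_Z X]^{-1} X'P_Z y with a heteroskedasticity-robust variance."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # P_Z X: first-stage fitted values
    A = X_hat.T @ X                                    # equals X' P_Z X
    beta = np.linalg.solve(A, X_hat.T @ y)
    u_hat = y - X @ beta                               # residuals use X, not X_hat
    meat = (X_hat * (u_hat ** 2)[:, None]).T @ X_hat   # sum_i u_hat_i^2 xhat_i xhat_i'
    A_inv = np.linalg.inv(A)
    return beta, A_inv @ meat @ A_inv

def two_step_ols(y, X, Z):
    """Same coefficients from OLS of X on Z, then OLS of y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Hypothetical overidentified design: two instruments for one endogenous regressor.
rng = np.random.default_rng(1)
N = 5000
z1, z2, v = rng.normal(size=(3, N))
u = 0.8 * v + 0.6 * rng.normal(size=N)
x = 0.7 * z1 + 0.7 * z2 + v
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])
beta_2sls, V_2sls = tsls(y, X, Z)
beta_two_step = two_step_ols(y, X, Z)
se_2sls = np.sqrt(np.diag(V_2sls))
print(beta_2sls, beta_two_step, se_2sls)
```

The second-stage OLS reproduces the coefficients but not valid standard errors, because its residuals would be computed from the fitted values rather than from X.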

For  overidentified  models  with  heteroskedastic  errors  an  estimator  that  White 
(1982)  calls  the  two-stage  instrumental  variables  estimator  is  more  efficient  than 
2SLS.  Moreover,  some  commonly  used  model  specification  tests  require  estimation 
by  this  estimator  rather  than  2SLS.  For  details  see  Section  6.4.2. 


4.8.8.  IV  Example 

As an example of IV estimation, consider estimation of the slope coefficient of x for
the dgp

$$y = 0 + 0.5x + u,$$
$$x = 0 + z + v,$$

where $z \sim \mathcal{N}[2, 1]$ and $(u, v)$ are joint normal with means 0, variances 1, and
correlation 0.8.

OLS of y on x yields inconsistent estimates as x is correlated with u since by
construction x is correlated with v, which in turn is correlated with u. IV estimation
yields consistent estimates. The variable z is a valid instrument since by construction
it is uncorrelated with u but is correlated with x. Transformations of z, such as $z^3$, are
also valid instruments.

Various  estimates  and  associated  standard  errors  from  a  generated  data  sample  of 
size  10,000  are  given  in  Table  4.4.  We  focus  on  the  slope  coefficient. 

The  OLS  estimator  is  inconsistent,  with  slope  coefficient  estimate  of  0.902  being 
more  than  50  standard  errors  from  the  true  value  of  0.5.  The  remaining  estimates  are 
consistent  and  are  all  within  two  standard  errors  of  0.5. 
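
The data-generating process above is simple enough to simulate directly. The following sketch, with an arbitrary seed, will not reproduce the tabulated numbers exactly, but the OLS slope should be close to its probability limit 0.5 + Cov[x, u]/V[x] = 0.5 + 0.8/2 = 0.9, and the IV slopes should be close to 0.5. The helper `slope` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 10_000

# z ~ N[2, 1]; (u, v) joint normal with unit variances and correlation 0.8.
z = rng.normal(loc=2.0, scale=1.0, size=N)
cov_uv = np.array([[1.0, 0.8], [0.8, 1.0]])
u, v = rng.multivariate_normal([0.0, 0.0], cov_uv, size=N).T
x = z + v
y = 0.5 * x + u

def slope(a, b):
    """Slope coefficient from OLS regression of b on a with an intercept."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

b_ols = slope(x, y)                       # inconsistent: x is correlated with u
b_iv = slope(z, y) / slope(z, x)          # IV as a ratio of two OLS slopes
b_iv_z3 = np.cov(z**3, y)[0, 1] / np.cov(z**3, x)[0, 1]   # z cubed as the instrument

print(f"OLS {b_ols:.3f}  IV {b_iv:.3f}  IV(z^3) {b_iv_z3:.3f}")
```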

Table 4.4. Instrumental Variables Exampleᵃ

               OLS        IV        2SLS      IV (z³)
Constant     -0.804    -0.017    -0.017    -0.014
             (0.014)   (0.022)   (0.032)   (0.025)
x             0.902     0.510     0.510     0.509
             (0.006)   (0.010)   (0.014)   (0.012)
R²            0.709     0.576     0.576     0.574

ᵃ Generated data for a sample size of 10,000. OLS is inconsistent and other
estimators are consistent. Robust standard errors are reported though they are
unnecessary here as errors are homoskedastic. The 2SLS standard errors are incorrect.
The data-generating process is given in the text.

There are several ways to compute the IV estimator. The slope coefficient from
OLS regression of y on z is 0.5168 and from OLS regression of x on z it is 1.0124,
yielding an IV estimate of 0.5168/1.0124 = 0.510 using (4.47). In practice one instead
directly computes the IV estimator using (4.45) or (4.51), with z used as the instrument
for x and standard errors computed using (4.52). The 2SLS estimator (see (4.54))
can be computed by OLS regression of y on $\hat{x}$, where $\hat{x}$ is the prediction from OLS
regression of x on z. The 2SLS estimates exactly equal the IV estimates in this
just-identified model, though the standard errors from this OLS regression of y on $\hat{x}$ are
incorrect as will be explained in Section 6.4.5.

The final column uses $z^3$ rather than z as the instrument for x. This alternative IV
estimator is consistent, since $z^3$ is uncorrelated with u and correlated with x. However,
it is less efficient for this particular dgp, and the standard error of the slope coefficient
rises from 0.010 to 0.012.

There is an efficiency loss in IV estimation compared to OLS estimation; see (4.61)
for a general result for the case of a single regressor and single instrument. Here
$r^2_{xz} = 0.510$, not given in Table 4.4, is high so the loss is not great and the standard error of
the slope coefficient increases somewhat from 0.006 to 0.010. In practice the efficiency
loss can be much greater than this.


4.9.  Instrumental  Variables  in  Practice 

Important  practical  issues  include  determining  whether  IV  methods  are  necessary  and, 
if  necessary,  determining  whether  the  instruments  are  valid.  The  relevant  specification 
tests are presented in Section 8.4. Unfortunately, the validity of these tests is limited. They
require  the  assumption  that  in  a  just-identified  model  the  instruments  are  valid  and  test 
only  overidentifying  restrictions. 

Although  IV  estimators  are  consistent  given  valid  instruments,  as  detailed  in  the 
following,  IV  estimators  can  be  much  less  efficient  than  the  OLS  estimator  and  can 
have  a  finite-sample  distribution  that  for  usual  finite-sample  sizes  differs  greatly  from 
the  asymptotic  distribution.  These  problems  are  greatly  magnified  if  instruments  are 
weakly correlated with the variables being instrumented. One way that weak
instruments can arise is if there are many more instruments than needed. This is simply
dealt with by dropping some of the instruments (see also Donald and Newey, 2001). A
more fundamental problem arises when even with the minimal number of instruments
one or more of the instruments is weak.

This  section  focuses  on  the  problem  of  weak  instruments. 


4.9.1.  Weak  Instruments 

There is no single definition of a weak instrument. Many authors use the following
signals of a weak instrument, presented here for progressively more complex models.

•  Scalar regressor x and scalar instrument z: A weak instrument is one for which $r^2_{xz}$ is
small.

•  Scalar regressor x and vector of instruments z: The instruments are weak if the $R^2$ from
regression of x on z, denoted $R^2_{x,\mathbf{z}}$, is small or if the F-statistic for test of overall fit in
this regression is small.

•  Multiple regressors x with only one endogenous: A weak instrument is one for which
the partial $R^2$ is low or the partial F-statistic is small, where these partial statistics are
defined toward the end of Section 4.9.1.

•  Multiple regressors x with several endogenous: There are several measures.


R2  Measures 


Consider a single equation

$$y = \beta_1 x_1 + \mathbf{x}_2'\boldsymbol{\beta}_2 + u, \tag{4.56}$$

where just one regressor $x_1$ is endogenous and the remaining regressors in the vector
$\mathbf{x}_2$ are exogenous. Assume that the instrument vector $\mathbf{z}$ includes the exogenous
instruments $\mathbf{x}_2$, as well as at least one other instrument.

One possible $R^2$ measure is the usual $R^2$ from regression of $x_1$ on $\mathbf{z}$. However, this
could be high only because $x_1$ is highly correlated with $\mathbf{x}_2$ whereas intuitively we really
need $x_1$ to be highly correlated with the instrument(s) other than $\mathbf{x}_2$.

Bound, Jaeger, and Baker (1995) therefore proposed use of a partial $R^2$, denoted
$R_p^2$, that purges the effect of $\mathbf{x}_2$. $R_p^2$ is obtained as $R^2$ from the regression

$$(x_1 - \tilde{x}_1) = (\mathbf{z} - \tilde{\mathbf{z}})'\boldsymbol{\gamma} + v, \tag{4.57}$$

where $\tilde{x}_1$ and $\tilde{\mathbf{z}}$ are the fitted values from regressions of $x_1$ on $\mathbf{x}_2$ and $\mathbf{z}$ on $\mathbf{x}_2$. In the
just-identified case $\mathbf{z} - \tilde{\mathbf{z}}$ will reduce to $z_1 - \tilde{z}_1$, where $z_1$ is the single instrument other
than $\mathbf{x}_2$ and $\tilde{z}_1$ is the fitted value from regression of $z_1$ on $\mathbf{x}_2$.

It is not unusual for $R_p^2$ to be much lower than $R^2_{x_1,\mathbf{z}}$. The formula for $R_p^2$ simplifies
to $R^2_{x,\mathbf{z}}$ when there is only one regressor and it is endogenous. It further simplifies to
$\mathrm{Cor}[x, z]^2$ when there is only one instrument.
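
The partial $R^2$ from (4.57) can be computed by residualizing on the exogenous regressors. The sketch below is an illustration on hypothetical simulated data, with hypothetical function names. It assumes that the exogenous regressors include a constant, so that residuals have mean zero, and it uses the fact that the components of z that are themselves exogenous regressors have zero residual, so only the additional instruments matter.

```python
import numpy as np

def residualize(a, B):
    """Residuals from OLS regression of a (vector or matrix) on the columns of B."""
    return a - B @ np.linalg.lstsq(B, a, rcond=None)[0]

def partial_r2(x1, Z1, X2):
    """R^2 from regressing (x1 purged of X2) on (Z1 purged of X2); X2 includes a constant."""
    x1_res = residualize(x1, X2)
    Z1_res = residualize(Z1, X2)
    e = residualize(x1_res, Z1_res)
    return 1.0 - (e @ e) / (x1_res @ x1_res)

# Hypothetical data: x1 depends strongly on the exogenous regressor w, weakly on z1.
rng = np.random.default_rng(42)
N = 5000
w = rng.normal(size=N)
z1 = rng.normal(size=N)
x1 = 2.0 * w + 0.2 * z1 + rng.normal(size=N)

X2 = np.column_stack([np.ones(N), w])     # exogenous regressors, including the constant
Z1 = z1[:, None]                          # instruments other than X2
Z_full = np.column_stack([Z1, X2])        # full instrument set

raw_res = residualize(x1, Z_full)
raw_r2 = 1.0 - (raw_res @ raw_res) / np.sum((x1 - x1.mean()) ** 2)
rp2 = partial_r2(x1, Z1, X2)
print(f"usual R2 = {raw_r2:.3f}, partial R2 = {rp2:.3f}")
```

In this design the usual $R^2$ is high only because of w, while the partial $R^2$ is small, which is the situation the partial measure is meant to reveal.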

When  there  is  more  than  one  endogenous  variable,  analysis  is  less  straightforward 
as  a  number  of  generalizations  of  R2  have  been  proposed. 

Consider a single-equation model with more than one endogenous variable and
focus on estimation of the coefficient of the first endogenous variable. Then in (4.56)
$x_1$ is endogenous and additionally some of the variables in $\mathbf{x}_2$ are also endogenous.
Several alternative measures replace the right-hand side of (4.57) with a residual that
controls for the presence of other endogenous regressors. Shea (1997) proposed a
partial $R^2$, say $R_p^{*2}$, that is computed as the squared sample correlation between $(x_1 - \tilde{x}_1)$
and $(\hat{x}_1 - \tilde{\hat{x}}_1)$. Here $(x_1 - \tilde{x}_1)$ is again the residual from regression of $x_1$ on $\mathbf{x}_2$, whereas
$(\hat{x}_1 - \tilde{\hat{x}}_1)$ is the residual from regression of $\hat{x}_1$ (the fitted value from regression of $x_1$
on $\mathbf{z}$) on $\hat{\mathbf{x}}_2$ (the fitted value from regression of $\mathbf{x}_2$ on $\mathbf{z}$). Poskitt and Skeels (2002)
proposed an alternative partial $R^2$, which, like Shea's $R_p^{*2}$, simplifies to $R_p^2$ when there is
only one endogenous regressor. Hall, Rudebusch, and Wilcox (1996) instead proposed
use of canonical correlations.

These  measures  for  the  coefficient  for  the  first  endogenous  variable  can  be  repeated 
for  the  other  endogenous  variables.  Poskitt  and  Skeels  (2002)  additionally  consider  an 
R2  measure  that  applies  jointly  to  instrumentation  of  all  the  endogenous  variables. 

The  problems  of  inconsistency  of  estimators  and  loss  of  precision  are  magnified 
as  the  partial  R2  measures  fall,  as  detailed  in  Sections  4.9.2  and  4.9.3.  See  especially 
(4.60)  and  (4.62). 


Partial  F-Statistics 

To detect poor finite-sample performance, considered in Section 4.9.4, it is common to use
a related measure, the F-statistic for whether coefficients are zero in regression of the
endogenous regressor on instruments.

For a single regressor that is endogenous we use the usual overall F-statistic, for a
test of $\boldsymbol{\pi} = \mathbf{0}$ in the regression $x = \mathbf{z}'\boldsymbol{\pi} + v$ of the endogenous regressor on the
instruments. This F-statistic is a function of $R^2_{x,\mathbf{z}}$.

More commonly, some exogenous regressors also appear in the model, and in model
(4.56) with single endogenous regressor $x_1$ we use the F-statistic for a test of $\boldsymbol{\pi}_1 = \mathbf{0}$
in the regression

$$x_1 = \mathbf{z}_1'\boldsymbol{\pi}_1 + \mathbf{x}_2'\boldsymbol{\pi}_2 + v, \tag{4.58}$$

where $\mathbf{z}_1$ are the instruments other than the exogenous regressors and $\mathbf{x}_2$ are the
exogenous regressors. This is the first-stage regression in the two-stage least-squares
interpretation of IV.

This statistic is used as a signal of potential finite-sample bias in the IV estimator.
In Section 4.9.4 we explain results of Staiger and Stock (1997) that suggest a value
less than 10 is problematic and a value of 5 or less is a sign of extreme finite-sample
bias. We also consider there the extension to more than one endogenous regressor.
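
As a rough sketch, the first-stage partial F-statistic can be computed by comparing the residual sums of squares from the first-stage regression with and without the additional instruments. The version below is the classical F-statistic, which assumes homoskedastic first-stage errors; the function names and the simulated data are hypothetical.

```python
import numpy as np

def residualize(a, B):
    """Residuals from OLS regression of a on the columns of B."""
    return a - B @ np.linalg.lstsq(B, a, rcond=None)[0]

def first_stage_F(x1, Z1, X2):
    """Classical F-statistic for excluding Z1 from the regression of x1 on [Z1, X2]."""
    N = x1.shape[0]
    W = np.column_stack([Z1, X2])                   # unrestricted first-stage regressors
    ssr_r = np.sum(residualize(x1, X2) ** 2)        # restricted: exogenous regressors only
    ssr_u = np.sum(residualize(x1, W) ** 2)         # unrestricted: instruments added
    q = W.shape[1] - X2.shape[1]                    # number of excluded instruments
    return ((ssr_r - ssr_u) / q) / (ssr_u / (N - W.shape[1]))

# Hypothetical data: the same instrument enters strongly in one design, weakly in another.
rng = np.random.default_rng(7)
N = 1000
w = rng.normal(size=N)
z1 = rng.normal(size=N)
v = rng.normal(size=N)
X2 = np.column_stack([np.ones(N), w])

x_strong = 0.50 * z1 + w + v
x_weak = 0.02 * z1 + w + v
F_strong = first_stage_F(x_strong, z1, X2)
F_weak = first_stage_F(x_weak, z1, X2)
print(f"F (strong instrument) = {F_strong:.1f}, F (weak instrument) = {F_weak:.2f}")
```

With heteroskedastic errors a robust Wald version of the statistic would be the natural replacement; that variant is not shown here.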


4.9.2.  Inconsistency  of  IV  Estimators 

The  essential  condition  for  consistency  of  IV  is  condition  1  in  Section  4.8.6,  that 
the  instrument  should  be  uncorrelated  with  the  error  term.  No  test  is  possible  in  the 
just-identified case. In the overidentified case a test of the overidentifying
assumptions is possible (see Section 6.4.3). Rejection then could be due to either instrument
endogeneity or model failure. Thus condition 1 is difficult to test directly and
determining whether an instrument is exogenous is usually a subjective decision, albeit one
often guided by economic theory.

It is always possible to create an exogenous instrument through functional form
restrictions. For example, suppose there are two regressors so that $y = \beta_1 x_1 + \beta_2 x_2 + u$,
with $x_1$ uncorrelated with u and $x_2$ correlated with u. Note that throughout this
section all variables are assumed to be measured in departures from means, so that
without loss of generality the intercept term can be omitted. Then OLS is inconsistent,
as $x_2$ is endogenous. A seemingly good instrument for $x_2$ is $x_1^2$, since $x_1^2$ is likely to
be uncorrelated with u because $x_1$ is uncorrelated with u. However, the validity of
this instrument requires the functional form restriction on the conditional mean that
$x_1$ only enters the model linearly and not quadratically. In practice one should view a
linear model as only an approximation, and obtaining instruments in such an artificial
way can be easily criticized.

A better way to create a valid instrument is through alternative exclusion
restrictions that do not rely so heavily on choice of functional form. Some practical examples
have  been  given  in  Section  4.8.2. 

Structural  models  such  as  the  classical  linear  simultaneous  equations  model  (see 
Sections  2.4  and  6.10.6)  make  such  exclusion  restrictions  very  explicit.  Even  then  the 
restrictions  can  often  be  criticized  for  being  too  ad  hoc,  unless  compelling  economic 
theory  supports  the  restrictions. 

For  panel  data  applications  it  may  be  reasonable  to  assume  that  only  current  data 
may  belong  in  the  equation  of  interest  -  an  exclusion  restriction  permitting  past  data 
to  be  used  as  instruments  under  the  assumption  that  errors  are  serially  uncorrelated 
(see  Section  22.2.4).  Similarly,  in  models  of  decision  making  under  uncertainty  (see 
Section  6.2.7),  lagged  variables  can  be  used  as  instruments  as  they  are  part  of  the 
information  set. 

There  is  no  formal  test  of  instrument  exogeneity  that  does  not  additionally  test 
whether the regression equation is correctly specified. Instrument exogeneity
inevitably relies on a priori information, such as that from economic or statistical theory.
The  evaluation  by  Bound  et  al.  (1995,  pp.  446-447)  of  the  validity  of  the  instruments 
used  by  Angrist  and  Krueger  (1991)  provides  an  insightful  example  of  the  subtleties 
involved  in  determining  instrument  exogeneity. 

It  is  especially  important  that  an  instrument  be  exogenous  if  an  instrument  is  weak, 
because  with  weak  instruments  even  very  mild  endogeneity  of  the  instrument  can  lead 
to  IV  parameter  estimates  that  are  much  more  inconsistent  than  the  already  inconsistent 
OLS  parameter  estimates. 

For simplicity consider linear regression with one regressor and one instrument;
hence $y = \beta x + u$. Then performing some algebra, left as an exercise, yields

$$\frac{\operatorname{plim}\hat{\beta}_{IV} - \beta}{\operatorname{plim}\hat{\beta}_{OLS} - \beta} = \frac{\mathrm{Cor}[z, u]}{\mathrm{Cor}[x, u]} \times \frac{1}{\mathrm{Cor}[z, x]}. \tag{4.59}$$

Thus with an invalid instrument and low correlation between the instrument and the
regressor, the IV estimator can be even more inconsistent than OLS. For example,
suppose the correlation between z and x is 0.1, which is not unusual for cross-section
data. Then IV becomes more inconsistent than OLS as soon as the correlation
coefficient between z and u exceeds a mere 0.1 times the correlation coefficient between x
and u.
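
The arithmetic in this example can be checked with a small helper that evaluates the right-hand side of (4.59). The function name and the particular correlation values are hypothetical and chosen only for illustration.

```python
import math

def relative_inconsistency(cor_zu, cor_xu, cor_zx):
    """Ratio of IV inconsistency to OLS inconsistency: (Cor[z,u]/Cor[x,u]) / Cor[z,x]."""
    return (cor_zu / cor_xu) / cor_zx

# Weak instrument (Cor[z, x] = 0.1) and an endogenous regressor (Cor[x, u] = 0.5).
at_threshold = relative_inconsistency(cor_zu=0.05, cor_xu=0.5, cor_zx=0.1)
above_threshold = relative_inconsistency(cor_zu=0.10, cor_xu=0.5, cor_zx=0.1)

# At Cor[z, u] = 0.1 * Cor[x, u] the two inconsistencies are equal (ratio of one);
# doubling Cor[z, u] doubles the ratio.
print(at_threshold, above_threshold)
```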

Result (4.59) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the
exogenous regressors. Then

$$\frac{\operatorname{plim}\hat{\beta}_{1,2SLS} - \beta_1}{\operatorname{plim}\hat{\beta}_{1,OLS} - \beta_1} = \frac{\mathrm{Cor}[\hat{x}, u]}{\mathrm{Cor}[x, u]} \times \frac{1}{R_p^2}, \tag{4.60}$$

where $R_p^2$ is defined after (4.56). For extension to more than one endogenous regressor
see Shea (1997).

These  results,  emphasized  by  Bound  et  al.  (1995),  have  profound  implications  for 
the  use  of  IV.  If  instruments  are  weak  then  even  mild  instrument  endogeneity  can  lead 
to  IV  being  even  more  inconsistent  than  OLS.  Perhaps  because  the  conclusion  is  so 
negative,  the  literature  has  neglected  this  aspect  of  weak  instruments.  A  notable  recent 
exception  is  Hahn  and  Hausman  (2003a). 

Most  of  the  literature  assumes  that  condition  1  is  satisfied,  so  that  IV  is  consistent, 
and  focuses  on  other  complications  attributable  to  weak  instruments. 


4.9.3.  Low  Precision 

Although  IV  estimation  can  lead  to  consistent  estimation  when  OLS  is  inconsistent,  it 
also  leads  to  a  loss  in  precision.  Intuitively,  from  Section  4.8.2  the  instrument  z  is  a 
treatment  that  leads  to  exogenous  movement  in  x  but  does  so  with  considerable  noise. 

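The loss in precision increases, and standard errors increase, with weaker
instruments. This is easily seen in the simplest case of a single endogenous regressor and
single instrument with iid errors. Then the asymptotic variance is

$$\begin{aligned}
V[\hat{\beta}_{IV}] &= \sigma^2(\mathbf{x}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{z}(\mathbf{z}'\mathbf{x})^{-1} \\
&= [\sigma^2/\mathbf{x}'\mathbf{x}] / [(\mathbf{z}'\mathbf{x})^2/(\mathbf{z}'\mathbf{z})(\mathbf{x}'\mathbf{x})] \\
&= V[\hat{\beta}_{OLS}]/r^2_{xz}.
\end{aligned} \tag{4.61}$$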

For example, if the squared sample correlation coefficient between z and x equals 0.1,
then the IV variance is 10 times that of OLS, so IV standard errors are about 3.2 times
those of OLS. Moreover, the IV estimator has
larger variance than the OLS estimator unless $\mathrm{Cor}[z, x] = 1$.
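
Result (4.61) is an algebraic identity once x and z are measured as deviations from means, which a short numerical check makes concrete. The data below are hypothetical, and the error variance is simply set to one since it cancels in the ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
z = rng.normal(size=N)
x = 0.3 * z + rng.normal(size=N)

# Measure both variables in deviations from means, as assumed in the text.
x = x - x.mean()
z = z - z.mean()

sigma2 = 1.0                                   # assumed error variance; cancels in the ratio
var_iv = sigma2 * (z @ z) / (z @ x) ** 2       # sigma^2 (x'z)^{-1} z'z (z'x)^{-1}
var_ols = sigma2 / (x @ x)                     # sigma^2 / x'x
r_xz = np.corrcoef(x, z)[0, 1]

variance_ratio = var_iv / var_ols              # equals 1 / r_xz^2
se_ratio = np.sqrt(variance_ratio)             # equals 1 / |r_xz|
print(f"r2_xz = {r_xz**2:.3f}, variance ratio = {variance_ratio:.2f}, se ratio = {se_ratio:.2f}")
```

The standard error ratio is the inverse of the absolute correlation, not of the squared correlation.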

Result (4.61) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the
exogenous regressors. Then

$$\mathrm{se}[\hat{\beta}_{1,2SLS}] = \mathrm{se}[\hat{\beta}_{1,OLS}]/R_p, \tag{4.62}$$

where se[ ] denotes asymptotic standard error and $R_p^2$ is defined after (4.56). For
extension to more than one endogenous regressor this $R_p^2$ is replaced by the $R_p^{*2}$ proposed
by Shea (1997). This provided the motivation for Shea's test statistic.

The poor precision is concentrated on the coefficients for endogenous variables. For
exogenous variables the standard errors for 2SLS coefficient estimates are similar to
those for OLS. Intuitively, exogenous variables are being instrumented by themselves,
so they have a very strong instrument.

For  the  coefficients  of  an  endogenous  regressor  it  is  a  low  partial  R2,  rather  than  R2, 
that  leads  to  a  loss  of  estimator  precision.  This  explains  why  2SLS  standard  errors  can 
be  much  higher  than  OLS  standard  errors  despite  the  high  raw  correlation  between  the 
endogenous  variable  and  the  instruments.  Going  the  other  way,  2SLS  standard  errors 
for  coefficients  of  endogenous  variables  that  are  much  larger  than  OLS  standard  errors 
provide  a  clear  signal  that  instruments  are  weak. 

Statistics  used  to  detect  low  precision  of  IV  caused  by  weak  instruments  are  called 
measures of instrument relevance. To some extent they are unnecessary as the
problem is easily detected if IV standard errors are much larger than OLS standard errors.


4.9.4.  Finite-Sample  Bias 

This  section  summarizes  a  relatively  challenging  and  as  yet  unfinished  literature  on 
“weak  instruments”  that  focuses  on  the  practical  problem  that  even  in  “large”  samples 
asymptotic theory can provide a poor approximation to the distribution of the IV
estimator. In particular the IV estimator is biased in finite samples even if asymptotically
consistent.  The  bias  can  be  especially  pronounced  when  instruments  are  weak. 

This bias of IV, which is toward the inconsistent OLS estimator, can be
remarkably large, as demonstrated in a simple Monte Carlo experiment by Nelson and Startz
(1990),  and  by  a  real  data  application  involving  several  hundred  thousand  observations 
but  very  weak  instruments  by  Bound  et  al.  (1995).  Moreover,  the  standard  errors  can 
also  be  very  biased,  as  also  demonstrated  by  Nelson  and  Startz  (1990). 

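The theoretical literature entails quite specialized and advanced econometric theory,
as it is actually difficult to obtain the finite-sample mean of the IV estimator. To see this,
consider adapting to the IV estimator the usual proof of unbiasedness of the OLS
estimator given in Section 4.4.8. For $\hat{\boldsymbol{\beta}}_{IV}$ defined in (4.51) for the just-identified case
this yields

$$\begin{aligned}
\mathrm{E}[\hat{\boldsymbol{\beta}}_{IV}] &= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{Z},\mathbf{X},\mathbf{u}}[(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{u}] \\
&= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{Z},\mathbf{X}}[(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}' \times \mathrm{E}[\mathbf{u}|\mathbf{Z},\mathbf{X}]],
\end{aligned}$$

where the unconditional expectation with respect to all stochastic variables, Z, X,
and u, is obtained by first taking expectation with respect to u conditional on Z
and X, using the Law of Iterated Expectations (see Section A.8). An obvious
sufficient condition for the IV estimator to have mean $\boldsymbol{\beta}$ is that $\mathrm{E}[\mathbf{u}|\mathbf{Z},\mathbf{X}] = \mathbf{0}$. This
assumption is too strong, however, because it implies $\mathrm{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$, in which case
there would be no need to instrument in the first place. So there is no simple way
to obtain $\mathrm{E}[\hat{\boldsymbol{\beta}}_{IV}]$. A similar problem does not arise in establishing consistency. Then
$\hat{\boldsymbol{\beta}}_{IV} = \boldsymbol{\beta} + (N^{-1}\mathbf{Z}'\mathbf{X})^{-1}N^{-1}\mathbf{Z}'\mathbf{u}$, where the term $N^{-1}\mathbf{Z}'\mathbf{u}$ can be considered in
isolation of X and the assumption $\mathrm{E}[\mathbf{u}|\mathbf{Z}] = \mathbf{0}$ leads to $\operatorname{plim} N^{-1}\mathbf{Z}'\mathbf{u} = \mathbf{0}$.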

Therefore we need to use alternative methods to obtain the mean of the IV estimator.
Here we merely summarize key results.

Initial research made the strong assumption of joint normality of variables and
homoskedastic errors. Then the IV estimator has a Wishart distribution (defined in
Chapter 13). Surprisingly, the mean of the IV estimator does not even exist in the
just-identified case, a signal that there may be finite-sample problems. The mean does exist
if  there  is  at  least  one  overidentifying  restriction,  and  the  variance  exists  if  there  are  at 
least  two  overidentifying  restrictions.  Even  when  the  mean  exists  the  IV  estimator  is 
biased,  with  bias  in  the  direction  of  OLS.  With  more  overidentifying  restrictions  the 
bias  increases,  eventually  equaling  the  bias  of  the  OLS  estimator.  A  detailed  discussion 
is  given  in  Davidson  and  MacKinnon  (1993,  pp.  221-224).  Approximations  based  on 
power-series  expansions  have  also  been  used. 

What determines the size of the finite-sample bias? For regression with a single
regressor x that is endogenous and is related to the instruments z by the reduced form
model $\mathbf{x} = \mathbf{Z}\boldsymbol{\pi} + \mathbf{v}$, the concentration parameter $\tau^2$ is defined as $\tau^2 = \boldsymbol{\pi}'\mathbf{Z}'\mathbf{Z}\boldsymbol{\pi}/\sigma_v^2$.
The bias of IV can be shown to be a decreasing function of $\tau^2$. The quantity $\tau^2/K$,
where K is the number of instruments, is the population analogue of the F-statistic
for a test of whether $\boldsymbol{\pi} = \mathbf{0}$. The statistic $F - 1$, where F is the actual F-statistic in
the first-stage reduced form model, can be shown to be an approximately unbiased
estimate of $\tau^2/K$. This leads to tests for finite-sample bias being based on the
F-statistic given in Section 4.9.1.

Staiger and Stock (1997) obtained results under weaker distributional assumptions.
In particular, normality is no longer needed. Their approach uses weak instrument
asymptotics that find the limit distribution of IV estimators for a sequence of models
with $\tau^2/K$ held constant as $N \to \infty$. In a simple model $1/F$ provides an approximate
estimate of the finite-sample bias of the IV estimator relative to OLS. More generally,
the extent of the bias for given F varies with the number of endogenous regressors and
the number of instruments. Simulations show that to ensure that the maximal bias in
IV is no more than 10% that of OLS we need $F > 10$. This threshold is widely cited
but falls to around 6.5, for example, if one is comfortable with bias in IV of 20% of
that for OLS. So a less strict rule of thumb is $F > 5$. Shea (1997) demonstrated that
low partial $R^2$ is also associated with finite-sample bias but there is no similar rule of
thumb for use of partial $R^2$ as a diagnostic for finite-sample bias.
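
The direction of the finite-sample bias can be illustrated with a small Monte Carlo experiment. The design below is hypothetical and is not taken from the studies cited: it uses five weak instruments for a single endogenous regressor with N = 100, no intercept since all variables have mean zero, and it reports medians across replications so that the summary does not depend on the existence of moments.

```python
import numpy as np

def simulate_once(rng, N=100, L=5, pi=0.05, beta=0.5, rho=0.8):
    """One replication: returns (OLS, 2SLS) slope estimates under weak instruments."""
    Z = rng.normal(size=(N, L))
    v = rng.normal(size=N)
    u = rho * v + np.sqrt(1.0 - rho**2) * rng.normal(size=N)   # Cor[u, v] = rho
    x = Z @ np.full(L, pi) + v                                  # weak first stage
    y = beta * x + u
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]            # first-stage fitted values
    b_2sls = (x_hat @ y) / (x_hat @ x)
    b_ols = (x @ y) / (x @ x)
    return b_ols, b_2sls

rng = np.random.default_rng(2024)
draws = np.array([simulate_once(rng) for _ in range(2000)])

bias_ols = np.median(draws[:, 0]) - 0.5
bias_2sls = np.median(draws[:, 1]) - 0.5
print(f"median bias: OLS {bias_ols:.3f}, 2SLS {bias_2sls:.3f}")
```

In this design the concentration parameter is small, so the 2SLS estimates are pulled most of the way toward the inconsistent OLS estimate even though the instruments are valid.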

For models with more than one endogenous regressor, separate F-statistics can be
computed for each endogenous regressor. For a joint statistic Stock, Wright, and Yogo
(2002) propose using the minimum eigenvalue of a matrix analogue of the first-stage
test F-statistic. Stock and Yogo (2003) present relevant critical values for this
eigenvalue as the desired degree of bias, the number of endogenous variables, and the
number of overidentifying restrictions vary. These tables include the single endogenous
regressor as a special case and presume at least two overidentifying restrictions, so
they do not apply to just-identified models.

Finite-sample bias problems arise not only for the IV estimate but also for IV
standard errors and test statistics. Stock et al. (2002) present a similar approach to Wald
tests whereby a test of $\beta = \beta_0$ at a nominal level of 5% is to have actual size of, say,
no more than 15%. Stock and Yogo (2003) also present detailed tables taking this size
distortion approach that include just-identified models.



4.9.5.  Responses  to  Weak  Instruments 

What  can  the  practitioner  do  in  the  face  of  weak  instruments? 

As  already  noted  one  approach  is  to  limit  the  number  of  instruments  used.  This  can 
be  done  by  dropping  instruments  or  by  combining  instruments. 

If finite-sample bias is a concern then alternative estimators may have better
small-sample properties than 2SLS. A number of alternatives, many variants of IV, are
presented in Section 6.4.4.

Despite  the  emphasis  on  finite-sample  bias  the  other  problems  created  by  weak 
instruments  may  be  of  greater  importance  in  applications.  It  is  possible  with  a  large 
enough sample for the first-stage reduced form F-statistic to be large enough that
finite-sample  bias  is  not  a  problem.  Meanwhile,  the  partial  R2  may  be  very  small, 
leading  to  fragility  to  even  slight  correlation  between  the  model  error  and  instrument. 
This  is  difficult  to  test  for  and  to  overcome. 

There  also  can  be  great  loss  in  estimator  precision,  as  detailed  in  Sections  4.9.3 
and  4.9.4.  In  such  cases  either  larger  samples  are  needed  or  alternative  approaches  to 
estimating  causal  marginal  effects  must  be  used.  These  methods  are  summarized  in 
Section  2.8  and  presented  elsewhere  in  this  book. 


4.9.6.  IV  Application 

Kling  (2001)  analyzed  in  detail  the  use  of  college  proximity  as  an  instrument  for 
schooling.  Here  we  use  the  same  data  from  the  NLS  young  men’s  cohort  on  3,010 
males  aged  24  to  34  years  old  in  1976  as  used  to  produce  Table  1  of  Kling  (2001)  and 
originally used by Card (1995). The model estimated is

$$\ln w_i = \alpha + \beta_1 s_i + \beta_2 e_i + \beta_3 e_i^2 + \mathbf{x}_{2i}'\boldsymbol{\gamma} + u_i,$$

where s denotes years of schooling, e denotes years of work experience, $e^2$ denotes
experience squared, and $\mathbf{x}_2$ is a vector of 26 control variables that are mainly geographic
indicators and measures of parental education.

The schooling variable is considered endogenous, owing to lack of data on ability.
Additionally, the two work experience variables are endogenous, since work experience
is calculated as age minus years of schooling minus six, as is common in this
literature, and schooling is endogenous. At least three instruments are needed.

Here exactly three instruments are used, so the model is just-identified. The first
instrument is col4, an indicator for whether a four-year college is nearby. This
instrument has already been discussed in Section 4.8.2. The other two instruments are age
and age squared. These are highly correlated with experience and experience squared,
yet it is believed they can be omitted from the model for log-wage since it is work
experience that matters. The remaining regressor vector $\mathbf{x}_2$ is used as an
instrument for itself.

Although  age  is  clearly  exogenous,  some  unobservables  such  as  social  skills  may  be 
correlated  with  both  age  and  wage.  Then  the  use  of  age  and  age  squared  as  instruments 
can  be  questioned.  This  illustrates  the  general  point  that  there  can  be  disagreement  on 
assumptions of instrument validity.

Table 4.5. Returns to Schooling: Instrumental Variables Estimates (a)

                                     OLS        IV
    Schooling (s)                    0.073      0.132
                                    (0.004)    (0.049)
    R2                               0.304      0.207
    Shea's partial R2                -          0.006
    First-stage F-statistic for s    -          8.07

(a) Sample of 3,010 young males. Dependent variable is log hourly wage. Coefficient
and standard error for schooling given; estimates for experience, experience squared,
26 control variables, and an intercept are not reported. For the three endogenous
regressors - schooling (s), experience (e), and experience squared (e2) - the three
instruments are an indicator for whether a four-year college (col4) is nearby, age,
and age squared. The partial R2 and first-stage F-statistic are weak instruments
diagnostics explained in the text.


Results are given in Table 4.5. The OLS estimate of $\beta_1$ is 0.073, so that wages
rise by 7.6% ($= 100 \times (e^{0.073} - 1)$) on average with each extra year of schooling.
This estimate is an inconsistent estimate of $\beta_1$ given omitted ability. The IV
estimate, or equivalently the 2SLS estimate since the model is just-identified, is 0.132.
An extra year of schooling is estimated to lead to a 14.1% ($= 100 \times (e^{0.132} - 1)$)
increase in wage.
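
The conversion from a log-wage coefficient to a percentage change in the wage can be checked directly. The following is a minimal sketch; the helper name `pct_effect` is hypothetical and the two coefficients are the schooling estimates reported in Table 4.5.

```python
import math

def pct_effect(beta):
    """Exact percentage change in w implied by a one-unit change in a
    regressor with coefficient beta in a model for ln(w)."""
    return 100.0 * (math.exp(beta) - 1.0)

ols_pct = pct_effect(0.073)  # OLS schooling coefficient from Table 4.5
iv_pct = pct_effect(0.132)   # IV schooling coefficient from Table 4.5
print(round(ols_pct, 1), round(iv_pct, 1))
```

For small coefficients the exact figure is close to 100 times the coefficient itself; the gap widens as the coefficient grows, which is why the IV figure (14.1%) differs more from 13.2% than the OLS figure (7.6%) differs from 7.3%.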

The IV estimator is much less efficient than OLS. A formal test does not reject
homoskedasticity and we follow Kling (2001) and use the usual standard errors, which
are very close to the heteroskedastic-robust standard errors. The standard error of
$\hat{\beta}_{1,\mathrm{OLS}}$ is 0.004 whereas that for $\hat{\beta}_{1,\mathrm{IV}}$ is 0.049,
over 10 times larger. The standard errors for the other two endogenous regressors are
about 4 times larger and the standard errors for the exogenous regressors are about
1.2 times larger. The $R^2$ falls from 0.304 to 0.207.

$R^2$ measures confirm that the instruments are not very relevant for schooling. A
simple test is to note that the regression (4.58) of schooling on all of the instruments
yields $R^2 = 0.297$, which only falls a little to $R^2 = 0.291$ if the three additional
instruments are dropped. More formally, Shea's partial $R^2$ here equals $0.0064 = 0.08^2$,
which from (4.62) predicts that the standard error of $\hat{\beta}_{1,\mathrm{IV}}$ will be
inflated by a multiple $12.5 = 1/0.08$, very close to the inflation observed here. This
reduces the t-statistic on schooling from 19.64 to 2.68. In many applications such a
reduction would lead to statistical insignificance. In addition, from Section 4.9.2 even
slight correlation between the instrument $col4_i$ and the error term $u_i$ will lead to
inconsistency of IV.
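
The predicted and observed standard error inflation quoted above can be reproduced with a few lines of arithmetic. This is only a sketch using the rounded figures reported in the text and in Table 4.5; the variable names are illustrative.

```python
import math

partial_r2 = 0.0064                # Shea's partial R2 for schooling (from the text)
se_ols, se_iv = 0.004, 0.049       # rounded standard errors from Table 4.5

# Predicted multiple by which the IV standard error exceeds the OLS one.
predicted_inflation = 1.0 / math.sqrt(partial_r2)
# Inflation implied by the rounded standard errors actually reported.
observed_inflation = se_iv / se_ols

print(predicted_inflation, observed_inflation)
```

The two multiples, 12.5 and roughly 12.25, agree closely, which is the sense in which the partial $R^2$ anticipates the loss of precision.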

To see whether finite-sample bias may also be a problem we run the regression
(4.58) of schooling on all of the instruments. Testing the joint significance of the three
additional instruments yields an F-statistic of 8.07, suggesting that the bias of IV may
be 10 or 20% that of OLS. A similar regression for the other two endogenous variables
yields much higher F-statistics since, for example, age is a good additional instrument
for experience. Given that there are three endogenous regressors it is actually better
to use the method of Stock et al. (2002) discussed in Section 4.9.4, though here the
problem is restricted to schooling since for experience and experience squared,
respectively, Shea's partial $R^2$ equals 0.0876 and 0.0138, whereas the first-stage
F-statistics are 1,772 and 1,542.
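
The first-stage F diagnostic can be illustrated on simulated data. The sketch below is not the Kling (2001) application: it assumes a hypothetical design with one endogenous regressor $x$, one instrument $z$, a true coefficient of 0.5, and a correlation of 0.8 between the structural and first-stage errors. All function and variable names are illustrative.

```python
import random

def simulate(n, pi, beta=0.5, rho=0.8, seed=0):
    """Draw (z, x, y) with x = pi*z + v and y = beta*x + u, corr(u, v) = rho."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        vi = rng.gauss(0.0, 1.0)
        ui = rho * vi + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        xi = pi * zi + vi
        z.append(zi)
        x.append(xi)
        y.append(beta * xi + ui)
    return z, x, y

def cov(a, b):
    """Sample covariance (divisor N) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def estimates(z, x, y):
    """Return (OLS slope, IV slope, first-stage F) for a model with intercept."""
    n = len(x)
    b_ols = cov(x, y) / cov(x, x)
    b_iv = cov(z, y) / cov(z, x)
    pi_hat = cov(z, x) / cov(z, z)                 # first-stage slope
    rss = n * (cov(x, x) - pi_hat ** 2 * cov(z, z))  # first-stage residual SS
    s2 = rss / (n - 2)
    f_stat = pi_hat ** 2 * n * cov(z, z) / s2      # equals t^2 with one instrument
    return b_ols, b_iv, f_stat

strong = estimates(*simulate(5000, pi=1.0))
weak = estimates(*simulate(5000, pi=0.02))
print("strong instrument: OLS %.3f, IV %.3f, F %.1f" % strong)
print("weak instrument:   OLS %.3f, IV %.3f, F %.1f" % weak)
```

With the strong instrument the F-statistic is very large and IV recovers a value near 0.5 while OLS does not. With the weak instrument the F-statistic is small and the IV estimate should not be trusted, which mirrors the concern raised by the value of 8.07 for schooling.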

If  additional  instruments  are  available  then  the  model  becomes  overidentified  and 
standard  procedure  is  to  additionally  perform  a  test  of  overidentifying  restrictions  (see 
Section  8.4.4). 


4.10.  Practical  Considerations 

The estimation procedures in this chapter are implemented in all standard econometrics
packages for cross-section data, except that not all packages implement quantile
regression. Most provide robust standard errors as an option rather than the default.

The most difficult estimator to apply can be the instrumental variables estimator, as
in many potential applications it can be difficult to obtain instruments that are
uncorrelated with the error yet reasonably correlated with the regressor or regressors
being instrumented. Such instruments can be obtained through specification of a complete
structural model, such as a simultaneous equations system. Current applied research
emphasizes alternative approaches such as natural experiments.


4.11.  Bibliographic  Notes 

The results in this chapter are presented in many first-year graduate texts, such as those by
Davidson and MacKinnon (2004), Greene (2003), Hayashi (2000), Johnston and DiNardo
(1997), Mittelhammer, Judge, and Miller (2000), and Ruud (2000). We have emphasized
regression with stochastic regressors, robust standard errors, quantile regression, endogeneity,
and instrumental variables.

4.2 Manski (1991) has a nice discussion of regression in a general setting that includes
discussion of the loss functions given in Section 4.2.

4.3  The  returns  to  schooling  example  is  well  studied.  Angrist  and  Krueger  (1999)  and  Card 
(1999)  provide  recent  surveys. 

4.4 For a history of least squares see Stigler (1986). The method was introduced by Legendre
in 1805. Gauss in 1810 applied least squares to the linear model with normally distributed
error and proposed the elimination method for computation, and in later work he proposed
the theorem now called the Gauss-Markov theorem. Galton introduced the concept of
regression, meaning mean-reversion in the context of inheritance of family traits, in 1887.
For an early "modern" treatment with application to pauperism and welfare availability see
Yule (1897). Statistical inference based on least-squares estimates of the linear regression
model was developed most notably by Fisher. The heteroskedastic-consistent estimate of
the variance matrix of the OLS estimator, due to White (1980a) building on earlier work
by Eicker (1963), has had a profound impact on statistical inference in microeconometrics
and has been extended to many settings.

4.6 Boscovich in 1757 proposed a least absolute deviations estimator that predates least
squares; see Stigler (1986). A review of quantile regression, introduced by Koenker and
Bassett (1978), is given in Buchinsky (1994). A more elementary exposition is given in
Koenker and Hallock (2001).

4.7 The earliest known use of instrumental variables estimation to secure identification in a
simultaneous equations setting was by Wright (1928). Another oft-cited early reference is
Reiersol (1941), who used instrumental variables methods to control for measurement error
in the regressors. Sargan (1958) gives a classic early treatment of IV estimation. Stock and
Trebbi (2003) provide additional early references.

4.8 Instrumental variables estimation is presented in econometrics texts, with emphasis on
algebra but not necessarily intuition. The method is widely used in econometrics because of
the desirability of obtaining estimates with a causal interpretation.

4.9 The problem of weak instruments was drawn to the attention of applied researchers by
Nelson and Startz (1990) and Bound et al. (1995). There are a number of theoretical
antecedents, most notably the work of Nagar (1959). The problem has dampened enthusiasm
for IV estimation, and small-sample bias owing to weak instruments is currently a very
active research topic. Results often assume iid normal errors and restrict analysis to one
endogenous regressor. The survey by Stock et al. (2002) provides many references with
emphasis on weak instrument asymptotics. It also briefly considers extensions to nonlinear
models. The survey by Hahn and Hausman (2003b) presents additional methods and results
that we have not reviewed here. For recent work on bias in standard errors see Bond and
Windmeijer (2002). For a careful application see C.-I. Lee (2001).


- Exercises - 

4-1 Consider the linear regression model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$ with nonstochastic
regressors $\mathbf{x}_i$ and error $u_i$ that has mean zero but is correlated as follows:
$\mathrm{E}[u_i u_j] = \sigma^2$ if $i = j$, $\mathrm{E}[u_i u_j] = \rho\sigma^2$ if $|i - j| = 1$, and
$\mathrm{E}[u_i u_j] = 0$ if $|i - j| > 1$. Thus errors for immediately adjacent observations are
correlated whereas errors are otherwise uncorrelated. In matrix notation we have
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\boldsymbol{\Omega} = \mathrm{E}[\mathbf{u}\mathbf{u}']$. For this
model answer each of the following questions using results given in Section 4.4.

(a) Verify that $\boldsymbol{\Omega}$ is a band matrix with nonzero terms only on the diagonal and
on the first off-diagonal; and give these nonzero terms.

(b) Obtain the asymptotic distribution of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ using (4.19).

(c) State how to obtain a consistent estimate of $\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$ that does not
depend on unknown parameters.

(d) Is the usual OLS output estimate $s^2(\mathbf{X}'\mathbf{X})^{-1}$ a consistent estimate of
$\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$?

(e) Is White's heteroskedasticity robust estimate of $\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$
consistent here?

4-2 Suppose we estimate the model $y_i = \mu + u_i$, where $u_i \sim \mathcal{N}[0, \sigma_i^2]$.

(a) Show that the OLS estimator of $\mu$ simplifies to $\hat{\mu} = \bar{y}$.

(b) Hence directly obtain the variance of $\bar{y}$. Show that this equals White's
heteroskedastic consistent estimate of the variance given in (4.21).

4-3 Suppose the dgp is $y_i = \beta_0 x_i + u_i$, $u_i = x_i\varepsilon_i$, $x_i \sim \mathcal{N}[0, 1]$, and
$\varepsilon_i \sim \mathcal{N}[0, 1]$. Assume that data are independent over $i$ and that $x_i$ is
independent of $\varepsilon_i$. Note that the first four central moments of $\mathcal{N}[0, \sigma^2]$
are $0$, $\sigma^2$, $0$, and $3\sigma^4$.

(a) Show that the error term $u_i$ is conditionally heteroskedastic.

(b) Obtain plim $N^{-1}\mathbf{X}'\mathbf{X}$. [Hint: Obtain $\mathrm{E}[x_i^2]$ and apply a law of large numbers.]

(c) Obtain $\sigma_0^2 = \mathrm{V}[u_i]$, where the expectation is with respect to all stochastic
variables in the model.

(d) Obtain plim $N^{-1}\mathbf{X}'\boldsymbol{\Omega}_0\mathbf{X} = \lim N^{-1}\mathrm{E}[\mathbf{X}'\boldsymbol{\Omega}_0\mathbf{X}]$,
where $\boldsymbol{\Omega}_0 = \mathrm{Diag}[\mathrm{V}[u_i \mid x_i]]$.

(e) Using answers to the preceding parts give the default OLS result (4.22) for the
variance matrix in the limit distribution of $\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta_0)$, ignoring
potential heteroskedasticity. Your ultimate answer should be numerical.

(f) Now give the variance in the limit distribution of $\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta_0)$,
taking account of any heteroskedasticity. Your ultimate answer should be numerical.

(g) Do any differences between answers to parts (e) and (f) accord with your
prior beliefs?

4-4 Consider the linear regression model with scalar regressor $y_i = \beta x_i + u_i$ with data
$(y_i, x_i)$ iid over $i$ though the error may be conditionally heteroskedastic.

(a) Show that $(\hat{\beta}_{\mathrm{OLS}} - \beta) = (N^{-1}\sum_i x_i^2)^{-1} N^{-1}\sum_i x_i u_i$.

(b) Apply Kolmogorov law of large numbers (Theorem A.8) to the averages of $x_i^2$
and $x_i u_i$ to show that $\hat{\beta}_{\mathrm{OLS}} \xrightarrow{p} \beta$. State any additional assumptions
made on the dgp for $x_i$ and $u_i$.

(c) Apply the Lindeberg-Levy central limit theorem (Theorem A.14) to the averages
of $x_i u_i$ to show that
$N^{-1}\sum_i x_i u_i \big/ \sqrt{N^{-2}\sum_i \mathrm{E}[u_i^2 x_i^2]} \xrightarrow{d} \mathcal{N}[0, 1]$. State any
additional assumptions made on the dgp for $x_i$ and $u_i$.

(d) Use the product limit normal rule (Theorem A.17) to show that part (c) implies
$N^{-1/2}\sum_i x_i u_i \xrightarrow{d} \mathcal{N}[0, \lim N^{-1}\sum_i \mathrm{E}[u_i^2 x_i^2]]$. State any
assumptions made on the dgp for $x_i$ and $u_i$.

(e) Combine results using (2.14) and the product limit normal rule (Theorem
A.17) to obtain the limit distribution of $\hat{\beta}$.

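4-5 Consider the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$.

(a) Obtain the formula for $\hat{\boldsymbol{\beta}}$ that minimizes $Q(\boldsymbol{\beta}) = \mathbf{u}'\mathbf{W}\mathbf{u}$, where
$\mathbf{W}$ is of full rank. [Hint: The chain rule for matrix differentiation for column
vectors $\mathbf{x}$ and $\mathbf{z}$ is
$\partial f(\mathbf{x})/\partial\mathbf{x} = (\partial\mathbf{z}'/\partial\mathbf{x}) \times (\partial f(\mathbf{z})/\partial\mathbf{z})$, for
$f(\mathbf{x}) = f(g(\mathbf{x})) = f(\mathbf{z})$ where $\mathbf{z} = g(\mathbf{x})$.]

(b) Show that this simplifies to the OLS estimator if $\mathbf{W} = \mathbf{I}$.

(c) Show that this gives the GLS estimator if $\mathbf{W} = \boldsymbol{\Omega}^{-1}$.

(d) Show that this gives the 2SLS estimator if $\mathbf{W} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$.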

4-6 Consider IV estimation (Section 4.8) of the model $y = \mathbf{x}'\boldsymbol{\beta} + u$ using instruments
$\mathbf{z}$ in the just-identified case with $\mathbf{Z}$ an $N \times K$ matrix of full rank.

(a) What essential assumptions must $\mathbf{z}$ satisfy for the IV estimator to be
consistent for $\boldsymbol{\beta}$? Explain.

(b) Show that given just identification the 2SLS estimator defined in (4.53)
reduces to the IV estimator given in (4.51).

(c) Give a real-world example of a situation where IV estimation is needed
because of inconsistency of OLS, and specify suitable instruments.

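4-7 (Adapted from Nelson and Startz, 1990.) Consider the three-equation model,
$y = \beta x + u$; $x = \lambda u + \varepsilon$; $z = \gamma\varepsilon + v$, where the mutually independent errors
$u$, $\varepsilon$, and $v$ are iid normal with mean 0 and variances, respectively, $\sigma_u^2$,
$\sigma_\varepsilon^2$, and $\sigma_v^2$.

(a) Show that plim$(\hat{\beta}_{\mathrm{OLS}} - \beta) = \lambda\sigma_u^2/(\lambda^2\sigma_u^2 + \sigma_\varepsilon^2)$.

(b) Show that
$\rho_{xz} = \gamma\sigma_\varepsilon^2 \big/ \sqrt{(\lambda^2\sigma_u^2 + \sigma_\varepsilon^2)(\gamma^2\sigma_\varepsilon^2 + \sigma_v^2)}$.

(c) Show that $\hat{\beta}_{\mathrm{IV}} = m_{zy}/m_{zx} = \beta + m_{zu}/(\lambda m_{zu} + m_{z\varepsilon})$, where, for
example, $m_{zy} = N^{-1}\sum_i z_i y_i$.

(d) Show that $\hat{\beta}_{\mathrm{IV}} - \beta \to 1/\lambda$ as $\gamma$ (or $\rho_{xz}$) $\to 0$.

(e) Show that $\hat{\beta}_{\mathrm{IV}} - \beta \to \infty$ as $m_{zu} \to -\gamma\sigma_\varepsilon^2/\lambda$.

(f) What do the last two results imply regarding finite-sample biases and the
moments of $\hat{\beta}_{\mathrm{IV}} - \beta$ when the instruments are poor?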

4-8 Select a 50% random subsample of the Section 4.6.4 data on log health
expenditure (y) and log total expenditure (x).

(a)  Obtain  OLS  estimates  and  contrast  usual  and  White  standard  errors  for  the 
slope  coefficient. 

(b)  Obtain  median  regression  estimates  and  compare  these  to  the  OLS  esti¬ 
mates. 

(c)  Obtain  quantile  regression  estimates  for  q  =  0.25  and  q  =  0.75. 

(d)  Reproduce  Figure  4.2  using  your  answers  from  parts  (a)-(c). 

4-9 Select a 50% random subsample of the Section 4.9.6 data on earnings and
education, and reproduce as much of Table 4.5 as possible and provide appropriate
interpretation.




CHAPTER  5 


Maximum  Likelihood  and 
Nonlinear  Least-Squares 
Estimation 


5.1.  Introduction 

A nonlinear estimator is one that is a nonlinear function of the dependent variable.
Most estimators used in microeconometrics, aside from the OLS and IV estimators in
the linear regression model presented in Chapter 4, are nonlinear estimators.
Nonlinearity can arise in many ways. The conditional mean may be nonlinear in parameters.
The loss function may lead to a nonlinear estimator even if the conditional mean is
linear in parameters. Censoring and truncation also lead to nonlinear estimators even
if the original model has conditional mean that is linear in parameters.

Here we present the essential statistical inference results for nonlinear estimation.
Very limited small-sample results are available for nonlinear estimators. Statistical
inference is instead based on asymptotic theory that is applicable for large samples. The
estimators commonly used in microeconometrics are consistent and asymptotically
normal.

The asymptotic theory entails two major departures from the treatment of the linear
regression model given in an introductory graduate course. First, alternative methods
of proof are needed since there is no direct formula for most nonlinear estimators.
Second, the asymptotic distribution is generally obtained under the weakest
distributional assumptions possible. This departure was introduced in Section 4.4 to permit
heteroskedasticity-robust inference for the OLS estimator. Under such weaker
assumptions the default standard errors reported by a simple regression program are invalid.
Some care is needed, however, as these weaker assumptions can lead to inconsistency
of the estimator itself, a much more fundamental problem.

As much as possible the presentation here is expository. Definitions of convergence
in probability and distribution, laws of large numbers (LLN), and central limit
theorems (CLT) are presented in many texts, and here these topics are relegated to
Appendix A. Applied researchers rarely aim to formally prove consistency and
asymptotic normality. It is not unusual, however, to encounter data applications with
estimation problems sufficiently recent or complex as to demand reading recent econometric
journal articles. Then familiarity with proofs of consistency and asymptotic normality
is very helpful, especially to obtain a good idea in advance of the likely form of the
variance matrix of the estimator.

Section 5.2 provides an overview of key results. A more formal treatment of
extremum estimators that maximize or minimize any objective function is given in
Section 5.3. Estimators based on estimating equations are defined and presented in
Section 5.4. Statistical inference based on robust standard errors is presented briefly in
Section 5.5, with complete treatment deferred to Chapter 7. Maximum likelihood
estimation and quasi-maximum likelihood estimation are presented in Sections 5.6 and
5.7. Nonlinear least-squares estimation is given in Section 5.8. Section 5.9 presents a
detailed example.

The  remaining  leading  parametric  estimation  procedures  -  generalized  method 
of  moments  and  nonlinear  instrumental  variables  -  are  given  separate  treatment  in 
Chapter  6. 


5.2.  Overview  of  Nonlinear  Estimators 

This section provides a summary of asymptotic properties of nonlinear estimators,
given more rigorously in Section 5.3, and presents ways to interpret regression
coefficients in nonlinear models. The material is essential for understanding use of the
cross-section and panel data models presented in later chapters.


5.2.1.  Poisson  Regression  Example 

It  is  helpful  to  introduce  a  specific  example  of  nonlinear  estimation.  Here  we  consider 
Poisson  regression,  analyzed  in  more  detail  in  Chapter  20. 

The Poisson distribution is appropriate for a dependent variable $y$ that takes only
nonnegative integer values 0, 1, 2, .... It can be used to model the number of
occurrences of an event, such as number of patent applications by a firm and number of
doctor visits by an individual.

The Poisson density, or more formally the Poisson probability mass function, with
rate parameter $\lambda$ is

$$f(y \mid \lambda) = e^{-\lambda}\lambda^{y}/y!, \qquad y = 0, 1, 2, \ldots,$$

where it can be shown that $\mathrm{E}[y] = \lambda$ and $\mathrm{V}[y] = \lambda$.

A regression model specifies the parameter $\lambda$ to vary across individuals according
to a specific function of regressor vector $\mathbf{x}$ and parameter vector $\boldsymbol{\beta}$. The usual
Poisson specification is

$$\lambda = \exp(\mathbf{x}'\boldsymbol{\beta}),$$

which has the advantage of ensuring that the mean $\lambda > 0$. The density of the Poisson
regression model for a single observation is therefore

$$f(y \mid \mathbf{x}, \boldsymbol{\beta}) = e^{-\exp(\mathbf{x}'\boldsymbol{\beta})}\exp(\mathbf{x}'\boldsymbol{\beta})^{y}/y!. \qquad (5.1)$$

Consider maximum likelihood estimation based on the sample
$\{(y_i, \mathbf{x}_i), i = 1, \ldots, N\}$. The maximum likelihood (ML) estimator maximizes the
log-likelihood function (see Section 5.6). The likelihood function is the joint density,
which given independent observations is the product $\prod_i f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta})$ of the
individual densities, where we have conditioned on the regressors. The log-likelihood
function is then the log of a product, which equals the sum of logs, or
$\sum_i \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta})$.

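For the Poisson density (5.1), the log-density for the $i$th observation is

$$\ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}) = -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!.$$

So the Poisson ML estimator $\hat{\boldsymbol{\beta}}$ maximizes

$$Q_N(\boldsymbol{\beta}) = \frac{1}{N}\sum_{i=1}^{N}\left\{-\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\right\}, \qquad (5.2)$$

where the scale factor $1/N$ is included so that $Q_N(\boldsymbol{\beta})$ remains finite as $N \to \infty$. The
Poisson ML estimator is the solution to the first-order conditions
$\partial Q_N(\boldsymbol{\beta})/\partial\boldsymbol{\beta}|_{\hat{\boldsymbol{\beta}}} = \mathbf{0}$, or

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})\right)\mathbf{x}_i = \mathbf{0}. \qquad (5.3)$$

There is no explicit solution for $\hat{\boldsymbol{\beta}}$ in (5.3). Numerical methods to compute
$\hat{\boldsymbol{\beta}}$ are given in Chapter 10. In this chapter we instead focus on the statistical
properties of the resulting estimate $\hat{\boldsymbol{\beta}}$.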
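
Although the numerical methods are deferred to Chapter 10, a small sketch may help fix ideas about what solving (5.3) involves. The code below assumes a hypothetical design with an intercept and one standard normal regressor, true coefficients (0.5, 0.3), and a plain Newton-Raphson iteration. It is one possible way to solve the first-order conditions, not necessarily the method the book later recommends, and all names are illustrative.

```python
import math
import random

def draw_poisson(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def score(b0, b1, x, y):
    """Left-hand side of (5.3) when the regressor vector is (1, x_i)'."""
    n = len(y)
    g0 = sum(yi - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y)) / n
    g1 = sum((yi - math.exp(b0 + b1 * xi)) * xi for xi, yi in zip(x, y)) / n
    return g0, g1

def poisson_mle(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the Poisson ML first-order conditions (5.3)."""
    n = len(y)
    b0, b1 = math.log(sum(y) / n), 0.0       # start at the intercept-only solution
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            g0 += yi - mu                    # gradient sums
            g1 += (yi - mu) * xi
            h00 += mu                        # minus the Hessian: sum mu * x x'
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det     # step = (sum mu x x')^{-1} gradient
        d1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

rng = random.Random(42)
true_b0, true_b1 = 0.5, 0.3
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [draw_poisson(math.exp(true_b0 + true_b1 * xi), rng) for xi in x]

b0_hat, b1_hat = poisson_mle(x, y)
g0, g1 = score(b0_hat, b1_hat, x, y)
print("estimates:", round(b0_hat, 3), round(b1_hat, 3))
print("score at the estimate:", g0, g1)
```

The printed score is numerically zero at the estimate, which is all that (5.3) requires; the estimates themselves differ from the true values only by sampling error.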


5.2.2. m-Estimators


More generally, we define an m-estimator $\hat{\boldsymbol{\theta}}$ of the $q \times 1$ parameter vector
$\boldsymbol{\theta}$ as an estimator that maximizes an objective function that is a sum or average of
$N$ subfunctions

$$Q_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} q(y_i, \mathbf{x}_i, \boldsymbol{\theta}), \qquad (5.4)$$

where $q(\cdot)$ is a scalar function, $y_i$ is the dependent variable, $\mathbf{x}_i$ is a regressor vector,
and the results in this section assume independence over $i$.

For simplicity $y_i$ is written as a scalar, but the results extend to vector $\mathbf{y}_i$ and so
cover multivariate and panel data and systems of equations. The objective function is
subscripted by $N$ to denote that it depends on the sample data. Throughout the book
$q$ is used to denote the dimension of $\boldsymbol{\theta}$. Note that here $q$ is additionally being used to
denote the subfunction $q(\cdot)$ in (5.4).

Many econometrics estimators and models are m-estimators, corresponding to
specific functional forms for $q(y, \mathbf{x}, \boldsymbol{\theta})$. Leading examples are maximum likelihood (see
(5.39) later) and nonlinear least squares (NLS) (see (5.67) later). The Poisson ML
estimator that maximizes (5.2) is an example of (5.4) with $\boldsymbol{\theta} = \boldsymbol{\beta}$ and
$q(y, \mathbf{x}, \boldsymbol{\beta}) = -\exp(\mathbf{x}'\boldsymbol{\beta}) + y\mathbf{x}'\boldsymbol{\beta} - \ln y!$.

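We focus attention on the estimator $\hat{\boldsymbol{\theta}}$ that is computed as the solution to the
associated first-order conditions $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}$, or equivalently

$$\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial q(y_i, \mathbf{x}_i, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}. \qquad (5.5)$$

This is a system of $q$ equations in $q$ unknowns that generally has no explicit solution
for $\hat{\boldsymbol{\theta}}$.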

The term m-estimator, attributed to Huber (1967), is interpreted as an abbreviation
for maximum-likelihood-like estimator. Many econometrics authors, including
Amemiya (1985, p. 105), Greene (2003, p. 461), and Wooldridge (2002, p. 344), define
an m-estimator as optimizing over a sum of terms, as in (5.4). Other authors, including
Serfling (1980), define an m-estimator as solutions of equations such as (5.5). Huber
(1967) considered both cases and Huber (1981, p. 43) explicitly defined an m-estimator
in both ways. In this book we call the former type of estimator an m-estimator
and the latter an estimating equations estimator (which will be treated separately in
Section 5.4).


5.2.3.  Asymptotic  Properties  of  m-Estimators 

The  key  desirable  asymptotic  properties  of  an  estimator  are  that  it  be  consistent  and 
that  it  have  an  asymptotic  distribution  to  permit  statistical  inference  at  least  in  large 
samples. 


Consistency 

The first step in determining the properties of $\hat{\boldsymbol{\theta}}$ is to define exactly what
$\hat{\boldsymbol{\theta}}$ is intended to estimate. We suppose that there is a unique value of $\boldsymbol{\theta}$,
denoted $\boldsymbol{\theta}_0$ and called the true parameter value, that generates the data. This
identification condition (see Section 2.5) requires both correct specification of the
component of the dgp of interest and uniqueness of this representation. Thus for the
Poisson example it may be assumed that the dgp is one with Poisson parameter
$\exp(\mathbf{x}'\boldsymbol{\beta}_0)$ and $\mathbf{x}$ is such that $\mathbf{x}'\boldsymbol{\beta}^{(1)} = \mathbf{x}'\boldsymbol{\beta}^{(2)}$ if and only if
$\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)}$.

The formal notation with subscript 0 for the true parameter value is used extensively
in Chapters 5 to 8. The motivation is that $\boldsymbol{\theta}$ can take many different values, but interest
lies in two particular values - the true value $\boldsymbol{\theta}_0$ and the estimated value $\hat{\boldsymbol{\theta}}$.

The estimate $\hat{\boldsymbol{\theta}}$ will never exactly equal $\boldsymbol{\theta}_0$, even in large samples, because of the
intrinsic randomness of a sample. Instead, we require $\hat{\boldsymbol{\theta}}$ to be consistent for
$\boldsymbol{\theta}_0$ (see Definition A.2 in Appendix A), meaning that $\hat{\boldsymbol{\theta}}$ must converge in
probability to $\boldsymbol{\theta}_0$, denoted $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}_0$.

Rigorously establishing consistency of m-estimators is difficult. Formal results are
given in Section 5.3.2 and a useful informal condition is given in Section 5.3.7.
Specializations to ML and NLS estimators are given in later sections.


Limit  Normal  Distribution 

Given consistency, as $N \to \infty$ the estimator $\hat{\boldsymbol{\theta}}$ has a distribution with all mass at
$\boldsymbol{\theta}_0$. As for OLS, we magnify or rescale $\hat{\boldsymbol{\theta}}$ by multiplication by $\sqrt{N}$ to obtain a
random variable that has nondegenerate distribution as $N \to \infty$. Statistical inference is
then conducted assuming $N$ is large enough for asymptotic theory to provide a good
approximation, but not so large that $\hat{\boldsymbol{\theta}}$ collapses on $\boldsymbol{\theta}_0$.

We therefore consider the behavior of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$. For most estimators this has a
finite-sample distribution that is too complicated to use for inference. Instead,
asymptotic theory is used to obtain the limit of this distribution as $N \to \infty$. For most
microeconometrics estimators this limit is the multivariate normal distribution. More
formally $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ converges in distribution to the multivariate normal, where
convergence in distribution is defined in Appendix A.
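
The role of the $\sqrt{N}$ rescaling can be seen in a small Monte Carlo sketch. The setup is an assumption made purely for illustration: the estimator is the sample mean of exponential draws with true mean $\theta_0 = 1$ (and variance 1), the sample sizes are 100 and 1,600, and 400 replications are used. The function name `scaled_errors` is hypothetical.

```python
import random
import statistics

def scaled_errors(n, reps, rng):
    """sqrt(n) * (sample mean - 1) for `reps` samples of n exponential(1) draws."""
    out = []
    for _ in range(reps):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(n ** 0.5 * (mean - 1.0))
    return out

rng = random.Random(12345)
results = {}
for n in (100, 1600):
    scaled = scaled_errors(n, 400, rng)
    var_scaled = statistics.pvariance(scaled)  # variance of sqrt(N)(theta_hat - theta_0)
    var_raw = var_scaled / n                   # variance of theta_hat - theta_0 itself
    results[n] = (var_raw, var_scaled)
    print(n, round(var_raw, 5), round(var_scaled, 3))
```

The unscaled variance shrinks toward zero as $N$ grows, reflecting collapse on $\theta_0$, whereas the variance of the rescaled quantity stays near 1, the variance of a single draw in this design.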

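Recall from Section 4.4 that the OLS estimator can be expressed as

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}_0) = \left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{x}_i u_i,$$

and the limit distribution was derived by obtaining the probability limit of the first term
on the right-hand side and the limit normal distribution of the second term. The limit
distribution of an m-estimator is obtained in a similar way. In Section 5.3.3 we show
that for an estimator that solves (5.5) we can always write

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = -\left(\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial^2 q_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^{+}}\right)^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left.\frac{\partial q_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0}, \qquad (5.6)$$

where $q_i(\boldsymbol{\theta}) = q(y_i, \mathbf{x}_i, \boldsymbol{\theta})$, for some $\boldsymbol{\theta}^{+}$ between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_0$, provided second
derivatives and the inverse exist. This result is obtained by a Taylor series expansion
of the first-order conditions (5.5) around $\boldsymbol{\theta}_0$.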

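Under appropriate assumptions this yields the following limit distribution of an
m-estimator:

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}], \qquad (5.7)$$

where $\mathbf{A}_0$ is the probability limit of the averaged second-derivative matrix that is
inverted in the first term in the right-hand side of (5.6), and the second term is assumed
to converge to the $\mathcal{N}[\mathbf{0}, \mathbf{B}_0]$ distribution. The sign of the first term does not affect the
variance in (5.7). The expressions for $\mathbf{A}_0$ and $\mathbf{B}_0$ are given in Table 5.1.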


Asymptotic  Normality 

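To obtain the distribution of $\hat{\boldsymbol{\theta}}$ from the limit distribution result (5.7), divide the
left-hand side of (5.7) by $\sqrt{N}$ and hence divide the variance by $N$. Then

$$\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\theta}_0, \mathrm{V}[\hat{\boldsymbol{\theta}}]], \qquad (5.8)$$

where $\overset{a}{\sim}$ means "is asymptotically distributed as," and $\mathrm{V}[\hat{\boldsymbol{\theta}}]$ denotes the asymptotic
variance of $\hat{\boldsymbol{\theta}}$ with

$$\mathrm{V}[\hat{\boldsymbol{\theta}}] = N^{-1}\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}. \qquad (5.9)$$

A complete discussion of the term asymptotic distribution has already been given in
Section 4.4.4, and is also given in Section A.6.4.

The result (5.9) depends on the unknown true parameter $\boldsymbol{\theta}_0$. It is implemented by
computing the estimated asymptotic variance

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}, \qquad (5.10)$$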

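where $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$ are consistent estimates of $\mathbf{A}_0$ and $\mathbf{B}_0$.

Table 5.1. Asymptotic Properties of m-Estimators

Property (a)              Algebraic Formula

Objective function        $Q_N(\boldsymbol{\theta}) = N^{-1}\sum_i q(y_i, \mathbf{x}_i, \boldsymbol{\theta})$ is maximized wrt $\boldsymbol{\theta}$

Examples                  ML: $q_i = \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$ is the log-density
                          NLS: $q_i = -(y_i - g(\mathbf{x}_i, \boldsymbol{\theta}))^2$ is minus the squared error

First-order conditions    $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta} = N^{-1}\sum_{i=1}^{N}\partial q(y_i, \mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}$

Consistency               Is plim $Q_N(\boldsymbol{\theta})$ maximized at $\boldsymbol{\theta} = \boldsymbol{\theta}_0$?

Consistency (informal)    Does $\mathrm{E}[\partial q(y_i, \mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}] = \mathbf{0}$?

Limit distribution        $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}]$
                          $\mathbf{A}_0 = \mathrm{plim}\, N^{-1}\sum_{i=1}^{N}\partial^2 q_i(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}_0}$
                          $\mathbf{B}_0 = \mathrm{plim}\, N^{-1}\sum_{i=1}^{N}\partial q_i/\partial\boldsymbol{\theta} \times \partial q_i/\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}_0}$

Asymptotic distribution   $\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\theta}_0, N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}]$
                          $\hat{\mathbf{A}} = N^{-1}\sum_{i=1}^{N}\partial^2 q_i(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'|_{\hat{\boldsymbol{\theta}}}$
                          $\hat{\mathbf{B}} = N^{-1}\sum_{i=1}^{N}\partial q_i/\partial\boldsymbol{\theta} \times \partial q_i/\partial\boldsymbol{\theta}'|_{\hat{\boldsymbol{\theta}}}$

(a) The limit distribution variance and asymptotic variance estimate are robust sandwich
forms that assume independence over i. See Section 5.5.2 for other variance estimates.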


The default output for many econometrics packages instead often uses a simpler estimate $\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = -N^{-1}\hat{\mathbf{A}}^{-1}$ that is only valid in some special cases. See Section 5.5 for further discussion, including various ways to estimate $\mathbf{A}_0$ and $\mathbf{B}_0$ and then perform hypothesis tests.

The  two  leading  examples  of  m-estimators  are  the  ML  and  the  NLS  estimators. 
Formal  results  for  these  estimators  are  given  in,  respectively,  Propositions  5.5  and  5.6. 
Simpler  representations  of  the  asymptotic  distributions  of  these  estimators  are  given 
in,  respectively,  (5.48)  and  (5.77). 

Poisson  ML  Example 

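Like other ML estimators, the Poisson ML estimator is consistent if the density is correctly specified. However, applying (5.25) from Section 5.3.7 to (5.3) reveals that the essential condition for consistency is actually the weaker condition that $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$, that is, correct specification of the mean. Similar robustness of the ML estimator to partial misspecification of the distribution holds for some other special cases detailed in Section 5.7.

For the Poisson ML estimator $\partial q(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = (y - \exp(\mathbf{x}'\boldsymbol{\beta}))\mathbf{x}$, leading to

$$\mathbf{A}_0 = -\mathrm{plim}\, N^{-1}\sum_i \exp(\mathbf{x}_i'\boldsymbol{\beta}_0)\mathbf{x}_i\mathbf{x}_i'$$

and

$$\mathbf{B}_0 = \mathrm{plim}\, N^{-1}\sum_i \mathrm{V}[y_i|\mathbf{x}_i]\mathbf{x}_i\mathbf{x}_i'.$$

Then $\hat{\boldsymbol{\beta}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\beta}_0,\ N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}]$, where $\hat{\mathbf{A}} = -N^{-1}\sum_i \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})\mathbf{x}_i\mathbf{x}_i'$ and $\hat{\mathbf{B}} = N^{-1}\sum_i (y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}))^2\mathbf{x}_i\mathbf{x}_i'$.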

Table 5.2. Marginal Effect: Three Different Estimates

Formula                                                                  Description

$N^{-1}\sum_i \partial\mathrm{E}[y_i|\mathbf{x}_i]/\partial\mathbf{x}_i$          Average response of all individuals
$\partial\mathrm{E}[y|\mathbf{x}]/\partial\mathbf{x}|_{\bar{\mathbf{x}}}$         Response of the average individual
$\partial\mathrm{E}[y|\mathbf{x}]/\partial\mathbf{x}|_{\mathbf{x}^*}$             Response of a representative individual with $\mathbf{x} = \mathbf{x}^*$

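If the data are actually Poisson distributed, then $\mathrm{V}[y|\mathbf{x}] = \mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$, leading to possible simplification since $\mathbf{A}_0 = -\mathbf{B}_0$ so that $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1} = -\mathbf{A}_0^{-1}$. However, in most applications with count data $\mathrm{V}[y|\mathbf{x}] > \mathrm{E}[y|\mathbf{x}]$, so it is best not to impose this restriction.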
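
The two variance estimates can be compared numerically. The following Python sketch is an illustration only: it assumes a single scalar regressor with no intercept (so that the estimates of A and B are scalars), uses a small made-up data set, and the function name `poisson_ml_scalar` and the starting value are hypothetical choices rather than anything given in the text. It solves the Poisson first-order condition by Newton's method and then computes the robust sandwich variance alongside the simpler estimate that is valid only when the conditional variance equals the conditional mean.

```python
import math

def poisson_ml_scalar(y, x, beta=0.0, tol=1e-12, max_iter=100):
    """Solve sum_i (y_i - exp(x_i*beta)) * x_i = 0 by Newton's method."""
    for _ in range(max_iter):
        score = sum((yi - math.exp(xi * beta)) * xi for yi, xi in zip(y, x))
        hessian = -sum(math.exp(xi * beta) * xi * xi for xi in x)
        step = score / hessian
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Hypothetical count data, for illustration only.
x = [0.2, 0.5, 0.8, 1.0, 1.3, 1.5, 1.8, 2.0]
y = [0, 2, 1, 4, 2, 7, 3, 12]
N = len(y)

beta_hat = poisson_ml_scalar(y, x)

# A_hat = -N^{-1} sum_i exp(x_i*b) x_i^2
A_hat = -sum(math.exp(xi * beta_hat) * xi ** 2 for xi in x) / N
# B_hat = N^{-1} sum_i (y_i - exp(x_i*b))^2 x_i^2
B_hat = sum((yi - math.exp(xi * beta_hat)) ** 2 * xi ** 2
            for yi, xi in zip(y, x)) / N

var_sandwich = B_hat / (A_hat ** 2) / N   # N^{-1} A^{-1} B A^{-1}
var_simple = -1.0 / (A_hat * N)           # -N^{-1} A^{-1}, needs V[y|x] = E[y|x]

print(beta_hat, math.sqrt(var_sandwich), math.sqrt(var_simple))
```

The two standard errors coincide only if the squared residuals average out to the fitted means, which is the Poisson variance restriction discussed above.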


5.2.4.  Coefficient  Interpretation  in  Nonlinear  Regression 

An  important  goal  of  estimation  is  often  prediction,  rather  than  testing  the  statistical 
significance  of  regressors. 


Marginal  Effects 

Interest  often  lies  in  measuring  marginal  effects,  the  change  in  the  conditional  mean 
of  y  when  regressors  x  change  by  one  unit. 

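For the linear regression model, $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$ implies $\partial\mathrm{E}[y|\mathbf{x}]/\partial\mathbf{x} = \boldsymbol{\beta}$ so that the coefficient has a direct interpretation as the marginal effect. For nonlinear regression models, this interpretation is no longer possible. For example, if $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$, then $\partial\mathrm{E}[y|\mathbf{x}]/\partial\mathbf{x} = \exp(\mathbf{x}'\boldsymbol{\beta})\boldsymbol{\beta}$ is a function of both parameters and regressors, and the size of the marginal effect depends on $\mathbf{x}$ in addition to $\boldsymbol{\beta}$.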

General Regression Function

For a general regression function

$$\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta}),$$

the marginal effect varies with the evaluation value of $\mathbf{x}$.

It is customary to present one of the estimates of the marginal effect given in Table 5.2. The first estimate averages the marginal effects for all individuals. The second estimate evaluates the marginal effect at $\mathbf{x} = \bar{\mathbf{x}}$. The third estimate evaluates at specific characteristics $\mathbf{x} = \mathbf{x}^*$. For example, $\mathbf{x}^*$ may represent a person who is female with 12 years of schooling and so on. More than one representative individual might be considered.

These three measures differ in nonlinear models, whereas in the linear model they all equal $\boldsymbol{\beta}$. Even the sign of the effect may be unrelated to the sign of the parameter, with $\partial\mathrm{E}[y|\mathbf{x}]/\partial x_j$ positive for some values of $\mathbf{x}$ and negative for other values of $\mathbf{x}$. Considerable care must be taken in interpreting coefficients in nonlinear models.

Computer programs and applied studies often report the second of these measures. This can be useful in getting a sense for the magnitude of the marginal effect, but policy interest usually lies in the overall effect, the first measure, or the effect on a representative individual or group, the third measure. The first measure tends to change relatively little across different choices of functional form $g(\cdot)$, whereas the other two measures can change considerably. One can also present the full distribution of the marginal effects using a histogram or nonparametric density estimate.
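
To make the three measures concrete, the following Python sketch computes them for an exponential conditional mean with a single scalar regressor. It is an illustration under stated assumptions: the regressor values, the coefficient 0.5, the representative value 1.0, and the function name `marginal_effects_exp` are all hypothetical.

```python
import math

def marginal_effects_exp(x_values, beta, x_star):
    """Three marginal-effect summaries for E[y|x] = exp(x*beta), scalar x."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    # (1) average response of all individuals: N^{-1} sum_i exp(x_i*beta)*beta
    average_effect = sum(math.exp(xi * beta) * beta for xi in x_values) / n
    # (2) response of the average individual: evaluate at the sample mean of x
    effect_at_mean = math.exp(x_bar * beta) * beta
    # (3) response of a representative individual with x = x_star
    effect_at_star = math.exp(x_star * beta) * beta
    return average_effect, effect_at_mean, effect_at_star

x_values = [0.0, 0.5, 1.0, 2.0, 4.0]   # hypothetical regressor values
ame, mem, mer = marginal_effects_exp(x_values, beta=0.5, x_star=1.0)
print(ame, mem, mer)
```

Because the exponential function is convex, the average of the individual effects exceeds the effect evaluated at the mean whenever the coefficient is positive and the regressor varies, so the first two measures differ here even though the model and the estimate are the same.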


Single-Index  Models 

Direct interpretation of regression coefficients is possible for single-index models that specify

$$\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}'\boldsymbol{\beta}), \tag{5.11}$$

so that the data and parameters enter the nonlinear mean function $g(\cdot)$ by way of the single index $\mathbf{x}'\boldsymbol{\beta}$. Then nonlinearity is of the mild form that the mean is a nonlinear function of a linear combination of the regressors and parameters. For single-index models the effect on the conditional mean of a change in the $j$th regressor using calculus methods is

$$\frac{\partial\mathrm{E}[y|\mathbf{x}]}{\partial x_j} = g'(\mathbf{x}'\boldsymbol{\beta})\beta_j,$$

where $g'(z) = \partial g(z)/\partial z$. It follows that the relative effects of changes in regressors are given by the ratio of the coefficients since

$$\frac{\partial\mathrm{E}[y|\mathbf{x}]/\partial x_j}{\partial\mathrm{E}[y|\mathbf{x}]/\partial x_k} = \frac{\beta_j}{\beta_k},$$

because the common factor $g'(\mathbf{x}'\boldsymbol{\beta})$ cancels. Thus if $\beta_j$ is two times $\beta_k$ then a one-unit change in $x_j$ has twice the effect of a one-unit change in $x_k$. If $g(\cdot)$ is additionally monotonic then it follows that the signs of the coefficients give the signs of the effects, for all possible $\mathbf{x}$.
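
The cancellation of the common factor can be checked numerically. The sketch below assumes, purely for illustration, that the function g is the logistic cdf and uses hypothetical coefficients and regressor values; the function names are not from the text. The ratio of the first two marginal effects is compared with the ratio of the first two coefficients at two different evaluation points.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effects_logit(x, beta):
    """Calculus marginal effects g'(x'beta)*beta_j with g the logistic cdf."""
    index = sum(xj * bj for xj, bj in zip(x, beta))
    g = logistic_cdf(index)
    g_prime = g * (1.0 - g)          # derivative of the logistic cdf
    return [g_prime * bj for bj in beta]

beta = [0.4, 0.2, -1.0]              # hypothetical coefficients
ratios = []
for x in ([1.0, 0.0, 0.5], [1.0, 3.0, -2.0]):
    me = marginal_effects_logit(x, beta)
    ratios.append(me[0] / me[1])
    print(me, me[0] / me[1], beta[0] / beta[1])
```

The individual marginal effects change with the evaluation point, but their ratio stays at the coefficient ratio, and each effect has the sign of its coefficient because the logistic cdf is monotonic.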

Single-index models are advantageous owing to their simple interpretation. Many standard nonlinear models such as logit, probit, and Tobit are of single-index form. Moreover, some choices of function $g(\cdot)$ permit additional interpretation, notably the exponential function considered later in this section and the logistic cdf analyzed in Section 14.3.4.


Finite-Difference  Method 


We have emphasized the use of calculus methods. The finite-difference method instead computes the marginal effect by comparing the conditional mean when $x_j$ is increased by one unit with the value before the increase. Thus

$$\frac{\Delta\mathrm{E}[y|\mathbf{x}]}{\Delta x_j} = g(\mathbf{x} + \mathbf{e}_j, \boldsymbol{\beta}) - g(\mathbf{x}, \boldsymbol{\beta}),$$

where $\mathbf{e}_j$ is a vector with $j$th entry one and other entries zero.

For the linear model finite-difference and calculus methods lead to the same estimated effects, since $\Delta\mathrm{E}[y|\mathbf{x}]/\Delta x_j = (\mathbf{x}'\boldsymbol{\beta} + \beta_j) - \mathbf{x}'\boldsymbol{\beta} = \beta_j$. For nonlinear models, however, the two approaches give different estimates of the marginal effect, unless the change in $x_j$ is infinitesimally small.

Often calculus methods are used for continuous regressors and finite-difference methods are used for integer-valued regressors, such as a (0, 1) indicator variable.


Exponential  Conditional  Mean 

As an example, consider coefficient interpretation for an exponential conditional mean function, so that $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$. Many count and duration models use the exponential form.

A little algebra yields $\partial\mathrm{E}[y|\mathbf{x}]/\partial x_j = \mathrm{E}[y|\mathbf{x}] \times \beta_j$. So the parameters can be interpreted as semi-elasticities, with a one-unit change in $x_j$ changing the conditional mean by the proportion $\beta_j$. For example, if $\beta_j = 0.2$ then a one-unit change in $x_j$ is predicted to lead to a proportionate increase of 0.2 in $\mathrm{E}[y|\mathbf{x}]$, that is, an increase of 20%.

If instead the finite-difference method is used, the marginal effect is computed as $\Delta\mathrm{E}[y|\mathbf{x}]/\Delta x_j = \exp(\mathbf{x}'\boldsymbol{\beta} + \beta_j) - \exp(\mathbf{x}'\boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})(e^{\beta_j} - 1)$. This differs from the calculus result, unless $\beta_j$ is small so that $e^{\beta_j} \simeq 1 + \beta_j$. For example, if $\beta_j = 0.2$ the increase is 22.14% rather than 20%.
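
The gap between the two methods can be tabulated for a few coefficient values. The short Python sketch below is illustrative only; the function name `proportionate_effects` and the chosen coefficient values are assumptions. It reports each effect as a proportion of the conditional mean, so the regressors drop out.

```python
import math

def proportionate_effects(beta_j):
    """Effect of a one-unit change in x_j, as a proportion of E[y|x] = exp(x'beta)."""
    calculus = beta_j                             # (dE[y|x]/dx_j) / E[y|x]
    finite_difference = math.exp(beta_j) - 1.0    # (Delta E[y|x]/Delta x_j) / E[y|x]
    return calculus, finite_difference

for b in (0.01, 0.2, 1.0):
    calc, fd = proportionate_effects(b)
    print(b, round(100 * calc, 2), round(100 * fd, 2))
```

For a coefficient of 0.2 the finite-difference figure is about 22.14%, matching the value quoted above, and the two methods drift further apart as the coefficient grows.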


5.3.  Extremum  Estimators 

This section is intended for use in an advanced graduate course in microeconometrics. It presents the key results on consistency and asymptotic normality of extremum estimators, a very general class of estimators that minimize or maximize an objective function. The presentation is very condensed. A more complete understanding requires an advanced treatment such as that in Amemiya (1985), the basis of the treatment here, or in Newey and McFadden (1994).


5.3.1.  Extremum  Estimators 

For cross-section analysis of a single dependent variable the sample is one of $N$ observations, $\{(y_i, \mathbf{x}_i),\ i = 1, \ldots, N\}$, on a dependent variable $y_i$ and a column vector $\mathbf{x}_i$ of regressors. In matrix notation the sample is $(\mathbf{y}, \mathbf{X})$, where $\mathbf{y}$ is an $N \times 1$ vector with $i$th entry $y_i$ and $\mathbf{X}$ is a matrix with $i$th row $\mathbf{x}_i'$, as defined more completely in Section 1.6.

Interest lies in estimating the $q \times 1$ parameter vector $\boldsymbol{\theta} = [\theta_1 \ldots \theta_q]'$. The value $\boldsymbol{\theta}_0$, termed the true parameter value, is the particular value of $\boldsymbol{\theta}$ in the process that generated the data, called the data-generating process.

We consider estimators $\hat{\boldsymbol{\theta}}$ that maximize over $\boldsymbol{\theta} \in \Theta$ the stochastic objective function $Q_N(\boldsymbol{\theta}) = Q_N(\mathbf{y}, \mathbf{X}, \boldsymbol{\theta})$, where for notational simplicity the dependence of $Q_N(\boldsymbol{\theta})$ on the data is indicated only via the subscript $N$. Such estimators are called extremum estimators, since they solve a maximization or minimization problem.

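The extremum estimator may be a global maximum, so

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta} \in \Theta} Q_N(\boldsymbol{\theta}). \tag{5.12}$$

Usually the extremum estimator is a local maximum, computed as the solution to the associated first-order conditions

$$\left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}, \tag{5.13}$$

where $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ is a $q \times 1$ column vector with $k$th entry $\partial Q_N(\boldsymbol{\theta})/\partial\theta_k$. The local maximum is emphasized because it is the local maximum that may be asymptotically normally distributed. The local and global maxima coincide if $Q_N(\boldsymbol{\theta})$ is globally concave.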
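
The relation between the global maximum (5.12) and the first-order conditions (5.13) can be seen in a case simple enough to solve by hand. The Python sketch below assumes an intercept-only Poisson objective, for which the maximizer is the log of the sample mean; the data, grid, and function names are hypothetical. It finds the maximizer once by grid search over the objective and once by Newton iterations on the first derivative.

```python
import math

def q_n(beta, y):
    """Q_N(beta) = N^{-1} sum_i (-exp(beta) + y_i*beta - ln y_i!)."""
    return sum(-math.exp(beta) + yi * beta - math.lgamma(yi + 1)
               for yi in y) / len(y)

def solve_foc(y, beta=0.0, tol=1e-12, max_iter=100):
    """Newton iterations on dQ_N/dbeta = -exp(beta) + ybar = 0."""
    y_bar = sum(y) / len(y)
    for _ in range(max_iter):
        gradient = -math.exp(beta) + y_bar
        hessian = -math.exp(beta)
        step = gradient / hessian
        beta -= step
        if abs(step) < tol:
            break
    return beta

y = [0, 1, 1, 2, 3, 5]                              # hypothetical counts
grid = [i / 1000.0 for i in range(-2000, 2001)]     # crude parameter space
beta_global = max(grid, key=lambda b: q_n(b, y))    # grid version of (5.12)
beta_local = solve_foc(y)                           # solution of (5.13)
print(beta_global, beta_local, math.log(sum(y) / len(y)))
```

Since this objective is globally concave, the two answers agree up to the grid spacing. With an objective that has several local maxima, the Newton solution would depend on the starting value while the grid search would not.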

There are two leading examples of extremum estimators. For m-estimators considered in this chapter, notably ML and NLS estimators, $Q_N(\boldsymbol{\theta})$ is a sample average such as the average of squared residuals. For the generalized method of moments estimator (see Section 6.3) $Q_N(\boldsymbol{\theta})$ is a quadratic form in sample averages.

For concreteness the discussion focuses on single-equation cross-section regression. But the results are quite general and apply to any estimator based on optimization that satisfies properties given in this section. In particular there is no restriction to a scalar dependent variable and several authors use the notation $\mathbf{z}_i$ in place of $(y_i, \mathbf{x}_i)$. Then $Q_N(\boldsymbol{\theta})$ equals $Q_N(\mathbf{Z}, \boldsymbol{\theta})$ rather than $Q_N(\mathbf{y}, \mathbf{X}, \boldsymbol{\theta})$.


5.3.2.  Formal  Consistency  Theorems 

We first consider parameter identification, introduced in Section 2.5. Intuitively the parameter $\boldsymbol{\theta}_0$ is identified if the distribution of the data, or feature of the distribution of interest, is determined by $\boldsymbol{\theta}_0$ whereas any other value of $\boldsymbol{\theta}$ leads to a different distribution. For example, in linear regression we required $\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}_0$ and $\mathbf{X}\boldsymbol{\beta}^{(1)} = \mathbf{X}\boldsymbol{\beta}^{(2)}$ if and only if $\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)}$.

An estimation procedure may not identify $\boldsymbol{\theta}_0$. For example, this is the case if the estimation procedure omits some relevant regressors. We say that an estimation method identifies $\boldsymbol{\theta}_0$ if the probability limit of the objective function, taken with respect to the dgp with parameter $\boldsymbol{\theta} = \boldsymbol{\theta}_0$, is maximized uniquely at $\boldsymbol{\theta} = \boldsymbol{\theta}_0$. This identification condition is an asymptotic one. Practical estimation problems that can arise in a finite sample are discussed in Chapter 10.

Consistency is established in the following manner. As $N \to \infty$ the stochastic objective function $Q_N(\boldsymbol{\theta})$, an average in the case of m-estimation, may converge in probability to a limit function, denoted $Q_0(\boldsymbol{\theta})$, that in the simplest case is nonstochastic. The corresponding maxima (global or local) of $Q_N(\boldsymbol{\theta})$ and $Q_0(\boldsymbol{\theta})$ should then occur for values of $\boldsymbol{\theta}$ close to each other. Since the maximum of $Q_N(\boldsymbol{\theta})$ is $\hat{\boldsymbol{\theta}}$ by definition, it follows that $\hat{\boldsymbol{\theta}}$ converges in probability to $\boldsymbol{\theta}_0$ provided $\boldsymbol{\theta}_0$ maximizes $Q_0(\boldsymbol{\theta})$.

Clearly, consistency and identification are closely related, and Amemiya (1985, p. 230) states that a simple approach is to view identification to mean existence of a consistent estimator. For further discussion see Newey and McFadden (1994, p. 2124) and Deistler and Seifert (1978).

Key applications of this approach include Jennrich (1969) and Amemiya (1973). Amemiya (1985) and Newey and McFadden (1994) present quite general theorems. These theorems require several assumptions, including smoothness (continuity) and existence of necessary derivatives of the objective function, assumptions on the dgp to ensure convergence of $Q_N(\boldsymbol{\theta})$ to $Q_0(\boldsymbol{\theta})$, and maximization of $Q_0(\boldsymbol{\theta})$ at $\boldsymbol{\theta} = \boldsymbol{\theta}_0$. Different consistency theorems use slightly different assumptions.

We present two consistency theorems due to Amemiya (1985), one for a global maximum and one for a local maximum. The notation in Amemiya's theorems has been modified as Amemiya (1985) defines the objective function without the normalization $1/N$ present in, for example, (5.4).

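Theorem 5.1 (Consistency of Global Maximum) (Amemiya, 1985, Theorem 4.1.1): Make the following assumptions:

(i) The parameter space $\Theta$ is a compact subset of $R^q$.

(ii) The objective function $Q_N(\boldsymbol{\theta})$ is a measurable function of the data for all $\boldsymbol{\theta} \in \Theta$, and $Q_N(\boldsymbol{\theta})$ is continuous in $\boldsymbol{\theta} \in \Theta$.

(iii) $Q_N(\boldsymbol{\theta})$ converges uniformly in probability to a nonstochastic function $Q_0(\boldsymbol{\theta})$, and $Q_0(\boldsymbol{\theta})$ attains a unique global maximum at $\boldsymbol{\theta}_0$.

Then the estimator $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta} \in \Theta} Q_N(\boldsymbol{\theta})$ is consistent for $\boldsymbol{\theta}_0$, that is, $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}_0$.

Uniform convergence in probability of $Q_N(\boldsymbol{\theta})$ to

$$Q_0(\boldsymbol{\theta}) = \mathrm{plim}\, Q_N(\boldsymbol{\theta}) \tag{5.14}$$

in condition (iii) means that $\sup_{\boldsymbol{\theta} \in \Theta} |Q_N(\boldsymbol{\theta}) - Q_0(\boldsymbol{\theta})| \xrightarrow{p} 0$.

For a local maximum, first derivatives need to exist, but one need then only consider the behavior of $Q_N(\boldsymbol{\theta})$ and its derivative in the neighborhood of $\boldsymbol{\theta}_0$.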

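Theorem 5.2 (Consistency of Local Maximum) (Amemiya, 1985, Theorem 4.1.2): Make the following assumptions:

(i) The parameter space $\Theta$ is an open subset of $R^q$.

(ii) $Q_N(\boldsymbol{\theta})$ is a measurable function of the data for all $\boldsymbol{\theta} \in \Theta$, and $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ exists and is continuous in an open neighborhood of $\boldsymbol{\theta}_0$.

(iii) The objective function $Q_N(\boldsymbol{\theta})$ converges uniformly in probability to $Q_0(\boldsymbol{\theta})$ in an open neighborhood of $\boldsymbol{\theta}_0$, and $Q_0(\boldsymbol{\theta})$ attains a unique local maximum at $\boldsymbol{\theta}_0$.

Then one of the solutions to $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta} = \mathbf{0}$ is consistent for $\boldsymbol{\theta}_0$.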

An example of use of Theorem 5.2 is given later in Section 5.3.4.

Condition (i) in Theorem 5.1 permits a global maximum to be at the boundary of the parameter space, whereas in Theorem 5.2 a local maximum has to be in the interior of the parameter space. Condition (ii) in Theorem 5.2 also implies continuity of $Q_N(\boldsymbol{\theta})$ in the open neighborhood of $\boldsymbol{\theta}_0$, where a neighborhood $N(\boldsymbol{\theta}_0)$ of $\boldsymbol{\theta}_0$ is open if and only if there exists a ball with center $\boldsymbol{\theta}_0$ entirely contained in $N(\boldsymbol{\theta}_0)$. In both theorems condition (iii) is the essential condition. The maximum, global or local, of $Q_0(\boldsymbol{\theta})$ must occur at $\boldsymbol{\theta} = \boldsymbol{\theta}_0$. The second part of (iii) provides the identification condition that $\boldsymbol{\theta}_0$ has a meaningful interpretation and is unique.

For a local maximum, analysis is straightforward if there is only one local maximum. Then $\hat{\boldsymbol{\theta}}$ is uniquely defined by $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}$. When there is more than one local maximum, the theorem simply says that one of the local maxima is consistent, but no guidance is given as to which one is consistent. It is best in such cases to consider the global maximum and apply Theorem 5.1. See Newey and McFadden (1994, p. 2117) for a discussion.

An important distinction is made between model specification, reflected in the choice of objective function $Q_N(\boldsymbol{\theta})$, and the actual dgp of $(\mathbf{y}, \mathbf{X})$ used in obtaining $Q_0(\boldsymbol{\theta})$ in (5.14). For some dgps an estimator may be consistent, whereas for other dgps an estimator may be inconsistent. In some cases, such as the Poisson ML and OLS estimators, consistency arises under a wide range of dgps provided the conditional mean is correctly specified. In other cases consistency requires stronger assumptions on the dgp such as correct specification of the density.


5.3.3.  Asymptotic  Normality 

Results on asymptotic normality are usually restricted to the local maximum of $Q_N(\boldsymbol{\theta})$. Then $\hat{\boldsymbol{\theta}}$ solves (5.13), which in general is nonlinear in $\hat{\boldsymbol{\theta}}$ and has no explicit solution for $\hat{\boldsymbol{\theta}}$. Instead, we replace the left-hand side of this equation by a linear function of $\hat{\boldsymbol{\theta}}$, by use of a Taylor series expansion, and then solve for $\hat{\boldsymbol{\theta}}$.

The most often used version of Taylor's theorem is an approximation with a remainder term. Here we instead consider an exact first-order Taylor expansion. For the differentiable function $f(\cdot)$ there always exists a point $x^+$ between $x$ and $x_0$ such that

$$f(x) = f(x_0) + f'(x^+)(x - x_0),$$

where $f'(x) = \partial f(x)/\partial x$ is the derivative of $f(x)$. This result is also known as the mean value theorem.
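
A small numerical check may help fix ideas about the intermediate point. The Python sketch below assumes the particular function f(x) = exp(x), chosen only because its derivative can be inverted in closed form; the function name `mean_value_point_exp` and the evaluation points are hypothetical. It recovers the intermediate point and confirms that the exact expansion holds with no remainder.

```python
import math

def mean_value_point_exp(x0, x):
    """Return x_plus such that exp(x) = exp(x0) + exp(x_plus) * (x - x0)."""
    slope = (math.exp(x) - math.exp(x0)) / (x - x0)
    return math.log(slope)      # inverts f'(x_plus) = exp(x_plus)

x0, x = 0.0, 1.0
x_plus = mean_value_point_exp(x0, x)
lhs = math.exp(x)
rhs = math.exp(x0) + math.exp(x_plus) * (x - x0)
print(x_plus, lhs, rhs)
```

The intermediate point lies strictly between the two arguments and depends on both of them, which is why in the estimation setting it is unknown whenever the true parameter is unknown.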

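Application to the current setting requires several changes. The scalar function $f(\cdot)$ is replaced by a vector function $\mathbf{f}(\cdot)$ and the scalar arguments $x$, $x_0$, and $x^+$ are replaced by the vectors $\hat{\boldsymbol{\theta}}$, $\boldsymbol{\theta}_0$, and $\boldsymbol{\theta}^+$. Then

$$\mathbf{f}(\hat{\boldsymbol{\theta}}) = \mathbf{f}(\boldsymbol{\theta}_0) + \left.\frac{\partial\mathbf{f}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^+}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0), \tag{5.15}$$

where $\partial\mathbf{f}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'$ is a matrix, for some unknown $\boldsymbol{\theta}^+$ between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_0$, and formally $\boldsymbol{\theta}^+$ differs for each row of this matrix (see Newey and McFadden, 1994, p. 2141). For the local extremum estimator the function $\mathbf{f}(\boldsymbol{\theta}) = \partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ is already a first derivative. Then an exact first-order Taylor series expansion around $\boldsymbol{\theta}_0$ yields

$$\left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\hat{\boldsymbol{\theta}}} = \left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0} + \left.\frac{\partial^2 Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^+}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0), \tag{5.16}$$

where $\partial^2 Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'$ is a $q \times q$ matrix with $(j, k)$th entry $\partial^2 Q_N(\boldsymbol{\theta})/\partial\theta_j\partial\theta_k$, and $\boldsymbol{\theta}^+$ is a point between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_0$.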

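The first-order conditions set the left-hand side of (5.16) to zero. Setting the right-hand side to $\mathbf{0}$ and solving for $(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ yields

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = -\left(\left.\frac{\partial^2 Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^+}\right)^{-1}\sqrt{N}\left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0}, \tag{5.17}$$

where we rescale by $\sqrt{N}$ to ensure a nondegenerate limit distribution (discussed further in the following).

Result (5.17) provides a solution for $\hat{\boldsymbol{\theta}}$. It is of no use for numerical computation of $\hat{\boldsymbol{\theta}}$, since it depends on $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}^+$, both of which are unknown, but it is fine for theoretical analysis. In particular, if it has been established that $\hat{\boldsymbol{\theta}}$ is consistent for $\boldsymbol{\theta}_0$ then the unknown $\boldsymbol{\theta}^+$ converges in probability to $\boldsymbol{\theta}_0$, because it lies between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_0$ and by consistency $\hat{\boldsymbol{\theta}}$ converges in probability to $\boldsymbol{\theta}_0$.

The result (5.17) expresses $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ in a form similar to that used to obtain the limit distribution of the OLS estimator (see Section 5.2.3). All we need do is assume a probability limit for the first term on the right-hand side of (5.17) and a limit normal distribution for the second term.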

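This leads to the following theorem, from Amemiya (1985), for an extremum estimator satisfying a local maximum. Again note that Amemiya (1985) defines the objective function without the normalization $1/N$. Also, Amemiya defines $\mathbf{A}_0$ and $\mathbf{B}_0$ in terms of $\lim \mathrm{E}$ rather than plim.

Theorem 5.3 (Limit Distribution of Local Maximum) (Amemiya, 1985, Theorem 4.1.3): In addition to the assumptions of the preceding theorem for consistency of the local maximum make the following assumptions:

(i) $\partial^2 Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'$ exists and is continuous in an open convex neighborhood of $\boldsymbol{\theta}_0$.

(ii) $\partial^2 Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}^+}$ converges in probability to the finite nonsingular matrix

$$\mathbf{A}_0 = \mathrm{plim} \left.\frac{\partial^2 Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0} \tag{5.18}$$

for any sequence $\boldsymbol{\theta}^+$ such that $\boldsymbol{\theta}^+ \xrightarrow{p} \boldsymbol{\theta}_0$.

(iii) $\sqrt{N}\,\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0} \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{B}_0]$, where

$$\mathbf{B}_0 = \mathrm{plim}\left[N \left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} \times \frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right]. \tag{5.19}$$

Then the limit distribution of the extremum estimator is

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0},\ \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}], \tag{5.20}$$

where the estimator $\hat{\boldsymbol{\theta}}$ is the consistent solution to $\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta} = \mathbf{0}$.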

The proof follows directly from the Limit Normal Product Rule (Theorem A.17) applied to (5.17). Note that the proof assumes that consistency of $\hat{\boldsymbol{\theta}}$ has already been established. The expressions for $\mathbf{A}_0$ and $\mathbf{B}_0$ given in Table 5.1 are specializations to the case $Q_N(\boldsymbol{\theta}) = N^{-1}\sum_i q_i(\boldsymbol{\theta})$ with independence over $i$.

The probability limits in (5.18) and (5.19) are obtained with respect to the dgp for $(\mathbf{y}, \mathbf{X})$. In some applications the regressors are assumed to be nonstochastic and the expectation is with respect to $\mathbf{y}$ only. In other cases the regressors are treated as stochastic and the expectations are then with respect to both $\mathbf{y}$ and $\mathbf{X}$.

5.3.4.  Poisson  ML  Estimator  Asymptotic  Properties  Example 

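We formally prove consistency and asymptotic normality of the Poisson ML estimator, under exogenous stratified sampling with stochastic regressors so that $(y_i, \mathbf{x}_i)$ are inid, without necessarily assuming that $y_i$ is Poisson distributed.

The key step to prove consistency is to obtain $Q_0(\boldsymbol{\beta}) = \mathrm{plim}\, Q_N(\boldsymbol{\beta})$ and verify that $Q_0(\boldsymbol{\beta})$ attains a maximum at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$. For $Q_N(\boldsymbol{\beta})$ defined in (5.1), we have

$$\begin{aligned}
Q_0(\boldsymbol{\beta}) &= \mathrm{plim}\, N^{-1}\sum_i \left\{-e^{\mathbf{x}_i'\boldsymbol{\beta}} + y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\right\} \\
&= \lim N^{-1}\sum_i \left\{-\mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}}\right] + \mathrm{E}\left[y_i\mathbf{x}_i'\boldsymbol{\beta}\right] - \mathrm{E}\left[\ln y_i!\right]\right\} \\
&= \lim N^{-1}\sum_i \left\{-\mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}}\right] + \mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}_0}\mathbf{x}_i'\boldsymbol{\beta}\right] - \mathrm{E}\left[\ln y_i!\right]\right\}.
\end{aligned}$$

The second equality assumes a law of large numbers can be applied to each term. Since $(y_i, \mathbf{x}_i)$ are inid, the Markov LLN (Theorem A.9) can be applied if each of the expected values given in the second line exists and additionally the corresponding $(1 + \delta)$th absolute moment exists for some $\delta > 0$ and the side condition given in Theorem A.9 is satisfied. For example, set $\delta = 1$ so that second moments are used. The third line requires the assumption that the dgp is such that $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$. The first two expectations in the third line are with respect to $\mathbf{x}$, which is stochastic. Note that $Q_0(\boldsymbol{\beta})$ depends on both $\boldsymbol{\beta}$ and $\boldsymbol{\beta}_0$. Differentiating with respect to $\boldsymbol{\beta}$, and assuming that limits, derivatives, and expectations can be interchanged, we get

$$\frac{\partial Q_0(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = -\lim N^{-1}\sum_i \mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}}\mathbf{x}_i\right] + \lim N^{-1}\sum_i \mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}_0}\mathbf{x}_i\right],$$

where the derivative of $\mathrm{E}[\ln y!]$ with respect to $\boldsymbol{\beta}$ is zero since $\mathrm{E}[\ln y!]$ will depend on $\boldsymbol{\beta}_0$, the true parameter value in the dgp, but not on $\boldsymbol{\beta}$. Clearly, $\partial Q_0(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = \mathbf{0}$ at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ and $\partial^2 Q_0(\boldsymbol{\beta})/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}' = -\lim N^{-1}\sum_i \mathrm{E}[\exp(\mathbf{x}_i'\boldsymbol{\beta})\mathbf{x}_i\mathbf{x}_i']$ is negative definite, so $Q_0(\boldsymbol{\beta})$ attains a local maximum at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ and the Poisson ML estimator is consistent by Theorem 5.2. Since here $Q_N(\boldsymbol{\beta})$ is globally concave the local maximum equals the global maximum and consistency can also be established using Theorem 5.1.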
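
The claim that the limit objective peaks at the true parameter can be checked numerically in a stylized case. The Python sketch below assumes a scalar regressor taking five equally likely values and a true parameter of 0.7; these values, the grid, and the function names `q0` and `dq0` are hypothetical. It evaluates the limit objective, dropping the term in ln y! because it does not involve the parameter, and uses only the conditional mean assumption, not the Poisson distribution.

```python
import math

X_SUPPORT = [-1.0, 0.0, 0.5, 1.0, 2.0]   # hypothetical equally likely values of x
BETA_0 = 0.7                             # hypothetical true parameter

def q0(beta):
    """Q_0(beta) up to the constant -E[ln y!], using E[y|x] = exp(x*beta_0)."""
    return sum(-math.exp(x * beta) + math.exp(x * BETA_0) * x * beta
               for x in X_SUPPORT) / len(X_SUPPORT)

def dq0(beta):
    """dQ_0/dbeta = -E[exp(x*beta) x] + E[exp(x*beta_0) x]."""
    return sum((-math.exp(x * beta) + math.exp(x * BETA_0)) * x
               for x in X_SUPPORT) / len(X_SUPPORT)

grid = [i / 100.0 for i in range(-200, 301)]
beta_max = max(grid, key=q0)
print(beta_max, dq0(BETA_0), dq0(0.5), dq0(0.9))
```

The derivative is zero at the true value, positive below it, and negative above it, which is the pattern expected from a strictly concave limit objective maximized at the true parameter.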

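For asymptotic normality of the Poisson ML estimator, the exact first-order Taylor series expansion of the Poisson ML estimator first-order conditions (5.3) yields

$$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) = -\left[-N^{-1}\sum_i e^{\mathbf{x}_i'\boldsymbol{\beta}^+}\mathbf{x}_i\mathbf{x}_i'\right]^{-1} N^{-1/2}\sum_i \left(y_i - e^{\mathbf{x}_i'\boldsymbol{\beta}_0}\right)\mathbf{x}_i, \tag{5.21}$$

for some unknown $\boldsymbol{\beta}^+$ between $\hat{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}_0$. Making sufficient assumptions on regressors $\mathbf{x}$ so that the Markov LLN can be applied to the first term, and using $\boldsymbol{\beta}^+ \xrightarrow{p} \boldsymbol{\beta}_0$ since $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}_0$, we have

$$-N^{-1}\sum_i e^{\mathbf{x}_i'\boldsymbol{\beta}^+}\mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbf{A}_0 = -\lim N^{-1}\sum_i \mathrm{E}\left[e^{\mathbf{x}_i'\boldsymbol{\beta}_0}\mathbf{x}_i\mathbf{x}_i'\right]. \tag{5.22}$$

For the second term in (5.21) begin by assuming scalar regressor $x$. Then $X = (y - \exp(x\beta_0))x$ has mean $\mathrm{E}[X] = 0$, as $\mathrm{E}[y|x] = \exp(x\beta_0)$ has already been assumed for consistency, and variance $\mathrm{V}[X] = \mathrm{E}[\mathrm{V}[y|x]x^2]$. The Liapounov CLT (Theorem A.15) can be applied if the side condition involving a $(2 + \delta)$th absolute moment of $(y - \exp(x\beta_0))x$ is satisfied. For this example with $y \geq 0$ it is sufficient to assume that the third moment of $y$ exists, that is, $\delta = 1$, and $x$ is bounded. Applying the CLT gives

$$\frac{\sum_i \left(y_i - e^{\beta_0 x_i}\right)x_i}{\sqrt{\sum_i \mathrm{E}\left[\mathrm{V}[y_i|x_i]x_i^2\right]}} \xrightarrow{d} \mathcal{N}[0, 1],$$

so

$$N^{-1/2}\sum_i \left(y_i - e^{\beta_0 x_i}\right)x_i \xrightarrow{d} \mathcal{N}\left[0,\ \lim N^{-1}\sum_i \mathrm{E}\left[\mathrm{V}[y_i|x_i]x_i^2\right]\right],$$

assuming the limit in the expression for the asymptotic variance exists. This result can be extended to the vector regressor case using the Cramer-Wold device (see Theorem A.16). Then

$$N^{-1/2}\sum_i \left(y_i - e^{\mathbf{x}_i'\boldsymbol{\beta}_0}\right)\mathbf{x}_i \xrightarrow{d} \mathcal{N}\left[\mathbf{0},\ \mathbf{B}_0 = \lim N^{-1}\sum_i \mathrm{E}\left[\mathrm{V}[y_i|\mathbf{x}_i]\mathbf{x}_i\mathbf{x}_i'\right]\right]. \tag{5.23}$$

Thus (5.21) yields $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0},\ \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}]$, where $\mathbf{A}_0$ is defined in (5.22) and $\mathbf{B}_0$ is defined in (5.23).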

Note that for this particular example $y|\mathbf{x}$ need not be Poisson distributed for the Poisson ML estimator to be consistent and asymptotically normal. The essential assumption for consistency of the Poisson ML estimator is that the dgp is such that $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$.

For asymptotic normality the essential assumption is that $\mathrm{V}[y|\mathbf{x}]$ exists, though additional assumptions on existence of higher moments are needed to permit use of LLN and CLT. If in fact $\mathrm{V}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$ then $\mathbf{A}_0 = -\mathbf{B}_0$ and more simply $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0},\ -\mathbf{A}_0^{-1}]$. The results for this ML example extend to the LEF class of densities defined in Section 5.7.3.


5.3.5.  Proofs  of  Consistency  and  Asymptotic  Normality 

The assumptions made in Theorems 5.1-5.3 are quite general and need not hold in every application. These assumptions need to be verified on a case-by-case basis, in a manner similar to the preceding Poisson ML estimator example. Here we sketch out details for m-estimators.

For consistency, the key step is to obtain the probability limit of $Q_N(\boldsymbol{\theta})$. This is done by application of an LLN because for an m-estimator $Q_N(\boldsymbol{\theta})$ is the average $N^{-1}\sum_i q_i(\boldsymbol{\theta})$. Different assumptions on the dgp lead to the use of different LLNs and more substantively to different expressions for $Q_0(\boldsymbol{\theta})$.

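Asymptotic normality requires assumptions on the dgp in addition to those required for consistency. Specifically, we need assumptions on the dgp to enable application of an LLN to obtain $\mathbf{A}_0$ and to enable application of a CLT to obtain $\mathbf{B}_0$.

For an m-estimator an LLN is likely to verify condition (ii) of Theorem 5.3 as each entry in the matrix $\partial^2 Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'$ is an average since $Q_N(\boldsymbol{\theta})$ is an average. A CLT is likely to yield condition (iii) of Theorem 5.3, since $\sqrt{N}\,\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}$ has mean $\mathbf{0}$ from the informal consistency condition (5.24) in Section 5.3.7 and finite variance $\mathrm{E}[N\,\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta} \times \partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}_0}]$.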

The  particular  CLT  and  LLN  used  to  obtain  the  limit  distribution  of  the  estimator 
vary  with  assumptions  about  the  dgp  for  (y,  X).  In  all  cases  the  dependent  variable  is 
stochastic.  However,  the  regressors  may  be  fixed  or  stochastic,  and  in  the  latter  case 
they  may  exhibit  time-series  dependence.  These  issues  have  already  been  considered 
for  OLS  in  Section  4.4.7. 

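The common microeconometrics assumption is that regressors are stochastic with independence across observations, which is reasonable for cross-section data from national surveys. For simple random sampling, the data $(y_i, \mathbf{x}_i)$ are iid and Kolmogorov LLN and Lindeberg-Levy CLT (Theorems A.8 and A.14) can be used. Furthermore, under simple random sampling (5.18) and (5.19) then simplify to

$$\mathbf{A}_0 = \mathrm{E}\left[\left.\frac{\partial^2 q(y, \mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right]$$

and

$$\mathbf{B}_0 = \mathrm{E}\left[\left.\frac{\partial q(y, \mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\frac{\partial q(y, \mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right],$$

where $(y, \mathbf{x})$ denotes a single observation and expectations are with respect to the joint distribution of $(y, \mathbf{x})$. This simpler notation is used in several texts.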

For stratified random sampling and for fixed regressors the data $(y_i, \mathbf{x}_i)$ are inid and Markov LLN and Liapounov CLT (Theorems A.9 and A.15) need to be used. These require moment assumptions additional to those made in the iid case. In the stochastic regressors case, expectations are with respect to the joint distribution of $(y, \mathbf{x})$, whereas in the fixed regressors case, such as in a controlled experiment where the level of $\mathbf{x}$ can be set, the expectations in (5.18) and (5.19) are with respect to $y$ only.

For time-series data the regressors are assumed to be stochastic, but they are also assumed to be dependent across observations, a necessary framework to accommodate lagged dependent variables. Hamilton (1994) focuses on this case, which is also studied extensively by White (2001a). The simplest treatments restrict the random variables $(y, \mathbf{x})$ to have stationary distribution. If instead the data are nonstationary with unit roots then rates of convergence may no longer be $\sqrt{N}$ and the limit distributions may be nonnormal.

Despite these important conceptual and theoretical differences about the stochastic
nature of $(y, \mathbf{x})$, however, for cross-section regression the eventual limit theorem is
usually of the general form given in Theorem 5.3.


5.3.6. Discussion

The form of the variance matrix in (5.20) is called the sandwich form, with $\mathbf{B}_0$
sandwiched between $\mathbf{A}_0^{-1}$ and $\mathbf{A}_0^{-1}$. The sandwich form, introduced in Section 4.4.4, will
be discussed in more detail in Section 5.5.2.

The asymptotic results can be extended to inconsistent estimators. Then $\boldsymbol{\theta}_0$ is
replaced by the pseudo-true value $\boldsymbol{\theta}^*$, defined to be that value of $\boldsymbol{\theta}$ that yields the local
maximum of $Q_0(\boldsymbol{\theta})$. This is considered in further detail for quasi-ML estimation in
Section 5.7.1. In most cases, however, the estimator is consistent and in later chapters
the subscript 0 is often dropped to simplify notation.

In the preceding results the objective function $Q_N(\boldsymbol{\theta})$ is initially defined with
normalization by $1/N$, the first derivative of $Q_N(\boldsymbol{\theta})$ is then normalized by $\sqrt{N}$, and the
second derivative is not normalized, leading to a $\sqrt{N}$-consistent estimator. In some
cases alternative normalizations may be needed, most notably time series with a
nonstationary trend.

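The results assume that $Q_N(\boldsymbol{\theta})$ is a continuously differentiable function. This
excludes some estimators such as least absolute deviations, for which
$Q_N(\boldsymbol{\theta}) = N^{-1}\sum_i |y_i - \mathbf{x}_i'\boldsymbol{\beta}|$. One way to proceed in this case is to obtain a differentiable
approximating function $Q_N^*(\boldsymbol{\theta})$ such that $Q_N^*(\boldsymbol{\theta}) - Q_N(\boldsymbol{\theta}) \overset{p}{\to} 0$ and apply the preceding
theorem to $Q_N^*(\boldsymbol{\theta})$.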
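As an illustration of such an approximating function, the sketch below replaces $|u|$ by $\sqrt{u^2 + \varepsilon^2}$, which is differentiable everywhere and differs from $|u|$ by at most $\varepsilon$. This particular smoother, the scalar-regressor setup, and all names are assumptions made for the example, not a construction taken from the text.

```python
import math
import random

def lad_objective(beta, y, x):
    """Q_N(beta) = N^{-1} sum_i |y_i - x_i * beta| (one regressor, no intercept)."""
    return sum(abs(yi - xi * beta) for yi, xi in zip(y, x)) / len(y)

def smoothed_lad_objective(beta, y, x, eps):
    """Differentiable approximation: |u| is replaced by sqrt(u^2 + eps^2)."""
    return sum(math.sqrt((yi - xi * beta) ** 2 + eps ** 2)
               for yi, xi in zip(y, x)) / len(y)

# Hypothetical data: y = 1.5 x + noise.
random.seed(0)
x = [random.uniform(1.0, 2.0) for _ in range(200)]
y = [1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Largest gap Q*_N - Q_N over a grid of beta values, for shrinking eps.
gaps = {}
for eps in (1e-1, 1e-2, 1e-3):
    gaps[eps] = max(
        smoothed_lad_objective(b / 10, y, x, eps) - lad_objective(b / 10, y, x)
        for b in range(0, 31)
    )
    print(eps, gaps[eps])
```

The gap is bounded by `eps` uniformly in `beta`, so letting `eps` shrink to zero as $N$ grows is one way to satisfy the requirement that the two objective functions differ by a term that vanishes in probability.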

The key component to obtaining the limit distribution is linearization using a Taylor
series expansion. Taylor series expansions can be a poor global approximation to a
function. They work well in the statistical application here as the approximation is
asymptotically a local one, since consistency implies that for large sample sizes $\hat{\boldsymbol{\theta}}$ is
close to the point of expansion $\boldsymbol{\theta}_0$. More refined asymptotic theory is possible using the
Edgeworth expansion (see Section 11.4.3). The bootstrap (see Chapter 11) is a method
to empirically implement an Edgeworth expansion.


5.3.7.  Informal  Approach  to  Consistency  of  an  m-Estimator 


For the practitioner the limit normal result of Theorem 5.3 is much easier to establish than
a formal proof of consistency using Theorem 5.1 or 5.2. Here we present an informal
approach to determining the nature and strength of distributional assumptions needed
for an m-estimator to be consistent.

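For an m-estimator that is a local maximum, the first-order conditions (5.4) imply
that $\hat{\boldsymbol{\theta}}$ is chosen so that the average of $\partial q_i(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\hat{\boldsymbol{\theta}}}$ equals zero. Intuitively, a necessary
condition for this to yield a consistent estimator for $\boldsymbol{\theta}_0$ is that in the limit the average
of $\partial q(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}$ goes to $\mathbf{0}$, or that

$$\operatorname{plim} \left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0} = \lim \mathrm{E}\left[\left.\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0}\right] = \mathbf{0}, \tag{5.24}$$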


where the first equality requires the assumption that a law of large numbers can be
applied and expectation in (5.24) is taken with respect to the population dgp for $(\mathbf{y}, \mathbf{X})$.
The limit is used as the equality need not be exact, provided any departure from zero
disappears as $N \to \infty$. For example, consistency should hold if the expectation equals
$1/N$. The condition (5.24) provides a very useful check for the practitioner. An informal
approach to consistency is to look at the first-order conditions for the estimator
$\hat{\boldsymbol{\theta}}$ and determine whether in the limit these have expectation zero when evaluated at
$\boldsymbol{\theta} = \boldsymbol{\theta}_0$.

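Even less formally, if we consider the components in the sum, the essential condition
for consistency is whether for the typical observation

$$\mathrm{E}\left[\partial q(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}\right] = \mathbf{0}. \tag{5.25}$$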

This condition can provide a very useful guide to the practitioner. However, it is neither
a necessary nor a sufficient condition. If the expectation in (5.25) equals $1/N$ then it
is still likely that the probability limit in (5.24) equals zero, so the condition (5.25) is
not necessary. To see that it is not sufficient, consider $y$ iid with mean $\mu_0$ estimated
using just one observation, say the first observation $y_1$. Then $\hat{\mu}$ solves $y_1 - \mu = 0$ and
(5.25) is satisfied. But clearly $y_1$ does not converge in probability to $\mu_0$, as the single
observation $y_1$ has a variance that does not go to zero. The problem is that here the plim
in (5.24) does not equal $\lim \mathrm{E}$.
Formal proof of consistency requires use of theorems such as Theorem 5.1 or 5.2.
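The counterexample can be checked by simulation. The sketch below is only an illustration under assumed settings (normal data with $\mu_0 = 1$ and unit variance; the function name `simulate` is hypothetical). It compares the sampling variance of the one-observation estimator $y_1$ with that of the sample mean as $N$ grows.

```python
import random
import statistics

def simulate(n, reps=500, mu0=1.0, seed=0):
    """Sampling variances of y_1 and of the sample mean across replications."""
    rng = random.Random(seed)
    first_obs, sample_mean = [], []
    for _ in range(reps):
        y = [rng.gauss(mu0, 1.0) for _ in range(n)]
        first_obs.append(y[0])           # estimator that uses only y_1
        sample_mean.append(sum(y) / n)   # estimator that uses all N observations
    return statistics.pvariance(first_obs), statistics.pvariance(sample_mean)

for n in (10, 100, 1000):
    v_first, v_mean = simulate(n)
    print(n, round(v_first, 3), round(v_mean, 4))
```

Both estimators satisfy (5.25), but only the variance of the sample mean shrinks with $N$; the variance of $y_1$ stays near one, so $y_1$ cannot converge in probability to $\mu_0$.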

For Poisson regression use of (5.25) reveals that the essential condition for consistency
is correct specification of the conditional mean of $y|\mathbf{x}$ (see Section 5.2.3). Similarly,
the OLS estimator solves $N^{-1}\sum_i \mathbf{x}_i(y_i - \mathbf{x}_i'\boldsymbol{\beta}) = \mathbf{0}$, so from (5.25) consistency
essentially requires that $\mathrm{E}[\mathbf{x}(y - \mathbf{x}'\boldsymbol{\beta}_0)] = \mathbf{0}$. This condition fails if $\mathrm{E}[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}_0$,
which can happen for many reasons, as given in Section 4.7. In other examples use
of (5.25) can indicate that consistency will require considerably more parametric
assumptions than correct specification of the conditional mean.
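The Poisson case can be illustrated numerically. The sketch below assumes NumPy is available and uses hypothetical parameter values; it draws overdispersed counts whose conditional mean is $\exp(\mathbf{x}'\boldsymbol{\beta}_0)$ but which are not Poisson, and evaluates the sample analogue of (5.25) for the Poisson first-order conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
beta0 = np.array([0.5, 0.3])                      # hypothetical true parameter
X = np.column_stack([np.ones(N), rng.normal(size=N)])
mu = np.exp(X @ beta0)

# Gamma-mixed Poisson: conditional mean exp(x'beta0), variance above the mean,
# so the data are NOT Poisson distributed.
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=N))

def avg_score(beta):
    """N^{-1} sum_i (y_i - exp(x_i'beta)) x_i, the Poisson first-order conditions."""
    return X.T @ (y - np.exp(X @ beta)) / N

score_at_truth = avg_score(beta0)
score_elsewhere = avg_score(beta0 + np.array([0.0, 0.2]))
print(score_at_truth, score_elsewhere)
```

The average is close to zero at $\boldsymbol{\beta}_0$ even though the distribution is misspecified, and clearly nonzero at another parameter value, which is the sense in which only the conditional mean needs to be correct.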

To  link  use  of  (5.24)  to  condition  (iii)  in  Theorem  5.2,  note  the  following: 


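$$\begin{aligned}
& \partial Q_0(\boldsymbol{\theta})/\partial\boldsymbol{\theta} = \mathbf{0} && \text{(condition (iii) in Theorem 5.2)} \\
\Rightarrow\ & \partial(\operatorname{plim} Q_N(\boldsymbol{\theta}))/\partial\boldsymbol{\theta} = \mathbf{0} && \text{(from definition of } Q_0(\boldsymbol{\theta})\text{)} \\
\Rightarrow\ & \partial(\lim \mathrm{E}[Q_N(\boldsymbol{\theta})])/\partial\boldsymbol{\theta} = \mathbf{0} && \text{(as an LLN} \Rightarrow Q_0 = \operatorname{plim} Q_N = \lim \mathrm{E}[Q_N]\text{)} \\
\Rightarrow\ & \lim \partial \mathrm{E}[Q_N(\boldsymbol{\theta})]/\partial\boldsymbol{\theta} = \mathbf{0} && \text{(interchanging limits and differentiation), and} \\
\Rightarrow\ & \lim \mathrm{E}[\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}] = \mathbf{0} && \text{(interchanging differentiation and expectation).}
\end{aligned}$$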


The last line is the informal condition (5.24). However, obtaining this result
requires additional assumptions, including restriction to a local maximum, application
of a law of large numbers, interchangeability of limits and differentiation, and
interchangeability of differentiation and expectation (i.e., integration). In the scalar
case a sufficient condition for interchanging differentiation and limits is
$\lim_{h \to 0} (\mathrm{E}[Q_N(\theta + h)] - \mathrm{E}[Q_N(\theta)])/h = d\mathrm{E}[Q_N(\theta)]/d\theta$ uniformly in $\theta$.


5.4.  Estimating  Equations 

The derivation of the limit distribution given in Section 5.3.3 can be extended from a
local extremum estimator to estimators defined as being the solution of an estimating
equation that sets an average to zero. Several examples are given in Chapter 6.


5.4.1. Estimating Equations Estimator

Let $\hat{\boldsymbol{\theta}}$ be defined as the solution to the system of $q$ estimating equations

$$\mathbf{h}_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{h}(y_i, \mathbf{x}_i, \boldsymbol{\theta}) = \mathbf{0}, \tag{5.26}$$

where $\mathbf{h}(\cdot)$ is a $q \times 1$ vector, and independence over $i$ is assumed. Examples of $\mathbf{h}(\cdot)$ are
given later in Section 5.4.2.

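Since $\hat{\boldsymbol{\theta}}$ is chosen so that the sample average of $\mathbf{h}(y, \mathbf{x}, \hat{\boldsymbol{\theta}})$ equals zero, we expect that
$\hat{\boldsymbol{\theta}} \overset{p}{\to} \boldsymbol{\theta}_0$ if in the limit the average of $\mathbf{h}(y, \mathbf{x}, \boldsymbol{\theta}_0)$ goes to zero, that is, if $\operatorname{plim} \mathbf{h}_N(\boldsymbol{\theta}_0) = \mathbf{0}$.
If an LLN can be applied this requires that $\lim \mathrm{E}[\mathbf{h}_N(\boldsymbol{\theta}_0)] = \mathbf{0}$, or more loosely that
for the $i$th observation

$$\mathrm{E}[\mathbf{h}(y_i, \mathbf{x}_i, \boldsymbol{\theta}_0)] = \mathbf{0}. \tag{5.27}$$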


The  easiest  way  to  formally  establish  consistency  is  actually  to  derive  (5.26)  as  the 
first-order  conditions  for  an  m-estimator. 

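Assuming consistency, the limit distribution of the estimating equations estimator
can be obtained in the same manner as in Section 5.3.3 for the extremum estimator.
Take an exact first-order Taylor series expansion of $\mathbf{h}_N(\boldsymbol{\theta})$ around $\boldsymbol{\theta}_0$, as in (5.15) with
$\mathbf{f}(\boldsymbol{\theta}) = \mathbf{h}_N(\boldsymbol{\theta})$, and set the right-hand side to $\mathbf{0}$ and solve. Then

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = -\left[\left.\frac{\partial \mathbf{h}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^+}\right]^{-1} \sqrt{N}\,\mathbf{h}_N(\boldsymbol{\theta}_0). \tag{5.28}$$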


This  leads  to  the  following  theorem. 


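Theorem 5.4 (Limit Distribution of Estimating Equations Estimator):
Assume that the estimating equations estimator that solves (5.26) is consistent
for $\boldsymbol{\theta}_0$ and make the following assumptions:

(i) $\partial \mathbf{h}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'$ exists and is continuous in an open convex neighborhood of $\boldsymbol{\theta}_0$.

(ii) $\partial \mathbf{h}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}^+}$ converges in probability to the finite nonsingular matrix

$$\mathbf{A}_0 = \operatorname{plim} \left.\frac{\partial \mathbf{h}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0} = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} \left.\frac{\partial \mathbf{h}_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}, \tag{5.29}$$

for any sequence $\boldsymbol{\theta}^+$ such that $\boldsymbol{\theta}^+ \overset{p}{\to} \boldsymbol{\theta}_0$.

(iii) $\sqrt{N}\,\mathbf{h}_N(\boldsymbol{\theta}_0) \overset{d}{\to} \mathcal{N}[\mathbf{0}, \mathbf{B}_0]$, where

$$\mathbf{B}_0 = \operatorname{plim} N \mathbf{h}_N(\boldsymbol{\theta}_0)\mathbf{h}_N(\boldsymbol{\theta}_0)' = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathbf{h}_i(\boldsymbol{\theta}_0)\mathbf{h}_j(\boldsymbol{\theta}_0)'. \tag{5.30}$$

Then the limit distribution of the estimating equations estimator is

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \overset{d}{\to} \mathcal{N}[\mathbf{0}, \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0'^{-1}], \tag{5.31}$$

where, unlike for the extremum estimator, the matrix $\mathbf{A}_0$ may not be symmetric
since it is no longer necessarily a Hessian matrix.

This theorem can be proved by adaptation of Amemiya's proof of Theorem 5.3.
Note that Theorem 5.4 assumes that consistency has already been established.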
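A small numerical sketch may help fix ideas. It assumes NumPy and uses a hypothetical instrumental-variables style estimating equation, $\mathbf{h}_i(\boldsymbol{\theta}) = \mathbf{z}_i(y_i - \mathbf{x}_i'\boldsymbol{\theta})$, chosen here only because the derivative matrix is then visibly nonsymmetric; the data-generating values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5_000
theta0 = np.array([1.0, 2.0])                     # hypothetical true values
z2 = rng.normal(size=N)
u = rng.normal(size=N)
x2 = 0.8 * z2 + 0.5 * u + rng.normal(size=N)      # regressor correlated with error u
X = np.column_stack([np.ones(N), x2])
Z = np.column_stack([np.ones(N), z2])             # E[z u] = 0 by construction
y = X @ theta0 + u

# Solve h_N(theta) = N^{-1} sum_i z_i (y_i - x_i'theta) = 0 (linear, so closed form).
theta_hat = np.linalg.solve(Z.T @ X, Z.T @ y)

resid = y - X @ theta_hat
A_hat = -(Z.T @ X) / N                            # estimate of d h_N / d theta'
h = Z * resid[:, None]                            # row i holds h_i(theta_hat)'
B_hat = h.T @ h / N                               # uses independence over i
A_inv = np.linalg.inv(A_hat)
V_hat = A_inv @ B_hat @ A_inv.T / N               # N^{-1} A^{-1} B A'^{-1}
se = np.sqrt(np.diag(V_hat))
print(theta_hat, se)
```

Because `A_hat` is not symmetric here, the transpose in `A_inv.T` matters, which is the point of the remark that closes Theorem 5.4.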

Godambe (1960) showed that for analysis conditional on regressors the most efficient
estimating equations estimator sets $\mathbf{h}_i(\boldsymbol{\theta}) = \partial \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$. Then (5.26) are
the first-order conditions for the ML estimator.


5.4.2.  Analogy  Principle 

The analogy principle uses population conditions to motivate estimators. The book
by Manski (1988a) emphasizes the importance of the analogy principle as a unifying
theme for estimation. Manski (1988a, p. xi) provides the following quote from
Goldberger (1968, p. 4):

The analogy principle of estimation . . . proposes that population parameters be
estimated by sample statistics which have the same property in the sample as the
parameters do in the population.

Analogue estimators are estimators obtained by application of the analogy principle.
Population moment conditions suggest as estimator the solution to the corresponding
sample moment condition.

Extremum estimator examples of application of the analogy principle have been
given in Section 4.2. For instance, if the goal of prediction is to minimize expected
loss in the population and squared error loss is used, then the regression parameters $\boldsymbol{\beta}$
are estimated by minimizing the sample sum of squared errors.

Method of moments estimators are also examples. For instance, in the iid case if
$\mathrm{E}[y_i - \mu] = 0$ in the population then we use as estimator $\hat{\mu}$ that solves the corresponding
sample moment condition $N^{-1}\sum_i (y_i - \mu) = 0$, leading to $\hat{\mu} = \bar{y}$, the sample
mean.

An estimating equations estimator may be motivated as an analogue estimator. If
(5.27) holds in the population then estimate $\boldsymbol{\theta}$ by solving the corresponding sample
moment condition (5.26).

Estimating  equations  estimators  are  extensively  used  in  microeconometrics.  The 
relevant  theory  can  be  subsumed  within  that  for  generalized  method  of  moments, 
presented  in  the  next  chapter,  which  is  an  extension  that  permits  there  to  be  more 
moment  conditions  than  parameters.  In  applied  statistics  the  approach  is  used  in  the 
context  of  generalized  estimating  equations. 


5.5.  Statistical  Inference 

A detailed treatment of hypothesis tests and confidence intervals is given in Chapter 7.
Here we outline how to test linear restrictions, including exclusion restrictions, using
the most common method, the Wald test for estimators that may be nonlinear. Asymptotic
theory is used, so formal results lead to chi-square and normal distributions rather
than the small sample $F$- and $t$-distributions from linear regression under normality.
Moreover, there are several ways to consistently estimate the variance matrix of an
extremum estimator, leading to alternative estimates of standard errors and associated
test statistics and $p$-values.


5.5.1.  Wald  Hypothesis  Tests  of  Linear  Restrictions 

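Consider testing $h$ linearly independent restrictions, say $H_0$ against $H_a$, where

$$H_0: \mathbf{R}\boldsymbol{\theta}_0 - \mathbf{r} = \mathbf{0},$$

$$H_a: \mathbf{R}\boldsymbol{\theta}_0 - \mathbf{r} \neq \mathbf{0},$$

with $\mathbf{R}$ an $h \times q$ matrix of constants and $\mathbf{r}$ an $h \times 1$ vector of constants. For example,
if $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]'$ then to test whether $\theta_{10} - \theta_{20} = 2$, $\mathbf{R} = [1, -1, 0]$ and $\mathbf{r} = 2$, so that
$\mathbf{R}\boldsymbol{\theta}_0 - \mathbf{r} = \theta_{10} - \theta_{20} - 2$.

The Wald test rejects $H_0$ if $\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r}$, the sample estimate of $\mathbf{R}\boldsymbol{\theta}_0 - \mathbf{r}$, is significantly
different from $\mathbf{0}$. This requires knowledge of the distribution of $\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r}$. Suppose
$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \overset{d}{\to} \mathcal{N}[\mathbf{0}, \mathbf{C}_0]$, where $\mathbf{C}_0 = \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$ from (5.20). Then

$$\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\theta}_0, N^{-1}\mathbf{C}_0],$$

so that under $H_0$ the linear combination

$$\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r} \overset{a}{\sim} \mathcal{N}[\mathbf{0}, \mathbf{R}(N^{-1}\mathbf{C}_0)\mathbf{R}'],$$

where the mean is zero since $\mathbf{R}\boldsymbol{\theta}_0 - \mathbf{r} = \mathbf{0}$ under $H_0$.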


Chi-Square  Tests 

It is convenient to move from the multivariate normal distribution to the chi-square
distribution by taking the quadratic form. This yields the Wald statistic

$$W = (\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r})' \left[\mathbf{R}(N^{-1}\hat{\mathbf{C}})\mathbf{R}'\right]^{-1} (\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r}) \overset{d}{\to} \chi^2(h) \tag{5.32}$$

under $H_0$, where $\mathbf{R}(N^{-1}\mathbf{C}_0)\mathbf{R}'$ is of full rank $h$ under the assumption of linearly
independent restrictions, and $\hat{\mathbf{C}}$ is a consistent estimator of $\mathbf{C}_0$. Large values of $W$ lead to
rejection, and $H_0$ is rejected at level $\alpha$ if $W > \chi^2_\alpha(h)$ and is not rejected otherwise.

Practitioners frequently instead use the $F$-statistic $F = W/h$. Inference is then based
on the $F(h, N - q)$ distribution in the hope that this might provide a better finite-sample
approximation. Note that $h$ times the $F(h, N)$ distribution converges to the $\chi^2(h)$
distribution as $N \to \infty$.

The replacement of $\mathbf{C}_0$ by $\hat{\mathbf{C}}$ in obtaining (5.32) makes no difference asymptotically,
but in finite samples different $\hat{\mathbf{C}}$ will lead to different values of $W$. In the case of
classical linear regression this step corresponds to replacing $\sigma^2$ by $s^2$. Then $W/h$ is
exactly $F$ distributed if the errors are normally distributed (see Section 7.2.1).
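The Wald statistic (5.32) is a few lines of linear algebra. The sketch below assumes NumPy; the estimates `theta_hat` and `C_hat` and the sample size are invented for illustration, and the restriction tested is $\theta_1 - \theta_2 = 2$, written as $\mathbf{R} = [1, -1, 0]$ with $\mathbf{r} = 2$ so that $\mathbf{R}\boldsymbol{\theta} - \mathbf{r} = 0$ under the null. The helper name `wald_statistic` is hypothetical.

```python
import math
import numpy as np

def wald_statistic(theta_hat, C_hat, N, R, r):
    """W = (R theta - r)' [R (C/N) R']^{-1} (R theta - r), as in (5.32)."""
    R = np.atleast_2d(R)
    d = R @ theta_hat - np.atleast_1d(r)
    V = R @ (C_hat / N) @ R.T          # estimated variance of R theta_hat - r
    return float(d @ np.linalg.solve(V, d))

# Hypothetical estimates, for illustration only.
theta_hat = np.array([3.1, 0.9, 0.4])
C_hat = np.array([[4.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
N = 400
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([2.0])

W = wald_statistic(theta_hat, C_hat, N, R, r)
p_value = math.erfc(math.sqrt(W / 2.0))   # chi-square(1) upper tail; valid only for h = 1
F = W / R.shape[0]                        # F-statistic version, F = W / h
print(W, p_value, F)
```

With these made-up numbers $W$ is about 4, which exceeds the 5% critical value of $\chi^2(1)$, roughly 3.84. The closed-form $p$-value via `math.erfc` applies only with a single restriction; for general $h$ a chi-square distribution function from a statistics library would be needed.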


Tests  of  a  Single  Coefficient 

Often attention is focused on testing difference from zero of a single coefficient, say the
$j$th coefficient. Then $\mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r} = \hat{\theta}_j$ and $W = \hat{\theta}_j^2/(N^{-1}\hat{c}_{jj})$, where $\hat{c}_{jj}$ is the $j$th diagonal
element in $\hat{\mathbf{C}}$. Taking the square root of $W$ yields

$$t = \frac{\hat{\theta}_j}{\mathrm{se}[\hat{\theta}_j]} \overset{d}{\to} \mathcal{N}[0, 1] \tag{5.33}$$

under $H_0$, where $\mathrm{se}[\hat{\theta}_j] = \sqrt{N^{-1}\hat{c}_{jj}}$ is the asymptotic standard error of $\hat{\theta}_j$. Large values
of $t$ lead to rejection, and unlike $W$ the statistic $t$ can be used for one-sided tests.

Formally $\sqrt{W}$ is an asymptotic $z$-statistic, but we use the notation $t$ as it yields
the usual "$t$-statistic," the estimate divided by its standard error. In finite samples,
some statistical packages use the standard normal distribution whereas others use the
$t$-distribution to compute critical values, $p$-values, and confidence intervals. Neither is
exactly correct in finite samples, except in the very special case of linear regression
with errors assumed to be normally distributed, in which case the $t$-distribution is
exact. Both lead to the same results in infinitely large samples as the $t$-distribution
then collapses to the standard normal.
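As a minimal sketch of (5.33), the function below computes the asymptotic standard error, the $t$-statistic, and a two-sided $p$-value based on the standard normal. The function name and the input values are hypothetical, and the choice of the normal rather than a $t$-distribution is an assumption of this example.

```python
from statistics import NormalDist

def t_test(theta_j, c_jj, n):
    """Return (se, t, two-sided p-value) for H0: theta_j = 0 using the standard normal."""
    se = (c_jj / n) ** 0.5                  # se = sqrt(N^{-1} c_jj)
    t = theta_j / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return se, t, p_two_sided

# Invented numbers: estimate 0.15, diagonal element 2.25, N = 400.
se, t, p = t_test(0.15, 2.25, 400)
print(se, t, p)
```

Here the standard error is 0.075 and $t$ is 2, so $t^2 = 4$ is the corresponding single-restriction Wald statistic.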


5.5.2.  Variance  Matrix  Estimation 

There are many possible ways to estimate $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0'^{-1}$, because there are many ways to
consistently estimate $\mathbf{A}_0$ and $\mathbf{B}_0$. Thus different econometrics programs should give the
same coefficient estimates but, quite reasonably, can give standard errors, $t$-statistics,
and $p$-values that differ in finite samples. It is up to the practitioner to determine the
method used and the strength of the associated distributional assumptions on the dgp.


Sandwich  Estimate  of  the  Variance  Matrix 

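The limit distribution of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ has variance matrix $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0'^{-1}$. It follows that
$\hat{\boldsymbol{\theta}}$ has asymptotic variance matrix $N^{-1}\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0'^{-1}$, where division by $N$ arises because
we are considering $\hat{\boldsymbol{\theta}}$ rather than $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$.

A sandwich estimate of the asymptotic variance of $\hat{\boldsymbol{\theta}}$ is any estimate of the form

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}'^{-1}, \tag{5.34}$$

where $\hat{\mathbf{A}}$ is consistent for $\mathbf{A}_0$ and $\hat{\mathbf{B}}$ is consistent for $\mathbf{B}_0$. This is called the sandwich
form since $\hat{\mathbf{B}}$ is sandwiched between $\hat{\mathbf{A}}^{-1}$ and $\hat{\mathbf{A}}'^{-1}$. For many estimators $\hat{\mathbf{A}}$ is a
Hessian matrix so $\hat{\mathbf{A}}^{-1}$ is symmetric, but this need not always be the case.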

A robust sandwich estimate is a sandwich estimate where the estimate $\hat{\mathbf{B}}$ is consistent
for $\mathbf{B}_0$ under relatively weak assumptions. It leads to what are termed robust
standard errors. A leading example is White's heteroskedastic-consistent estimate of
the variance matrix of the OLS estimator (see Section 4.4.5). In various specific contexts,
detailed in later sections, robust sandwich estimates are called Huber estimates,
after Huber (1967); Eicker-White estimates, after Eicker (1967) and White (1980a,b,
1982); and in stationary time-series applications Newey-West estimates, after Newey
and West (1987b).


Estimation of A and B


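Here we present different estimators for $\mathbf{A}_0$ and $\mathbf{B}_0$ for both the estimating equations
estimator that solves $\mathbf{h}_N(\hat{\boldsymbol{\theta}}) = \mathbf{0}$ and the local extremum estimator that solves
$\partial Q_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\hat{\boldsymbol{\theta}}} = \mathbf{0}$.

Two standard estimates of $\mathbf{A}_0$ in (5.29) and (5.18) are the Hessian estimate

$$\hat{\mathbf{A}}_{\mathrm{H}} = \left.\frac{\partial \mathbf{h}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}} = \left.\frac{\partial^2 Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}}, \tag{5.35}$$

where the second equality explains the use of the term Hessian, and the expected
Hessian estimate

$$\hat{\mathbf{A}}_{\mathrm{EH}} = \left.\mathrm{E}\left[\frac{\partial \mathbf{h}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right]\right|_{\hat{\boldsymbol{\theta}}} = \left.\mathrm{E}\left[\frac{\partial^2 Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right]\right|_{\hat{\boldsymbol{\theta}}}. \tag{5.36}$$

The first is analytically simpler and potentially relies on fewer distributional assumptions;
the latter is more likely to be negative definite and invertible.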

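For $\mathbf{B}_0$ in (5.30) or (5.19) it is not possible to use the obvious estimate
$N\mathbf{h}_N(\hat{\boldsymbol{\theta}})\mathbf{h}_N(\hat{\boldsymbol{\theta}})'$, since this equals zero as $\hat{\boldsymbol{\theta}}$ is defined to satisfy $\mathbf{h}_N(\hat{\boldsymbol{\theta}}) = \mathbf{0}$. One
approach is to make potentially strong distributional assumptions to get

$$\hat{\mathbf{B}}_{\mathrm{E}} = \left.\mathrm{E}\left[N\mathbf{h}_N(\boldsymbol{\theta})\mathbf{h}_N(\boldsymbol{\theta})'\right]\right|_{\hat{\boldsymbol{\theta}}} = \left.\mathrm{E}\left[N\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\frac{\partial Q_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right]\right|_{\hat{\boldsymbol{\theta}}}. \tag{5.37}$$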


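Weaker assumptions are possible for m-estimators and estimating equations estimators
with data independent over $i$. Then (5.30) simplifies to

$$\mathbf{B}_0 = \mathrm{E}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbf{h}_i(\boldsymbol{\theta}_0)\mathbf{h}_i(\boldsymbol{\theta}_0)'\right],$$

since independence implies that, for $i \neq j$, $\mathrm{E}[\mathbf{h}_i\mathbf{h}_j'] = \mathrm{E}[\mathbf{h}_i]\mathrm{E}[\mathbf{h}_j']$, which in turn
equals zero given $\mathrm{E}[\mathbf{h}_i(\boldsymbol{\theta}_0)] = \mathbf{0}$. This leads to the outer product (OP) estimate or
BHHH estimate (after Berndt, Hall, Hall, and Hausman, 1974)

$$\hat{\mathbf{B}}_{\mathrm{OP}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{h}_i(\hat{\boldsymbol{\theta}})\mathbf{h}_i(\hat{\boldsymbol{\theta}})' = \frac{1}{N}\sum_{i=1}^{N} \left.\frac{\partial q_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\hat{\boldsymbol{\theta}}} \left.\frac{\partial q_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}}. \tag{5.38}$$

$\hat{\mathbf{B}}_{\mathrm{OP}}$ requires fewer assumptions than $\hat{\mathbf{B}}_{\mathrm{E}}$.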

In practice a degrees of freedom adjustment is often used in estimating $\mathbf{B}_0$, with
division in (5.38) for $\hat{\mathbf{B}}_{\mathrm{OP}}$ by $(N - q)$ rather than $N$, and similar multiplication of $\hat{\mathbf{B}}_{\mathrm{E}}$
in (5.37) by $N/(N - q)$. There is no theoretical justification for this adjustment in
nonlinear models, but in some simulation studies this adjustment leads to better
finite-sample performance and it does coincide with the degrees of freedom adjustment made
for OLS with homoskedastic errors. No similar adjustment is made for $\hat{\mathbf{A}}_{\mathrm{H}}$ or $\hat{\mathbf{A}}_{\mathrm{EH}}$.

Simplification occurs in some special cases with $\mathbf{A}_0 = -\mathbf{B}_0$. Leading examples are
OLS or NLS with homoskedastic errors (see Section 5.8.3) and maximum likelihood
with correctly specified distribution (see Section 5.6.4). Then either $-\hat{\mathbf{A}}^{-1}$ or $\hat{\mathbf{B}}^{-1}$ may
be used to estimate the variance of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$. These estimates are less robust to
misspecification of the dgp than those using the sandwich form. Misspecification of
the dgp, however, may additionally lead to inconsistency of $\hat{\boldsymbol{\theta}}$, in which case even
inference based on the robust sandwich estimate will be invalid.

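For the Poisson example of Section 5.2, $\hat{\mathbf{A}}_{\mathrm{H}} = \hat{\mathbf{A}}_{\mathrm{EH}} = -N^{-1}\sum_i \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})\mathbf{x}_i\mathbf{x}_i'$ and
$\hat{\mathbf{B}}_{\mathrm{OP}} = (N - q)^{-1}\sum_i (y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}))^2\mathbf{x}_i\mathbf{x}_i'$. If $\mathrm{V}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$, the case if $y|\mathbf{x}$ is
actually Poisson distributed, then $\hat{\mathbf{B}}_{\mathrm{E}} = -[N/(N - q)]\hat{\mathbf{A}}_{\mathrm{EH}}$ and simplification occurs.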
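The Poisson formulas can be put to work in a short sketch, assuming NumPy. The data are simulated with hypothetical parameter values and deliberate overdispersion, the estimator is computed by a plain Newton-Raphson loop written for this example, and the robust sandwich standard errors are compared with those that impose $\mathbf{A}_0 = -\mathbf{B}_0$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 5_000, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, 0.3])                           # hypothetical true values
# Overdispersed counts: conditional mean exp(x'beta0), variance above the mean.
y = rng.poisson(np.exp(X @ beta0) * rng.gamma(2.0, 0.5, size=N))

beta = np.zeros(q)
for _ in range(50):                                    # Newton-Raphson on the first-order conditions
    mu = np.exp(X @ beta)
    grad = X.T @ (y - mu) / N
    hess = -(X * mu[:, None]).T @ X / N
    step = np.linalg.solve(hess, grad)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-12:
        break

mu = np.exp(X @ beta)
A_H = -(X * mu[:, None]).T @ X / N                     # Hessian estimate (equals A_EH here)
B_OP = (X * ((y - mu) ** 2)[:, None]).T @ X / (N - q)  # outer product estimate, d.o.f. adjusted
A_inv = np.linalg.inv(A_H)
V_robust = A_inv @ B_OP @ A_inv.T / N                  # sandwich form (5.34)
V_ml = -A_inv / N                                      # appropriate only if V[y|x] = exp(x'beta0)
se_robust = np.sqrt(np.diag(V_robust))
se_ml = np.sqrt(np.diag(V_ml))
print(beta, se_robust, se_ml)
```

Because the simulated variance exceeds the mean, the nonrobust standard errors are too small in this design, while the coefficient estimates themselves remain close to the values used to generate the data.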


5.6.  Maximum  Likelihood 

The ML estimator holds a special place among estimators. It is the most efficient estimator
among consistent asymptotically normal estimators. It is also important pedagogically,
as many methods for nonlinear regression such as m-estimation can be viewed
as extensions and adaptations of results first obtained for ML estimation.


5.6.1.  Likelihood  Function 
The  Likelihood  Principle 

The likelihood principle, due to R. A. Fisher (1922), is to choose as estimator of the
parameter vector $\boldsymbol{\theta}_0$ that value of $\boldsymbol{\theta}$ that maximizes the likelihood of observing the actual
sample. In the discrete case this likelihood is the probability obtained from the
probability mass function; in the continuous case this is the density. Consider the discrete
case. If one value of $\boldsymbol{\theta}$ implies that the probability of the observed data occurring
is .0012, whereas a second value of $\boldsymbol{\theta}$ gives a higher probability of .0014, then the
second value of $\boldsymbol{\theta}$ is a better estimator.

The joint probability mass function or density $f(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta})$ is viewed here as a function
of $\boldsymbol{\theta}$ given the data $(\mathbf{y}, \mathbf{X})$. This is called the likelihood function and is denoted
by $L_N(\boldsymbol{\theta}|\mathbf{y}, \mathbf{X})$. Maximizing $L_N(\boldsymbol{\theta})$ is equivalent to maximizing the log-likelihood
function

$$\mathcal{L}_N(\boldsymbol{\theta}) = \ln L_N(\boldsymbol{\theta}).$$

We take the natural logarithm because in application this leads to an objective function
that is the sum rather than the product of $N$ terms.
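A practical side benefit of working with the sum, beyond the analytical convenience noted above, is numerical: a product of many densities can underflow in floating-point arithmetic while the sum of their logarithms remains well behaved. The sketch below is an illustration under assumed settings (standard normal data; the helper name is hypothetical).

```python
import math
import random

random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(2000)]

def normal_density(v, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

likelihood = 1.0
for v in y:
    likelihood *= normal_density(v)        # product of N densities

log_likelihood = sum(math.log(normal_density(v)) for v in y)   # sum of N log densities
print(likelihood, log_likelihood)
```

With this many observations the product is rounded to zero in double precision, whereas the log-likelihood is a finite negative number that can still be compared across parameter values.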


Conditional  Likelihood 

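The likelihood function $L_N(\boldsymbol{\theta}) = f(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta}) = f(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) f(\mathbf{X}|\boldsymbol{\theta})$ requires specification
of both the conditional density of $\mathbf{y}$ given $\mathbf{X}$ and the marginal density of $\mathbf{X}$.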

Instead, estimation is usually based on the conditional likelihood function
$L_N(\boldsymbol{\theta}) = f(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$, since the goal of regression is to model the behavior of $\mathbf{y}$ given
$\mathbf{X}$. This is not a restriction if $f(\mathbf{y}|\mathbf{X})$ and $f(\mathbf{X})$ depend on mutually exclusive sets
of parameters. When this is the case it is common terminology to drop the adjective
conditional. For rare exceptions such as endogenous sampling (see Chapters 3 and
24) consistent estimation requires that estimation is based on the full joint density
$f(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta})$ rather than the conditional density $f(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$.

Table 5.3. Maximum Likelihood: Commonly Used Densities

Model         Range of y             Density f(y)                                         Common Parameterization
Normal        $(-\infty, \infty)$    $[2\pi\sigma^2]^{-1/2} e^{-(y-\mu)^2/2\sigma^2}$      $\mu = \mathbf{x}'\boldsymbol{\beta}$, $\sigma^2 = \sigma^2$
Bernoulli     0 or 1                 $p^y(1-p)^{1-y}$                                     Logit $p = e^{\mathbf{x}'\boldsymbol{\beta}}/(1 + e^{\mathbf{x}'\boldsymbol{\beta}})$
Exponential   $(0, \infty)$          $\lambda e^{-\lambda y}$                             $\lambda = e^{\mathbf{x}'\boldsymbol{\beta}}$ or $1/\lambda = e^{\mathbf{x}'\boldsymbol{\beta}}$
Poisson       0, 1, 2, ...           $e^{-\lambda}\lambda^y/y!$                           $\lambda = e^{\mathbf{x}'\boldsymbol{\beta}}$

For cross-section data the observations $(y_i, \mathbf{x}_i)$ are independent over $i$ with conditional
density function $f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$. Then by independence the joint conditional density
$f(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$, leading to the (conditional) log-likelihood function

$$Q_N(\boldsymbol{\theta}) = N^{-1}\mathcal{L}_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta}), \tag{5.39}$$

where we divide by $N$ so that the objective function is an average.

Results extend to multivariate data, systems of equations, and panel data by replacing
the scalar $y_i$ by vector $\mathbf{y}_i$ and letting $f(\mathbf{y}_i|\mathbf{x}_i, \boldsymbol{\theta})$ be the joint density of $\mathbf{y}_i$
conditional on $\mathbf{x}_i$. See also Section 5.7.5.


Examples 

Across a wide range of data types the following method is used to generate fully
parametric cross-section regression models. First choose the one-parameter or
two-parameter (or in some rare cases three-parameter) distribution that would be used for
the dependent variable $y$ in the iid case studied in a basic statistics course. Then
parameterize the one or two underlying parameters in terms of regressors $\mathbf{x}$ and
parameters $\boldsymbol{\theta}$.

Some  commonly  used  distributions  and  parameterizations  are  given  in  Table  5.3. 
Additional  distributions  are  given  in  Appendix  B,  which  also  presents  methods  to  draw 
pseudo-random  variates. 

For continuous data on $(-\infty, \infty)$, the normal is the standard distribution. The classical
linear regression model sets $\mu = \mathbf{x}'\boldsymbol{\beta}$ and assumes $\sigma^2$ is constant.

For discrete binary data taking values 0 or 1, the density is always the Bernoulli,
a special case of the binomial with one trial. The usual parameterizations for the
Bernoulli probability lead to the logit model, given in Table 5.3, and the probit model
with $p = \Phi(\mathbf{x}'\boldsymbol{\beta})$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function.
These models are analyzed in Chapter 14.

For positive continuous data on $(0, \infty)$, notably duration data considered in Chapters
17-19, the richer Weibull, gamma, and log-normal models are often used in addition
to the exponential given in Table 5.3.

For integer-valued count data taking values 0, 1, 2, ... (see Chapter 20) the richer
negative binomial is often used in addition to the Poisson presented in Section 5.2.1.
Setting $\lambda = \exp(\mathbf{x}'\boldsymbol{\beta})$ ensures a positive conditional mean.

For incompletely observed data, censored or truncated variants of these distributions
may be used. The most common example is the censored normal, which is called the
Tobit model and is presented in Section 16.3.

Standard likelihood-based models are rarely specified by making assumptions on
the distribution of an error term. They are instead defined directly in terms of the
distribution of the dependent variable. In the special case that $y \sim \mathcal{N}[\mathbf{x}'\boldsymbol{\beta}, \sigma^2]$ we can
equivalently define $y = \mathbf{x}'\boldsymbol{\beta} + u$, where the error term $u \sim \mathcal{N}[0, \sigma^2]$. However, this
relies on an additive property of the normal shared by few other distributions. For
example, if $y$ is Poisson distributed with mean $\exp(\mathbf{x}'\boldsymbol{\beta})$ we can always write
$y = \exp(\mathbf{x}'\boldsymbol{\beta}) + u$, but the error $u$ no longer has a familiar distribution.


5.6.2.  Maximum  Likelihood  Estimator 

The maximum likelihood estimator (MLE) is the estimator that maximizes the (conditional)
log-likelihood function and is clearly an extremum estimator. Usually the
MLE is the local maximum that solves the first-order conditions

$$\frac{1}{N}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \mathbf{0}. \tag{5.40}$$

More formally this estimator is the conditional MLE, as it is based on the conditional
density of $y$ given $\mathbf{x}$, but it is common practice to use the simpler term MLE.

The gradient vector $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ is called the score vector, as it sums the first derivatives
of the log density, and when evaluated at $\boldsymbol{\theta}_0$ it is called the efficient score.
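As a minimal sketch of solving the first-order conditions, consider iid data from the exponential density of Table 5.3 with no regressors, so that the average log-likelihood is $\ln\lambda - \lambda\bar{y}$ and the average score is $1/\lambda - \bar{y}$. The true value, starting value, and function names below are assumptions made for the example.

```python
import random

random.seed(3)
lam0 = 2.0                                  # hypothetical true rate
y = [random.expovariate(lam0) for _ in range(10_000)]
ybar = sum(y) / len(y)

def avg_score(lam):
    """N^{-1} times the score: derivative of ln(lam) - lam * ybar."""
    return 1.0 / lam - ybar

def avg_hessian(lam):
    """N^{-1} times the second derivative of the log-likelihood."""
    return -1.0 / lam ** 2

lam = 1.0                                   # starting value
for _ in range(100):                        # Newton-Raphson iterations
    step = avg_score(lam) / avg_hessian(lam)
    lam -= step
    if abs(step) < 1e-12:
        break

print(lam, 1.0 / ybar, avg_score(lam))
```

The iterations settle on $\hat{\lambda} = 1/\bar{y}$, the closed-form solution of the first-order condition, at which the score is zero up to rounding.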


5.6.3.  Information  Matrix  Equality 

The results of Section 5.3 simplify for the MLE, provided the density is correctly
specified and is one for which the range of $y$ does not depend on $\boldsymbol{\theta}$.


Regularity  Conditions 

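The ML regularity conditions are that

$$\mathrm{E}_f\left[\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right] = \int \frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}} f(y|\mathbf{x}, \boldsymbol{\theta})\,dy = \mathbf{0} \tag{5.41}$$

and

$$-\mathrm{E}_f\left[\frac{\partial^2 \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right] = \mathrm{E}_f\left[\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right], \tag{5.42}$$

where the notation $\mathrm{E}_f[\cdot]$ is used to make explicit that the expectation is with respect to
the specified density $f(y|\mathbf{x}, \boldsymbol{\theta})$. Result (5.41) implies that the score vector has expected
value zero, and (5.42) yields (5.44).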

Derivation given in Section 5.6.7 requires that the range of $y$ does not depend on $\boldsymbol{\theta}$
so that integration and differentiation can be interchanged.


Information Matrix Equality


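The information matrix is the expectation of the outer product of the score vector,

$$\mathcal{I} = \mathrm{E}\left[\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right]. \tag{5.43}$$

The terminology information matrix is used as $\mathcal{I}$ is the variance of $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$, since
by (5.41) $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ has mean zero. Then large values of $\mathcal{I}$ mean that small changes
in $\boldsymbol{\theta}$ lead to large changes in the log-likelihood, which accordingly contains considerable
information about $\boldsymbol{\theta}$. The quantity $\mathcal{I}$ is more precisely called Fisher Information,
as there are alternative information measures.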

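For log-likelihood function (5.39), the regularity condition (5.42) implies that

$$-\mathrm{E}_f\left[\left.\frac{\partial^2 \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right] = \mathrm{E}_f\left[\left.\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right], \tag{5.44}$$

if the expectation is with respect to $f(y|\mathbf{x}, \boldsymbol{\theta}_0)$. The relationship (5.44) is called the
information matrix (IM) equality and implies that the information matrix also equals
$-\mathrm{E}[\partial^2 \mathcal{L}_N(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}']$. The IM equality (5.44) implies that $-\mathbf{A}_0 = \mathbf{B}_0$, where $\mathbf{A}_0$ and
$\mathbf{B}_0$ are defined in (5.18) and (5.19). Theorem 5.3 then simplifies since
$\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1} = -\mathbf{A}_0^{-1} = \mathbf{B}_0^{-1}$.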
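The IM equality can be checked by simulation in a scalar case. The sketch below uses the exponential density $f(y) = \lambda e^{-\lambda y}$, for which the score is $1/\lambda - y$ and the second derivative of the log density is $-1/\lambda^2$. The alternative uniform dgp, chosen to have the same mean $1/\lambda_0$, is an assumption of this example used to show what happens when the density is misspecified.

```python
import random

rng = random.Random(4)
N = 200_000
lam0 = 2.0                                   # hypothetical parameter value

def im_components(y, lam):
    """Return (-average Hessian, average squared score) for the exponential density."""
    minus_hessian = 1.0 / lam ** 2
    outer_product = sum((1.0 / lam - v) ** 2 for v in y) / len(y)
    return minus_hessian, outer_product

y_correct = [rng.expovariate(lam0) for _ in range(N)]    # dgp is the specified density
y_wrong = [rng.uniform(0.0, 1.0) for _ in range(N)]      # mean 1/2, but not exponential

print(im_components(y_correct, lam0))
print(im_components(y_wrong, lam0))
```

Under the correctly specified density both components are close to $1/\lambda_0^2 = 0.25$. Under the uniform dgp the score still has mean zero at $\lambda_0$, but the outer product is near $1/12$, so $-\mathbf{A}_0 \neq \mathbf{B}_0$ and the sandwich form no longer collapses.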

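The equality (5.42) is in turn a special case of the generalized information matrix
equality

$$\mathrm{E}_f\left[\frac{\partial \mathbf{m}(y, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right] = -\mathrm{E}_f\left[\mathbf{m}(y, \boldsymbol{\theta})\frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right], \tag{5.45}$$

where $\mathbf{m}(\cdot)$ is a vector moment function with $\mathrm{E}_f[\mathbf{m}(y, \boldsymbol{\theta})] = \mathbf{0}$ and expectations are
with respect to the density $f(y|\boldsymbol{\theta})$. This result, also obtained in Section 5.6.7, is used
in Chapters 7 and 8 to obtain simpler forms of some test statistics.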


5.6.4.  Distribution  of  the  ML  Estimator 

The regularity conditions (5.41) and (5.42) lead to simplification of the general results
of Section 5.3.

The essential consistency condition (5.25) is that $\mathrm{E}[\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}] = \mathbf{0}$. This
holds by the regularity condition (5.41), provided the expectation is with respect to
$f(y|\mathbf{x}, \boldsymbol{\theta}_0)$. Thus if the dgp is $f(y|\mathbf{x}, \boldsymbol{\theta}_0)$, that is, the density has been correctly specified,
the MLE is consistent for $\boldsymbol{\theta}_0$.

For the asymptotic distribution, simplification occurs since $-\mathbf{A}_0 = \mathbf{B}_0$ by the IM
equality, which again assumes that the density is correctly specified.

These  results  can  be  collected  into  the  following  proposition. 

Proposition 5.5 (Distribution of ML Estimator): Make the following assumptions:

(i) The dgp is the conditional density $f(y_i|\mathbf{x}_i,\theta_0)$ used to define the likelihood function.

(ii) The density function $f(\cdot)$ satisfies $f(y,\theta^{(1)}) = f(y,\theta^{(2)})$ iff $\theta^{(1)} = \theta^{(2)}$.

(iii) The matrix

$$\mathbf{A}_0 = \operatorname{plim}\frac{1}{N}\left.\frac{\partial^2 \mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta_0} \tag{5.46}$$

exists and is finite nonsingular.

(iv) The order of differentiation and integration of the log-likelihood can be reversed.

Then the ML estimator $\hat\theta_{ML}$, defined to be a solution of the first-order conditions $\partial N^{-1}\mathcal{L}_N(\theta)/\partial\theta = \mathbf{0}$, is consistent for $\theta_0$, and

$$\sqrt{N}(\hat\theta_{ML} - \theta_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, -\mathbf{A}_0^{-1}\right]. \tag{5.47}$$


Condition (i) states that the conditional density is correctly specified; conditions (i) and (ii) ensure that $\theta_0$ is identified; condition (iii) is analogous to the assumption on $\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X}$ in the case of OLS estimation; and condition (iv) is necessary for the regularity conditions to hold. As in the general case probability limits and expectations are with respect to the dgp for $(\mathbf{y}, \mathbf{X})$, or with respect to just $\mathbf{y}$ if regressors are assumed to be nonstochastic or analysis is conditional on $\mathbf{X}$.

Relaxation of condition (i) is considered in detail in Section 5.7. Most ML examples satisfy condition (iv), but it does rule out some models such as $y$ uniformly distributed on the interval $[0, \theta]$ since in this case the range of $y$ varies with $\theta$. Then not only does $\mathbf{A}_0 \neq -\mathbf{B}_0$ but the global MLE converges at a rate other than $\sqrt{N}$ and has limit distribution that is nonnormal. See, for example, Hirano and Porter (2003).
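The nonstandard behavior in the uniform case can be seen in a small simulation. The sketch below is illustrative only: the sample size, number of replications, and seed are arbitrary assumptions. It uses the fact that for y uniform on [0, theta] the likelihood theta^(-N) is maximized at the sample maximum.

```python
import random

random.seed(0)
theta0 = 2.0      # true upper limit of the uniform density
N = 500           # sample size
reps = 2000       # Monte Carlo replications

scaled_errors = []
for _ in range(reps):
    sample = [random.uniform(0.0, theta0) for _ in range(N)]
    theta_hat = max(sample)                       # MLE is the sample maximum
    scaled_errors.append(N * (theta0 - theta_hat))

mean_scaled = sum(scaled_errors) / reps
share_positive = sum(e >= 0 for e in scaled_errors) / reps
print(mean_scaled, share_positive)
```

The estimation error is rescaled by N rather than by the square root of N, and it is never negative, so the limit distribution is one-sided and cannot be normal. The exact mean of N(theta0 - theta_hat) is theta0 N/(N + 1).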

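Given Proposition 5.5, the resulting asymptotic distribution of the MLE is often expressed as

$$\hat\theta_{ML} \overset{a}{\sim} \mathcal{N}\left[\theta, -\left(\mathrm{E}\left[\frac{\partial^2 \mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right]\right)^{-1}\right], \tag{5.48}$$

where for notational simplicity the evaluation at $\theta_0$ is suppressed and we assume that an LLN applies so that the plim operator in the definition of $\mathbf{A}_0$ is replaced by $\lim \mathrm{E}$ and then drop the limit. This notation is often used in later chapters.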

The variance on the right-hand side of (5.48) is the Cramér-Rao lower bound (CRLB), which from basic statistics courses is the lower bound of the variance of unbiased estimators in small samples. For large samples, considered here, the CRLB is the lower bound for the variance matrix of consistent asymptotically normal (CAN) estimators with convergence to normality of $\sqrt{N}(\hat\theta - \theta_0)$ uniform in compact intervals of $\theta_0$ (see Rao, 1973, pp. 344-351). Loosely speaking the MLE has the strong attraction of having the smallest asymptotic variance among root-N consistent estimators. This result requires the strong assumption of correct specification of the conditional density.
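A small Monte Carlo experiment can make the limit result concrete. The sketch below is an illustration under assumed settings (exponential density with rate lam0 = 2, N = 400, 3,000 replications). For this density the per-observation Hessian is -1/lam^2, so the asymptotic variance of the MLE is lam0^2.

```python
import random
import statistics

random.seed(1)
lam0 = 2.0        # true rate parameter of the exponential density
N = 400
reps = 3000

z = []
for _ in range(reps):
    y = [random.expovariate(lam0) for _ in range(N)]
    lam_hat = 1.0 / (sum(y) / N)              # MLE solves 1/lam - mean(y) = 0
    z.append(N**0.5 * (lam_hat - lam0))

mc_variance = statistics.pvariance(z)         # simulated variance of sqrt(N)*(lam_hat - lam0)
crlb = lam0**2                                # -A0^{-1}, since A0 = -1/lam0^2
print(mc_variance, crlb)
```

The simulated variance should lie close to the bound, with a small upward finite-sample discrepancy that shrinks as N grows.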


5.6.5.  Weibull  Regression  Example 

As  an  example,  consider  regression  based  on  the  Weibull  distribution,  which  is  used  to 
model  duration  data  such  as  length  of  unemployment  spell  (see  Chapter  17). 




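The density for the Weibull distribution is $f(y) = \gamma\alpha y^{\alpha-1}\exp(-\gamma y^{\alpha})$, where $y > 0$ and the parameters $\alpha > 0$ and $\gamma > 0$. It can be shown that $\mathrm{E}[y] = \gamma^{-1/\alpha}\Gamma(\alpha^{-1} + 1)$, where $\Gamma(\cdot)$ is the gamma function. The standard Weibull regression model is obtained by specifying $\gamma = \exp(\mathbf{x}'\beta)$, in which case $\mathrm{E}[y|\mathbf{x}] = \exp(-\mathbf{x}'\beta/\alpha)\Gamma(\alpha^{-1} + 1)$. Given independence over $i$ the log-likelihood function is

$$N^{-1}\mathcal{L}_N(\theta) = N^{-1}\sum_i \left\{\mathbf{x}_i'\beta + \ln\alpha + (\alpha - 1)\ln y_i - \exp(\mathbf{x}_i'\beta)y_i^{\alpha}\right\}.$$

Differentiation with respect to $\beta$ and $\alpha$ leads to the first-order conditions

$$N^{-1}\sum_i \left\{1 - \exp(\mathbf{x}_i'\beta)y_i^{\alpha}\right\}\mathbf{x}_i = \mathbf{0},$$

$$N^{-1}\sum_i \left\{\frac{1}{\alpha} + \ln y_i - \exp(\mathbf{x}_i'\beta)y_i^{\alpha}\ln y_i\right\} = 0.$$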
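The Weibull first-order conditions can be checked numerically. The following sketch is an illustration, not code from the text: it assumes a model with an intercept and one regressor, simulated data, and an arbitrary trial parameter value, and compares the analytic score with central finite differences of the average log-likelihood.

```python
import math
import random

def loglik(beta, alpha, data):
    """Average Weibull log-likelihood with gamma_i = exp(b0 + b1*x_i)."""
    total = 0.0
    for x, y in data:
        xb = beta[0] + beta[1] * x
        total += xb + math.log(alpha) + (alpha - 1) * math.log(y) - math.exp(xb) * y**alpha
    return total / len(data)

def score(beta, alpha, data):
    """Analytic first derivatives with respect to (b0, b1, alpha)."""
    s0 = s1 = sa = 0.0
    for x, y in data:
        xb = beta[0] + beta[1] * x
        w = math.exp(xb) * y**alpha
        s0 += 1 - w
        s1 += (1 - w) * x
        sa += 1 / alpha + math.log(y) - w * math.log(y)
    n = len(data)
    return [s0 / n, s1 / n, sa / n]

random.seed(2)
b_true, a_true = [0.5, -1.0], 1.5
data = []
for _ in range(200):
    x = random.uniform(0.0, 1.0)
    gam = math.exp(b_true[0] + b_true[1] * x)
    # gam * y^alpha is unit exponential when y is Weibull(gam, alpha)
    y = (random.expovariate(1.0) / gam) ** (1 / a_true)
    data.append((x, y))

# Central finite differences at an arbitrary trial point
b, a, h = [0.2, -0.4], 1.2, 1e-6
numeric = [
    (loglik([b[0] + h, b[1]], a, data) - loglik([b[0] - h, b[1]], a, data)) / (2 * h),
    (loglik([b[0], b[1] + h], a, data) - loglik([b[0], b[1] - h], a, data)) / (2 * h),
    (loglik(b, a + h, data) - loglik(b, a - h, data)) / (2 * h),
]
analytic = score(b, a, data)
print(analytic, numeric)
```

Agreement between the two gradients at a point that is not the maximum confirms the algebra of the first-order conditions independently of any optimizer.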


Unlike the Poisson example, consistency essentially requires correct specification of the distribution. To see this, consider the first-order conditions for $\beta$. The informal condition (5.25) that $\mathrm{E}[\{1 - \exp(\mathbf{x}'\beta)y^{\alpha}\}\mathbf{x}] = \mathbf{0}$ requires that $\mathrm{E}[y^{\alpha}|\mathbf{x}] = \exp(-\mathbf{x}'\beta)$, where the power $\alpha$ is not restricted to be an integer. The first-order conditions for $\alpha$ lead to an even more esoteric moment condition on $y$.

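So we need to proceed on the assumption that the density is indeed Weibull with $\gamma = \exp(\mathbf{x}'\beta_0)$ and $\alpha = \alpha_0$. Proposition 5.5 can be applied as the range of $y$ does not depend on the parameters. Then, from (5.48), the Weibull MLE is asymptotically normal with asymptotic variance

$$\mathrm{V}\begin{bmatrix}\hat\beta \\ \hat\alpha\end{bmatrix} = \left(-\mathrm{E}\begin{bmatrix}\sum_i -e^{\mathbf{x}_i'\beta_0}y_i^{\alpha_0}\mathbf{x}_i\mathbf{x}_i' & \sum_i -e^{\mathbf{x}_i'\beta_0}y_i^{\alpha_0}\ln(y_i)\mathbf{x}_i \\ \sum_i -e^{\mathbf{x}_i'\beta_0}y_i^{\alpha_0}\ln(y_i)\mathbf{x}_i' & \sum_i d_i\end{bmatrix}\right)^{-1}, \tag{5.49}$$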


where $d_i = -(1/\alpha_0^2) - e^{\mathbf{x}_i'\beta_0}y_i^{\alpha_0}(\ln y_i)^2$. The matrix inverse in (5.49) needs to be obtained by partitioned inversion because the off-diagonal term $\partial^2\mathcal{L}_N(\beta,\alpha)/\partial\beta\partial\alpha$ does not have expected value zero. Simplification occurs in models with zero expected cross-derivative $\mathrm{E}[\partial^2\mathcal{L}_N(\beta,\alpha)/\partial\beta\partial\alpha'] = \mathbf{0}$, such as regression with normally distributed errors, in which case the information matrix is said to be block diagonal in $\beta$ and $\alpha$.


5.6.6.  Variance  Matrix  Estimation  for  MLE 

There are several ways to consistently estimate the variance matrix of an extremum estimator, as already noted in Section 5.5.2. For the MLE additional possibilities arise if the information matrix equality is assumed to hold. Then $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$, $-\mathbf{A}_0^{-1}$, and $\mathbf{B}_0^{-1}$ are all asymptotically equivalent, as are the corresponding consistent estimates of these quantities. A detailed discussion for the MLE is given in Davidson and MacKinnon (1993, chapter 18).

The sandwich estimate $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$ is called the Huber estimate, after Huber (1967), or White estimate, after White (1982), who considered the distribution of the MLE without imposing the information matrix equality. The sandwich estimate is in theory more robust than $-\hat{\mathbf{A}}^{-1}$ or $\hat{\mathbf{B}}^{-1}$. It is important to note, however, that the cause of failure of the information matrix equality may additionally lead to the more fundamental complication of inconsistency of $\hat\theta_{ML}$. This is the subject of Section 5.7.
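The three variance estimates can be compared in a scalar example. The sketch below is illustrative only and assumes an exponential likelihood with rate lam, for which the score is 1/lam - y and the Hessian is -1/lam^2. The helper name variance_estimates and the simulated designs are assumptions made for this example.

```python
import random

def variance_estimates(y):
    """Three estimates of the variance of sqrt(N)*(lam_hat - lam0), exponential MLE."""
    n = len(y)
    lam_hat = 1.0 / (sum(y) / n)
    scores = [1.0 / lam_hat - yi for yi in y]        # d ln f / d lam at lam_hat
    A_hat = -1.0 / lam_hat**2                         # average Hessian (same for every i)
    B_hat = sum(s * s for s in scores) / n            # average squared score
    return {
        "hessian": -1.0 / A_hat,
        "outer_product": 1.0 / B_hat,
        "sandwich": B_hat / A_hat**2,
    }

random.seed(3)
lam0, N = 2.0, 50000

# Correctly specified: data are exponential, so the IM equality holds.
correct = variance_estimates([random.expovariate(lam0) for _ in range(N)])

# Misspecified: data are uniform on [0, 1], so the three estimates diverge.
wrong = variance_estimates([random.random() for _ in range(N)])
print(correct)
print(wrong)
```

With exponential data all three estimates are close to lam0^2 = 4. With uniform data the probability limits are 4 (Hessian), 12 (outer product), and 4/3 (sandwich), which shows how far the nonrobust forms can stray when the information matrix equality fails.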

5.6.7. Derivation of ML Regularity Conditions


We  now  formally  derive  the  regularity  conditions  stated  in  Section  5.6.3.  For  notational 
simplicity  the  subscript  i  and  the  regressor  vector  are  suppressed. 

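Begin by deriving the first condition (5.41). The density integrates to one, that is,

$$\int f(y|\theta)\,dy = 1.$$

Differentiating both sides with respect to $\theta$ yields $\frac{\partial}{\partial\theta}\int f(y|\theta)\,dy = \mathbf{0}$. If the range of integration (the range of $y$) does not depend on $\theta$ this implies

$$\int \frac{\partial f(y|\theta)}{\partial\theta}\,dy = \mathbf{0}. \tag{5.50}$$

Now $\partial \ln f(y|\theta)/\partial\theta = [\partial f(y|\theta)/\partial\theta]/[f(y|\theta)]$, which implies

$$\frac{\partial f(y|\theta)}{\partial\theta} = \frac{\partial \ln f(y|\theta)}{\partial\theta}f(y|\theta). \tag{5.51}$$

Substituting (5.51) in (5.50) yields

$$\int \frac{\partial \ln f(y|\theta)}{\partial\theta}f(y|\theta)\,dy = \mathbf{0}, \tag{5.52}$$

which is (5.41) provided the expectation is with respect to the density $f(y|\theta)$.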

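Now consider the second condition (5.42), initially deriving a more general result. Suppose

$$\mathrm{E}[\mathbf{m}(y,\theta)] = \mathbf{0},$$

for some (possibly vector) function $\mathbf{m}(\cdot)$. Then when the expectation is taken with respect to the density $f(y|\theta)$

$$\int \mathbf{m}(y,\theta)f(y|\theta)\,dy = \mathbf{0}. \tag{5.53}$$

Differentiating both sides with respect to $\theta'$ and assuming differentiation and integration are interchangeable yields

$$\int \left(\frac{\partial \mathbf{m}(y,\theta)}{\partial\theta'}f(y|\theta) + \mathbf{m}(y,\theta)\frac{\partial f(y|\theta)}{\partial\theta'}\right)dy = \mathbf{0}. \tag{5.54}$$

Substituting (5.51) in (5.54) yields

$$\int \left\{\frac{\partial \mathbf{m}(y,\theta)}{\partial\theta'}f(y|\theta) + \mathbf{m}(y,\theta)\frac{\partial \ln f(y|\theta)}{\partial\theta'}f(y|\theta)\right\}dy = \mathbf{0}, \tag{5.55}$$

or

$$\mathrm{E}\left[\frac{\partial \mathbf{m}(y,\theta)}{\partial\theta'}\right] = -\mathrm{E}\left[\mathbf{m}(y,\theta)\frac{\partial \ln f(y|\theta)}{\partial\theta'}\right], \tag{5.56}$$

when the expectation is taken with respect to the density $f(y|\theta)$. The regularity condition (5.42) is the special case $\mathbf{m}(y,\theta) = \partial \ln f(y|\theta)/\partial\theta$ and leads to the IM equality (5.44). The more general result (5.56) leads to the generalized IM equality (5.45).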

What happens when integration and differentiation cannot be interchanged? The starting point (5.50) no longer holds, as by the fundamental theorem of calculus the derivative with respect to $\theta$ of $\int f(y|\theta)\,dy$ includes an additional term reflecting the presence of a function of $\theta$ in the range of the integral. Then $\mathrm{E}[\partial \ln f(y|\theta)/\partial\theta] \neq \mathbf{0}$.

What happens when the density is misspecified? Then (5.52) still holds, but it does not necessarily imply (5.41), since in (5.41) the expectation will no longer be with respect to the specified density $f(y|\theta)$.


5.7.  Quasi-Maximum  Likelihood 

The quasi-MLE $\hat\theta_{QML}$ is defined to be the estimator that maximizes a log-likelihood function that is misspecified, as the result of specification of the wrong density. Generally such misspecification leads to inconsistent estimation.

In  this  section  general  properties  of  the  quasi-MLE  are  presented,  followed  by  some 
special  cases  where  the  quasi-MLE  retains  consistency. 


5.7.1. Pseudo-True Value

In principle any misspecification of the density may lead to inconsistency, as then the expectation in evaluation of $\mathrm{E}[\partial \ln f(y|\mathbf{x},\theta)/\partial\theta|_{\theta_0}]$ (see Section 5.6.4) is no longer with respect to $f(y|\mathbf{x},\theta_0)$.

By adaptation of the general consistency proof in Section 5.3.2, the quasi-MLE $\hat\theta_{QML}$ converges in probability to the pseudo-true value $\theta^*$ defined as

$$\theta^* = \arg\max_{\theta\in\Theta}\left(\operatorname{plim} N^{-1}\mathcal{L}_N(\theta)\right). \tag{5.57}$$

The probability limit is taken with respect to the true dgp. If the true dgp differs from the assumed density $f(y|\mathbf{x},\theta)$ used to form $\mathcal{L}_N(\theta)$, then usually $\theta^* \neq \theta_0$ and the quasi-MLE is inconsistent.

Huber (1967) and White (1982) showed that the asymptotic distribution of the quasi-MLE is similar to that for the MLE, except that it is centered around $\theta^*$ and the IM equality no longer holds. Then

$$\sqrt{N}(\hat\theta_{QML} - \theta^*) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}^{*-1}\mathbf{B}^{*}\mathbf{A}^{*-1}\right], \tag{5.58}$$

where $\mathbf{A}^*$ and $\mathbf{B}^*$ are as defined in (5.18) and (5.19) except that probability limits are taken with respect to the unknown true dgp and are evaluated at $\theta^*$. Consistent estimates $\hat{\mathbf{A}}^*$ and $\hat{\mathbf{B}}^*$ can be obtained as in Section 5.5.2, with evaluation at $\hat\theta_{QML}$.

This distributional result is used for statistical inference if the quasi-MLE retains consistency. If the quasi-MLE is inconsistent then usually $\theta^*$ has no simple interpretation, aside from that given in the next section. However, (5.58) may still be useful if nonetheless there is interest in knowing the precision of estimation. The result (5.58) also provides motivation for White's information matrix test (see Section 8.2.8) and for Vuong's test for discriminating between parametric models (see Section 8.5.3).
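The pseudo-true value can be located numerically in a simple case. The sketch below is an illustration under assumed settings: an exponential density theta*exp(-theta*y) is fitted to data that are actually uniform on [0, 2], so the probability limit of the average log-likelihood is ln(theta) - theta*E[y], which is maximized at 1/E[y] = 1.

```python
import math
import random

random.seed(4)
N = 20000
y = [random.uniform(0.0, 2.0) for _ in range(N)]   # true dgp: uniform, not exponential
ybar = sum(y) / N

def avg_loglik(theta):
    """Average log-likelihood of the misspecified exponential density."""
    return math.log(theta) - theta * ybar

grid = [0.5 + 0.001 * k for k in range(1001)]       # candidate values in [0.5, 1.5]
theta_qml = max(grid, key=avg_loglik)               # quasi-MLE by grid search

theta_star = 1.0                                    # maximizer of ln(theta) - theta*E[y]
print(theta_qml, theta_star)
```

Here no theta makes the exponential density equal to the dgp, yet the quasi-MLE still settles on a well-defined limit, the pseudo-true value.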

5.7.2. Kullback-Leibler Distance

Recall from Section 4.2.3 that if $\mathrm{E}[y|\mathbf{x}] \neq \mathbf{x}'\beta_0$ then the OLS estimator can still be interpreted as the best linear predictor of $\mathrm{E}[y|\mathbf{x}]$ under squared error loss. White (1982) proposed a qualitatively similar interpretation for the quasi-MLE.

Let $f(\mathbf{y}|\theta)$ denote the assumed joint density of $y_1,\ldots,y_N$ and let $h(\mathbf{y})$ denote the true density, which is unknown, where for simplicity dependence on regressors is suppressed. Define the Kullback-Leibler information criterion (KLIC)

$$\mathrm{KLIC} = \mathrm{E}\left[\ln\left(\frac{h(\mathbf{y})}{f(\mathbf{y}|\theta)}\right)\right], \tag{5.59}$$


where expectation is with respect to $h(\mathbf{y})$. KLIC takes a minimum value of 0 when there is a $\theta_0$ such that $h(\mathbf{y}) = f(\mathbf{y}|\theta_0)$, that is, the density is correctly specified, and larger values of KLIC indicate greater ignorance about the true density.

Then the quasi-MLE $\hat\theta_{QML}$ minimizes the distance between $f(\mathbf{y}|\theta)$ and $h(\mathbf{y})$, where distance is measured using KLIC. To obtain this result, note that under suitable assumptions $\operatorname{plim} N^{-1}\mathcal{L}_N(\theta) = \mathrm{E}[\ln f(\mathbf{y}|\theta)]$, so $\hat\theta_{QML}$ converges to $\theta^*$ that maximizes $\mathrm{E}[\ln f(\mathbf{y}|\theta)]$. However, this is equivalent to minimizing KLIC, since $\mathrm{KLIC} = \mathrm{E}[\ln h(\mathbf{y})] - \mathrm{E}[\ln f(\mathbf{y}|\theta)]$ and the first term does not depend on $\theta$ as the expectation is with respect to $h(\mathbf{y})$.
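The equivalence between minimizing KLIC and maximizing the expected log-likelihood can be verified exactly for a discrete example. In this sketch (an illustration with assumed probabilities, not taken from the text) the true distribution h on {0, 1, 2} is not binomial, while the assumed density is binomial with two trials and success probability p.

```python
import math

h = {0: 0.5, 1: 0.2, 2: 0.3}        # true probabilities (not binomial)

def binom2(y, p):
    """Assumed density: binomial with 2 trials and success probability p."""
    return math.comb(2, y) * p**y * (1 - p)**(2 - y)

def klic(p):
    """E_h[ln(h(y)/f(y|p))], with the expectation taken under h."""
    return sum(hy * math.log(hy / binom2(y, p)) for y, hy in h.items())

def expected_loglik(p):
    """E_h[ln f(y|p)], the probability limit of the average log-likelihood."""
    return sum(hy * math.log(binom2(y, p)) for y, hy in h.items())

grid = [k / 1000 for k in range(1, 1000)]
p_min_klic = min(grid, key=klic)
p_max_loglik = max(grid, key=expected_loglik)
print(p_min_klic, p_max_loglik, klic(p_min_klic))
```

Both criteria select p = E_h[y]/2 = 0.4, and the minimized KLIC stays strictly positive because no binomial density reproduces h.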


5.7.3. Linear Exponential Family

In some special cases the quasi-MLE is consistent even when the density is partially misspecified. One well-known example is that the quasi-MLE for the linear regression model with normality is consistent even if the errors are nonnormal, provided $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\beta_0$. The Poisson MLE provides a second example (see Section 5.3.4).

Similar robustness to misspecification is enjoyed by other models based on densities in the linear exponential family (LEF). An LEF density can be expressed as

$$f(y|\mu) = \exp\{a(\mu) + b(y) + c(\mu)y\}, \tag{5.60}$$

where we have given the mean parameterization of the LEF, so that $\mu = \mathrm{E}[y]$. It can be shown that for this density $\mathrm{E}[y] = -[c'(\mu)]^{-1}a'(\mu)$ and $\mathrm{V}[y] = [c'(\mu)]^{-1}$, where $c'(\mu) = \partial c(\mu)/\partial\mu$ and $a'(\mu) = \partial a(\mu)/\partial\mu$. Different functions $a(\cdot)$ and $c(\cdot)$ lead to different densities in the family. The term $b(y)$ in (5.60) is a normalizing constant that ensures probabilities sum or integrate to one. The remainder of the density $\exp\{a(\mu) + c(\mu)y\}$ is an exponential function that is linear in $y$, hence explaining the term linear exponential.

Most densities cannot be expressed in this form. Several important densities are LEF densities, however, including those given in Table 5.4. These densities, already presented in Table 5.3, are reexpressed in Table 5.4 in the form (5.60). Other LEF densities are the binomial with number of trials known (the Bernoulli being a special case), some negative binomial models (the geometric and the Poisson being special cases), and the one-parameter gamma (the exponential being a special case).
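
Table 5.4. Linear Exponential Family Densities: Leading Examples

$$\begin{array}{llll}
\text{Distribution} & f(y) = \exp\{a(\cdot) + b(y) + c(\cdot)y\} & \mathrm{E}[y] & \mathrm{V}[y] = [c'(\mu)]^{-1} \\
\text{Normal } (\sigma^2 \text{ known}) & \exp\left\{-\dfrac{\mu^2}{2\sigma^2} - \dfrac{1}{2}\ln(2\pi\sigma^2) - \dfrac{y^2}{2\sigma^2} + \dfrac{\mu}{\sigma^2}y\right\} & \mu & \sigma^2 \\
\text{Bernoulli} & \exp\{\ln(1 - p) + \ln[p/(1 - p)]y\} & \mu = p & \mu(1 - \mu) \\
\text{Exponential} & \exp\{\ln\lambda - \lambda y\} & \mu = 1/\lambda & \mu^2 \\
\text{Poisson} & \exp\{-\lambda - \ln y! + y\ln\lambda\} & \mu = \lambda & \mu
\end{array}$$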
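The mean and variance formulas for the LEF can be checked against three of the leading examples. The sketch below is illustrative: it writes a(mu) and c(mu) in the mean parameterization (for the exponential, lambda = 1/mu), approximates derivatives by central finite differences, and uses an arbitrary evaluation point mu = 0.4.

```python
import math

def deriv(func, mu, h=1e-6):
    """Central finite-difference derivative."""
    return (func(mu + h) - func(mu - h)) / (2 * h)

# (a(mu), c(mu), known variance function) in the mean parameterization
lef = {
    "Bernoulli":   (lambda m: math.log(1 - m), lambda m: math.log(m / (1 - m)), lambda m: m * (1 - m)),
    "Exponential": (lambda m: -math.log(m),    lambda m: -1.0 / m,              lambda m: m**2),
    "Poisson":     (lambda m: -m,              lambda m: math.log(m),           lambda m: m),
}

mu = 0.4
results = {}
for name, (a, c, var) in lef.items():
    mean_formula = -deriv(a, mu) / deriv(c, mu)   # E[y] = -a'(mu)/c'(mu)
    var_formula = 1.0 / deriv(c, mu)              # V[y] = 1/c'(mu)
    results[name] = (mean_formula, var_formula, var(mu))
    print(name, mean_formula, var_formula, var(mu))
```

In each case the mean formula returns mu itself and the variance formula reproduces the familiar variance function of the distribution.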

For regression the parameter $\mu = \mathrm{E}[y|\mathbf{x}]$ is modeled as

$$\mu = g(\mathbf{x},\beta), \tag{5.61}$$

for specified function $g(\cdot)$ that varies across models (see Section 5.7.4) depending in part on restrictions on the range of $y$ and hence $\mu$. The LEF log-likelihood is then

$$\mathcal{L}_N(\beta) = \sum_{i=1}^{N}\left\{a(g(\mathbf{x}_i,\beta)) + b(y_i) + c(g(\mathbf{x}_i,\beta))y_i\right\}, \tag{5.62}$$

with first-order conditions that can be reexpressed, using the aforementioned information on the first two moments of $y$, as

$$\frac{\partial \mathcal{L}_N(\beta)}{\partial\beta} = \sum_{i=1}^{N}\frac{y_i - g(\mathbf{x}_i,\beta)}{\sigma_i^2}\times\frac{\partial g(\mathbf{x}_i,\beta)}{\partial\beta} = \mathbf{0}, \tag{5.63}$$

where $\sigma_i^2 = [c'(g(\mathbf{x}_i,\beta))]^{-1}$ is the assumed variance function corresponding to the particular LEF density. For example, for Bernoulli, exponential, and Poisson, $\sigma_i^2$ equals, respectively, $g_i(1 - g_i)$, $g_i^2$, and $g_i$, where $g_i = g(\mathbf{x}_i,\beta)$.

The quasi-MLE solves these equations, but it is no longer assumed that the LEF density is correctly specified. Gourieroux, Monfort, and Trognon (1984a) proved that the quasi-MLE $\hat\beta_{QML}$ is consistent provided $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x},\beta_0)$. This is clear from taking the expected value of the first-order conditions (5.63), which evaluated at $\beta = \beta_0$ are a weighted sum of errors $y - g(\mathbf{x},\beta_0)$ with expected value equal to zero if $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x},\beta_0)$.

Thus  the  quasi-MLE  based  on  an  LEF  density  is  consistent  provided  only  that  the 
conditional  mean  of  y  given  x  is  correctly  specified.  Note  that  the  actual  dgp  for  y 
need  not  be  LEF.  It  is  the  specified  density,  potentially  incorrectly  specified,  that  is 
LEF. 

Even with correct conditional mean, however, adjustment of default ML output for variance, standard errors, and t-statistics based on $-\mathbf{A}_0^{-1}$ is warranted. In general the sandwich form $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$ should be used, unless the conditional variance of $y$ given $\mathbf{x}$ is also correctly specified, in which case $\mathbf{A}_0 = -\mathbf{B}_0$. For Bernoulli models, however, $\mathbf{A}_0 = -\mathbf{B}_0$ always. Consistent standard errors can be obtained using (5.36) and (5.38).
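The need for the sandwich form can be illustrated with a Poisson quasi-MLE applied to data that are not Poisson. The sketch below is a deliberately simple illustration under assumed settings: an intercept-only model with mean exp(beta), so the quasi-MLE has a closed form, and a dgp that is exponential with mean 3 (variance 9 rather than the Poisson value 3).

```python
import math
import random

random.seed(5)
N = 40000
mu0 = 3.0
# True dgp: exponential with mean 3, so V[y] = 9 rather than the Poisson value 3.
y = [random.expovariate(1.0 / mu0) for _ in range(N)]

# Poisson quasi-MLE with mean exp(beta): first-order condition sum(y_i - exp(beta)) = 0.
mu_hat = sum(y) / N
beta_hat = math.log(mu_hat)

A_hat = -mu_hat                                        # average Hessian, -exp(beta_hat)
B_hat = sum((yi - mu_hat) ** 2 for yi in y) / N        # average squared score

var_default = (-1.0 / A_hat) / N                       # -A^{-1}/N, valid only if V[y] = mu
var_sandwich = (B_hat / A_hat**2) / N                  # A^{-1} B A^{-1}/N
print(beta_hat, math.sqrt(var_default), math.sqrt(var_sandwich))
```

The mean is estimated consistently even though the density is wrong, but the default variance understates the sandwich variance by a factor of about V[y]/E[y] = 3.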

The LEF is a very special case. In general, misspecification of any aspect of the density leads to inconsistency of the MLE. Even in the LEF case the quasi-MLE can be used only to predict the conditional mean whereas with a correctly specified density one can predict the conditional distribution.


5.7.4.  Generalized  Linear  Models 

Models based on an assumed LEF density are called generalized linear models (GLMs) in the statistics literature (see the book with this title by McCullagh and Nelder, 1989). The class of generalized linear models is the most widely used framework in applied statistics for nonlinear cross-section regression, as from Table 5.3 it includes nonlinear least squares, Poisson, geometric, probit, logit, binomial (known number of trials), gamma, and exponential regression models. We provide a short overview that introduces standard GLM terminology.

Standard GLMs specify the conditional mean $g(\mathbf{x},\beta)$ in (5.61) to be of the simpler single-index form, so that $\mu = g(\mathbf{x}'\beta)$. Then $g^{-1}(\mu) = \mathbf{x}'\beta$, and the function $g^{-1}(\cdot)$ is called the link function. For example, the usual specification for the Poisson model corresponds to the log-link function since if $\mu = \exp(\mathbf{x}'\beta)$ then $\ln\mu = \mathbf{x}'\beta$.

The first-order conditions (5.63), with $\sigma_i^2 = [c'(g_i)]^{-1}$, become $\sum_i (y_i - g_i)\,c'(g_i)\,g_i'\,\mathbf{x}_i = \mathbf{0}$, where $g_i = g(\mathbf{x}_i'\beta)$ and $g_i' = g'(\mathbf{x}_i'\beta)$. There are computational advantages in choosing the link function so that $c'(g(\mathbf{x}'\beta))\,g'(\mathbf{x}'\beta) = 1$ (or any other constant), since then these first-order conditions reduce to $\sum_i (y_i - g_i)\mathbf{x}_i = \mathbf{0}$, or the error $(y_i - g_i)$ is orthogonal to the regressors. The canonical link function is defined to be that function $g^{-1}(\cdot)$ which leads to this simplification and varies with $c(\mu)$ and hence the GLM. The canonical link function leads to $\mu = \mathbf{x}'\beta$ for normal, $\mu = \exp(\mathbf{x}'\beta)$ for Poisson, and $\mu = \exp(\mathbf{x}'\beta)/[1 + \exp(\mathbf{x}'\beta)]$ for binary data. The last of these is the logit form given earlier in Table 5.3.
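The orthogonality form of the first-order conditions under a canonical link can be confirmed numerically. The sketch below is illustrative and rests on assumed inputs (an intercept plus one regressor, arbitrary outcomes, an arbitrary trial value of beta). It compares the finite-difference gradient of the Poisson log-link and Bernoulli logit-link log-likelihoods with the residual moments sum_i (y_i - g_i) x_i.

```python
import math
import random

def poisson_loglik(b, data):
    """Poisson log-likelihood with log link, dropping the -ln(y!) term."""
    return sum(-math.exp(b[0] + b[1] * x) + y * (b[0] + b[1] * x) for x, y in data)

def logit_loglik(b, data):
    """Bernoulli log-likelihood with logit link."""
    total = 0.0
    for x, y in data:
        xb = b[0] + b[1] * x
        total += y * xb - math.log(1 + math.exp(xb))
    return total

def residual_moments(b, data, mean_func):
    """sum_i (y_i - g_i) x_i with regressor vector x_i = (1, x)."""
    r = [(y - mean_func(b[0] + b[1] * x), x) for x, y in data]
    return [sum(e for e, _ in r), sum(e * x for e, x in r)]

def numeric_grad(f, b, data, h=1e-6):
    """Central finite-difference gradient with respect to (b0, b1)."""
    g = []
    for k in range(2):
        up, dn = list(b), list(b)
        up[k] += h
        dn[k] -= h
        g.append((f(up, data) - f(dn, data)) / (2 * h))
    return g

random.seed(6)
xs = [random.uniform(-1, 1) for _ in range(50)]
counts = [(x, random.randint(0, 5)) for x in xs]     # arbitrary count outcomes
binary = [(x, random.randint(0, 1)) for x in xs]     # arbitrary 0/1 outcomes
b = [0.3, -0.7]

logistic = lambda t: 1 / (1 + math.exp(-t))
poisson_gap = [u - v for u, v in zip(numeric_grad(poisson_loglik, b, counts),
                                     residual_moments(b, counts, math.exp))]
logit_gap = [u - v for u, v in zip(numeric_grad(logit_loglik, b, binary),
                                   residual_moments(b, binary, logistic))]
print(poisson_gap, logit_gap)
```

The gaps are zero up to finite-difference error, so for these two links the score is exactly the sum of residuals times regressors, with no variance weighting left over.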

Two  times  the  difference  between  the  maximum  achievable  log-likelihood  and  the 
fitted  log-likelihood  is  called  the  deviance,  a  measure  that  generalizes  the  residual  sum 
of  squares  in  linear  regression  to  other  LEF  regression  models. 

Models based on the LEF are very restrictive as all moments depend on just one underlying parameter, $\mu = g(\mathbf{x}'\beta)$. The GLM literature places some additional structure by making the convenient assumption that the LEF variance is potentially misspecified by a scalar multiple $\alpha$, so that $\mathrm{V}[y|\mathbf{x}] = \alpha\times[c'(g(\mathbf{x},\beta))]^{-1}$, where $\alpha$ is not necessarily equal to 1. For example, for the Poisson model let $\mathrm{V}[y|\mathbf{x}] = \alpha g(\mathbf{x},\beta)$ rather than $g(\mathbf{x},\beta)$. Given such variance misspecification it can be shown that $\mathbf{B}_0 = -\alpha\mathbf{A}_0$, so the variance matrix of the quasi-MLE is $-\alpha\mathbf{A}_0^{-1}$, which requires only a rescaling of the nonsandwich ML variance matrix $-\mathbf{A}_0^{-1}$ by multiplication by $\alpha$. A commonly used consistent estimate for $\alpha$ is $\hat\alpha = (N - K)^{-1}\sum_i (y_i - \hat g_i)^2/\hat\sigma_i^2$, where $\hat g_i = g(\mathbf{x}_i,\hat\beta_{QML})$, $\hat\sigma_i^2 = [c'(\hat g_i)]^{-1}$, and division by $(N - K)$ rather than $N$ is felt to provide a better estimate in small samples. See the preceding references and Cameron and Trivedi (1986, 1998) for further details.
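The scalar variance correction can be illustrated as follows. This is a sketch under assumed settings: an intercept-only Poisson quasi-MLE (K = 1), a hypothetical helper draw_poisson that uses Knuth's multiplication method, and overdispersed counts y = 2 x Poisson(1.5), for which E[y] = 3 and V[y] = 6, so that alpha = 2.

```python
import math
import random

def draw_poisson(mean):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, prod = math.exp(-mean), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

random.seed(7)
N, K = 20000, 1
# Overdispersed counts: mean 3 and variance 6, so alpha = V[y]/E[y] = 2.
y = [2 * draw_poisson(1.5) for _ in range(N)]

g_hat = sum(y) / N                     # Poisson quasi-MLE of the mean, intercept-only model
sigma2_hat = g_hat                     # Poisson variance function [c'(g)]^{-1} = g
alpha_hat = sum((yi - g_hat) ** 2 / sigma2_hat for yi in y) / (N - K)

var_ml = (1.0 / g_hat) / N             # -A^{-1}/N for beta in g = exp(beta)
var_glm = alpha_hat * var_ml           # nonsandwich variance rescaled by alpha_hat
print(alpha_hat, var_ml, var_glm)
```

With regressors the same steps apply after replacing g_hat by the fitted means and K by the number of estimated coefficients.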

Many statistical packages include a GLM module that as a default gives standard errors that are correct provided $\mathrm{V}[y|\mathbf{x}] = \alpha[c'(g(\mathbf{x},\beta))]^{-1}$. Alternatively, one can estimate using ML, with standard errors obtained using the robust sandwich formula $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$. In practice the sandwich standard errors are similar to those obtained using the simple GLM correction. Yet another way to estimate a GLM is by weighted nonlinear least squares, as detailed at the end of Section 5.8.6.

5.7.5. Quasi-MLE for Multivariate Dependent Variables


This chapter has focused on scalar dependent variables, but the theory applies also to the multivariate case. Suppose the dependent variable $\mathbf{y}$ is an $m\times 1$ vector, and the data $(\mathbf{y}_i, \mathbf{x}_i)$, $i = 1,\ldots,N$, are independent over $i$. Examples given in later chapters include seemingly unrelated equations, panel data with $m$ observations for the $i$th individual on the same dependent variable, and clustered data where data for the $ij$th observation are correlated over $m$ possible values of $j$.

Given specification of $f(\mathbf{y}|\mathbf{x},\theta)$, the joint density of $\mathbf{y} = (y_1,\ldots,y_m)$ conditional on $\mathbf{x}$, the fully efficient MLE maximizes $N^{-1}\sum_i \ln f(\mathbf{y}_i|\mathbf{x}_i,\theta)$ as noted after (5.39). However, in multivariate applications the joint density of $\mathbf{y}$ can be complicated. A simpler estimator is possible given knowledge only of the $m$ univariate densities $f_j(y_j|\mathbf{x},\theta)$, $j = 1,\ldots,m$, where $y_j$ is the $j$th component of $\mathbf{y}$. For example, for multivariate count data one might work with $m$ independent univariate negative binomial densities for each count rather than a richer multivariate count model that permits correlation.

Consider then the quasi-MLE $\hat\theta_{QML}$ based on the product of the univariate densities, $\prod_j f_j(y_j|\mathbf{x},\theta)$, that maximizes

$$Q_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\ln f_j(y_{ij}|\mathbf{x}_i,\theta). \tag{5.64}$$


Wooldridge  (2002)  calls  this  estimator  the  partial  MLE,  since  the  density  has  been 
only  partially  specified. 

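The partial MLE is an m-estimator with $q_i = \sum_j \ln f_j(y_{ij}|\mathbf{x}_i,\theta)$. The essential consistency condition (5.25) requires that $\mathrm{E}[\sum_j \partial\ln f_j(y_{ij}|\mathbf{x}_i,\theta)/\partial\theta|_{\theta_0}] = \mathbf{0}$. This condition holds if the marginal densities $f_j(y_{ij}|\mathbf{x}_i,\theta_0)$ are correctly specified, since then $\mathrm{E}[\partial\ln f_j(y_{ij}|\mathbf{x}_i,\theta)/\partial\theta|_{\theta_0}] = \mathbf{0}$ by the regularity condition (5.41).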

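Thus the partial MLE is consistent provided the univariate densities $f_j(y_j|\mathbf{x},\theta)$ are correctly specified. Consistency does not require that $f(\mathbf{y}|\mathbf{x},\theta) = \prod_j f_j(y_j|\mathbf{x},\theta)$. Dependence of $y_1,\ldots,y_m$ will lead to failure of the information matrix equality, however, so standard errors should be computed using the sandwich form for the variance matrix with

$$\hat{\mathbf{A}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\left.\frac{\partial^2\ln f_{ij}}{\partial\theta\,\partial\theta'}\right|_{\hat\theta}, \qquad \hat{\mathbf{B}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{k=1}^{m}\left.\frac{\partial\ln f_{ij}}{\partial\theta}\right|_{\hat\theta}\left.\frac{\partial\ln f_{ik}}{\partial\theta'}\right|_{\hat\theta}, \tag{5.65}$$

where $f_{ij} = f_j(y_{ij}|\mathbf{x}_i,\theta)$. Furthermore, the partial MLE is inefficient compared to the MLE based on the joint density. Further discussion is given in Sections 6.9 and 6.10.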
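A minimal simulation shows why the sandwich form matters for the partial MLE. The sketch below is an illustration under assumed settings: m = 2 components that are each N(theta0, 1) with common correlation rho = 0.6, and a partial likelihood that treats them as independent N(theta, 1). The score for observation i is then sum_j (y_ij - theta), so the triple sum in the B matrix is the average squared within-i score sum.

```python
import random

random.seed(8)
N, m, rho, theta0 = 20000, 2, 0.6, 1.0

data = []
for _ in range(N):
    common = random.gauss(0.0, 1.0)
    row = []
    for _ in range(m):
        noise = random.gauss(0.0, 1.0)
        # Each y_ij is N(theta0, 1); any two components have correlation rho.
        row.append(theta0 + rho**0.5 * common + (1 - rho) ** 0.5 * noise)
    data.append(row)

# Partial MLE under independent N(theta, 1) marginals: the pooled sample mean.
theta_hat = sum(sum(row) for row in data) / (N * m)

A_hat = -float(m)                      # sum_j d2 ln f_ij / d theta^2 = -m for every i
B_hat = sum(sum(yij - theta_hat for yij in row) ** 2 for row in data) / N

var_naive = -1.0 / A_hat               # ignores dependence within i
var_sandwich = B_hat / A_hat**2        # A^{-1} B A^{-1}
var_true = (1 + (m - 1) * rho) / m     # variance of sqrt(N)*(theta_hat - theta0)
print(var_naive, var_sandwich, var_true)
```

The naive variance 1/m = 0.5 understates the true value 0.8, while the sandwich estimate recovers it.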


5.8.  Nonlinear  Least  Squares 

The NLS estimator is the natural extension of LS estimation for the linear model to the nonlinear model with $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x},\beta)$, where $g(\cdot)$ is nonlinear in $\beta$. The analysis and results are essentially the same as for linear least squares, with the single change that in the formulas for variance matrices the regressor vector $\mathbf{x}$ is replaced by $\partial g(\mathbf{x},\beta)/\partial\beta|_{\hat\beta}$, the derivative of the conditional mean function evaluated at $\beta = \hat\beta$.

Table 5.5. Nonlinear Least Squares: Common Examples

$$\begin{array}{ll}
\text{Model} & \text{Regression Function } g(\mathbf{x},\beta) \\
\text{Exponential} & \exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3) \\
\text{Regressor raised to power} & \beta_1 x_1 + \beta_2 x_2^{\beta_3} \\
\text{Cobb-Douglas production} & \beta_1 x_1^{\beta_2} x_2^{\beta_3} \\
\text{CES production} & [\beta_1 x_1^{\beta_3} + \beta_2 x_2^{\beta_3}]^{1/\beta_3} \\
\text{Nonlinear restrictions} & \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, \text{ where } \beta_3 = -\beta_2\beta_1
\end{array}$$

For microeconometric analysis, controlling for heteroskedastic errors may be necessary, as in the linear case. The NLS estimator and extensions that model heteroskedastic errors are generally less efficient than the MLE, but they are widely used in microeconometrics because they rely on weaker distributional assumptions.


5.8.1.  Nonlinear  Regression  Model 

The nonlinear regression model defines the scalar dependent variable $y$ to have conditional mean

$$\mathrm{E}[y_i|\mathbf{x}_i] = g(\mathbf{x}_i,\beta), \tag{5.66}$$

where $g(\cdot)$ is a specified function, $\mathbf{x}$ is a vector of explanatory variables, and $\beta$ is a $K\times 1$ vector of parameters. The linear regression model of Chapter 4 is the special case $g(\mathbf{x},\beta) = \mathbf{x}'\beta$.

Common reasons for specifying a nonlinear function for $\mathrm{E}[y|\mathbf{x}]$ include range restriction (e.g., to ensure that $\mathrm{E}[y|\mathbf{x}] > 0$) and specification of supply or demand or cost or expenditure models that satisfy restrictions from producer or consumer theory. Some commonly used nonlinear regression models are given in Table 5.5.


5.8.2.  NLS  Estimator 

The error term is defined to be the difference between the dependent variable and its conditional mean, $y_i - g(\mathbf{x}_i,\beta)$. The nonlinear least-squares estimator $\hat\beta_{NLS}$ minimizes the sum of squared residuals, $\sum_i (y_i - g(\mathbf{x}_i,\beta))^2$, or equivalently maximizes

$$Q_N(\beta) = -\frac{1}{2N}\sum_{i=1}^{N}(y_i - g(\mathbf{x}_i,\beta))^2, \tag{5.67}$$

where the scale factor 1/2 simplifies the subsequent analysis.

Differentiation leads to the NLS first-order conditions

$$\frac{\partial Q_N(\beta)}{\partial\beta} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}(y_i - g_i) = \mathbf{0}, \tag{5.68}$$

where $g_i = g(\mathbf{x}_i,\beta)$. These conditions restrict the residual $(y - g)$ to be orthogonal to $\partial g/\partial\beta$, rather than to $\mathbf{x}$ as in the linear case. There is no explicit solution for $\hat\beta_{NLS}$, which instead is computed using iterative methods (given in Chapter 10).
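To make the first-order conditions concrete, the sketch below solves them for a scalar exponential mean function. It is an illustration only: the Gauss-Newton iteration is one standard choice assumed here rather than a method prescribed in this section, and the function names, simulated design, and starting value are hypothetical.

```python
import math
import random

def nls_exponential(data, beta=0.0, tol=1e-12, max_iter=100):
    """Gauss-Newton iterations for the scalar model E[y|x] = exp(beta*x)."""
    for _ in range(max_iter):
        num = den = 0.0
        for x, y in data:
            g = math.exp(beta * x)
            d = x * g                      # dg/dbeta
            num += d * (y - g)
            den += d * d
        step = num / den                   # regress residuals on the derivative
        beta += step
        if abs(step) < tol:
            break
    return beta

def first_order_condition(beta, data):
    """(1/N) * sum_i dg_i/dbeta * (y_i - g_i), as in (5.68)."""
    return sum(x * math.exp(beta * x) * (y - math.exp(beta * x)) for x, y in data) / len(data)

random.seed(9)
beta0 = 0.8
data = []
for _ in range(2000):
    x = random.uniform(0.0, 2.0)
    data.append((x, math.exp(beta0 * x) + random.gauss(0.0, 0.5)))

beta_hat = nls_exponential(data)
print(beta_hat, first_order_condition(beta_hat, data))
```

At the solution the residuals are orthogonal to the derivative x*exp(beta*x) rather than to x itself, which is the distinction drawn in the text.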

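The nonlinear regression model can be more compactly represented in matrix notation. Stacking observations yields

$$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} g_1 \\ \vdots \\ g_N \end{bmatrix} + \begin{bmatrix} u_1 \\ \vdots \\ u_N \end{bmatrix}, \tag{5.69}$$

where $g_i = g(\mathbf{x}_i,\beta)$, or equivalently

$$\mathbf{y} = \mathbf{g} + \mathbf{u}, \tag{5.70}$$

where $\mathbf{y}$, $\mathbf{g}$, and $\mathbf{u}$ are $N\times 1$ vectors with $i$th entries of, respectively, $y_i$, $g_i$, and $u_i$. Then

$$Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'(\mathbf{y} - \mathbf{g})$$

and

$$\frac{\partial Q_N(\beta)}{\partial\beta} = \frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\beta}(\mathbf{y} - \mathbf{g}), \tag{5.71}$$

where

$$\frac{\partial\mathbf{g}'}{\partial\beta} = \begin{bmatrix} \dfrac{\partial g_1}{\partial\beta_1} & \cdots & \dfrac{\partial g_N}{\partial\beta_1} \\ \vdots & & \vdots \\ \dfrac{\partial g_1}{\partial\beta_K} & \cdots & \dfrac{\partial g_N}{\partial\beta_K} \end{bmatrix} \tag{5.72}$$

is the $K\times N$ matrix of partial derivatives of $g(\mathbf{x},\beta)'$ with respect to $\beta$.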


5.8.3.  Distribution  of  the  NLS  Estimator 

The distribution of the NLS estimator will vary with the dgp. The dgp can always be written as

$$y_i = g(\mathbf{x}_i,\beta_0) + u_i, \tag{5.73}$$

a nonlinear regression model with additive error $u$. The conditional mean is correctly specified if $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x},\beta_0)$ in the dgp. Then the error must satisfy $\mathrm{E}[u|\mathbf{x}] = 0$.

Given the NLS first-order conditions (5.68), the essential consistency condition (5.25) becomes

$$\mathrm{E}\left[\left.\partial g(\mathbf{x},\beta)/\partial\beta\right|_{\beta_0}\times(y - g(\mathbf{x},\beta_0))\right] = \mathbf{0}.$$

Equivalently, given (5.73), we need $\mathrm{E}[\partial g(\mathbf{x},\beta)/\partial\beta|_{\beta_0}\times u] = \mathbf{0}$. This holds if $\mathrm{E}[u|\mathbf{x}] = 0$, so consistency requires correct specification of the conditional mean as in the linear case. If instead $\mathrm{E}[u|\mathbf{x}] \neq 0$ then consistent estimation requires nonlinear instrumental methods (which are presented in Section 6.5).

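The limit distribution of $\sqrt{N}(\hat\beta_{NLS} - \beta_0)$ is obtained using an exact first-order Taylor series expansion of the first-order conditions (5.68). This yields

$$\sqrt{N}(\hat\beta_{NLS} - \beta_0) = -\left(\left.-\frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'} + \frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2 g_i}{\partial\beta\,\partial\beta'}(y_i - g_i)\right|_{\beta^+}\right)^{-1}\left.\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}u_i\right|_{\beta_0},$$

for some $\beta^+$ between $\hat\beta_{NLS}$ and $\beta_0$. For $\mathbf{A}_0$ in (5.18) simplification occurs because the term involving $\partial^2 g/\partial\beta\partial\beta'$ drops out since $\mathrm{E}[u|\mathbf{x}] = 0$. Thus asymptotically we need consider only

$$\sqrt{N}(\hat\beta_{NLS} - \beta_0) = \left(\left.\frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\right|_{\beta_0}\right)^{-1}\left.\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}u_i\right|_{\beta_0},$$

which is exactly the same as OLS, see Section 4.4.4, except $\mathbf{x}_i$ is replaced by $\partial g_i/\partial\beta|_{\beta_0}$.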

This  yields  the  following  proposition,  analogous  to  Proposition  4.1  for  the  OLS 
estimator. 


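Proposition 5.6 (Distribution of NLS Estimator): Make the following assumptions:

(i) The model is (5.73); that is, $y_i = g(\mathbf{x}_i,\beta_0) + u_i$.

(ii) In the dgp $\mathrm{E}[u_i|\mathbf{x}_i] = 0$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \Omega_0$, where $\Omega_{0,ij} = \sigma_{ij}$.

(iii) The mean function $g(\cdot)$ satisfies $g(\mathbf{x},\beta^{(1)}) = g(\mathbf{x},\beta^{(2)})$ iff $\beta^{(1)} = \beta^{(2)}$.

(iv) The matrix

$$\mathbf{A}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\right|_{\beta_0} = \operatorname{plim}\frac{1}{N}\left.\frac{\partial\mathbf{g}'}{\partial\beta}\frac{\partial\mathbf{g}}{\partial\beta'}\right|_{\beta_0} \tag{5.74}$$

exists and is finite nonsingular.

(v) $N^{-1/2}\sum_{i=1}^{N}\partial g_i/\partial\beta\times u_i|_{\beta_0} \xrightarrow{d} \mathcal{N}[\mathbf{0},\mathbf{B}_0]$, where

$$\mathbf{B}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{ij}\left.\frac{\partial g_i}{\partial\beta}\frac{\partial g_j}{\partial\beta'}\right|_{\beta_0} = \operatorname{plim}\frac{1}{N}\left.\frac{\partial\mathbf{g}'}{\partial\beta}\Omega_0\frac{\partial\mathbf{g}}{\partial\beta'}\right|_{\beta_0}. \tag{5.75}$$

Then the NLS estimator $\hat\beta_{NLS}$, defined to be a root of the first-order conditions $\partial Q_N(\beta)/\partial\beta = \mathbf{0}$, is consistent for $\beta_0$ and

$$\sqrt{N}(\hat\beta_{NLS} - \beta_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}\right]. \tag{5.76}$$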

Conditions (i) to (iii) imply that the regression function is correctly specified and the regressors are uncorrelated with the errors and that $\beta_0$ is identified. The errors can be heteroskedastic and correlated over $i$. Conditions (iv) and (v) assume the relevant limit results necessary for application of Theorem 5.3. For condition (v) to be satisfied some restrictions will need to be placed on the error correlation over $i$. The probability limits in (5.74) and (5.75) are with respect to the dgp for $\mathbf{X}$; they become regular limits if $\mathbf{X}$ is nonstochastic.

The matrices $\mathbf{A}_0$ and $\mathbf{B}_0$ in Proposition 5.6 are the same as the matrices $\mathbf{M}_{xx}$ and $\mathbf{M}_{x\Omega x}$ in Section 4.4.4 for the OLS estimator with $\mathbf{x}_i$ replaced by $\partial g_i/\partial\beta|_{\beta_0}$. The asymptotic theory for NLS is the same as that for OLS, with this single change.

In the special case of spherical errors, $\Omega_0 = \sigma_0^2\mathbf{I}$, so $\mathbf{B}_0 = \sigma_0^2\mathbf{A}_0$ and $\mathrm{V}[\hat{\beta}_{NLS}] = \sigma_0^2\mathbf{A}_0^{-1}$. Nonlinear least squares is then asymptotically efficient among LS estimators. However, cross-section data errors are not necessarily homoskedastic.

Given Proposition 5.6, the resulting asymptotic distribution of the NLS estimator can be expressed as

$$\hat{\beta}_{NLS} \overset{a}{\sim} \mathcal{N}\left[\beta, (\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'\Omega_0\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\right], \tag{5.77}$$

where the derivative matrix $\mathbf{D} = \partial\mathbf{g}/\partial\beta'|_{\beta_0}$ has $i$th row $\partial g_i/\partial\beta'$ (see (5.72)), for notational simplicity the evaluation at $\beta_0$ is suppressed, and we assume that an LLN applies, so that the plim operator in the definitions of $\mathbf{A}_0$ and $\mathbf{B}_0$ is replaced by $\lim \mathrm{E}$, and then drop the limit. This notation is often used in later chapters.


5.8.4.  Variance  Matrix  Estimation  for  NLS 

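We consider statistical inference for the usual microeconometrics situation of independent errors with heteroskedasticity of unknown functional form. This requires a consistent estimate of $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$ defined in Proposition 5.6.

For $\mathbf{A}_0$ defined in (5.74) it is straightforward to use the obvious estimator

$$\hat{\mathbf{A}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\bigg|_{\hat{\beta}}\frac{\partial g_i}{\partial\beta'}\bigg|_{\hat{\beta}}, \tag{5.78}$$

as $\mathbf{A}_0$ does not involve moments of the errors.

Given independence over $i$ the double sum in $\mathbf{B}_0$ defined in (5.75) simplifies to the single sum

$$\mathbf{B}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sigma_i^2\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\bigg|_{\beta_0}.$$

As for the OLS estimator (see Section 4.4.5) it is only necessary to consistently estimate the $K \times K$ matrix sum $\mathbf{B}_0$. This does not require consistent estimation of $\sigma_i^2$, the $N$ individual components in the sum.

White (1980b) gave conditions under which

$$\hat{\mathbf{B}} = \frac{1}{N}\sum_{i=1}^{N}\hat{u}_i^2\frac{\partial g_i}{\partial\beta}\bigg|_{\hat{\beta}}\frac{\partial g_i}{\partial\beta'}\bigg|_{\hat{\beta}} = \frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\beta}\bigg|_{\hat{\beta}}\hat{\Omega}\frac{\partial\mathbf{g}}{\partial\beta'}\bigg|_{\hat{\beta}} \tag{5.79}$$

is consistent for $\mathbf{B}_0$, where $\hat{u}_i = y_i - g(\mathbf{x}_i, \hat{\beta})$, $\hat{\beta}$ is consistent for $\beta_0$, and

$$\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]. \tag{5.80}$$

This leads to the following heteroskedastic-consistent estimate of the asymptotic variance matrix of the NLS estimator:

$$\hat{\mathrm{V}}[\hat{\beta}_{NLS}] = (\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Omega}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}, \tag{5.81}$$

where $\hat{\mathbf{D}} = \partial\mathbf{g}/\partial\beta'|_{\hat{\beta}}$. This equation is the same as the OLS result in Section 4.4.5, with the regressor matrix $\mathbf{X}$ replaced by $\hat{\mathbf{D}}$. In practice, a degrees of freedom correction may be used, so that $\hat{\mathbf{B}}$ in (5.79) is computed using division by $(N - K)$ rather than by $N$. Then the right-hand side in (5.81) should be multiplied by $N/(N - K)$.
Generalization to errors correlated over $i$ is given in Section 5.8.7.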
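
The robust estimate in (5.81) needs only the derivative matrix and the residuals. The following is a minimal NumPy sketch of that computation; the function name `nls_robust_variance`, the optional degrees-of-freedom flag, and the made-up data are illustrative assumptions, not part of the text.

```python
import numpy as np

def nls_robust_variance(D, resid, dof_correction=False):
    """Sandwich estimate (D'D)^{-1} D' Omega D (D'D)^{-1}, Omega = Diag[resid_i^2]."""
    D = np.asarray(D, dtype=float)
    resid = np.asarray(resid, dtype=float)
    n, k = D.shape
    bread = np.linalg.inv(D.T @ D)
    meat = (D * resid[:, None] ** 2).T @ D      # D' Diag[u_i^2] D without forming an N x N matrix
    V = bread @ meat @ bread
    if dof_correction:
        V = V * n / (n - k)                     # multiply by N/(N-K), as described in the text
    return V

# Illustrative data (hypothetical): with a linear mean, D is simply the regressor matrix X.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
u = rng.normal(size=200) * (1.0 + np.abs(X[:, 1]))   # heteroskedastic errors
V = nls_robust_variance(X, u)
robust_se = np.sqrt(np.diag(V))
```

Because $\hat{\Omega}$ is diagonal, the sketch scales the rows of the derivative matrix rather than building the $N \times N$ matrix explicitly.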


5.8.5.  Exponential  Regression  Example 

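As an example, suppose that $y$ given $\mathbf{x}$ has exponential conditional mean, so that $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\beta)$. The model can be expressed as a nonlinear regression with

$$y = \exp(\mathbf{x}'\beta) + u,$$

where the error term $u$ has $\mathrm{E}[u|\mathbf{x}] = 0$ and the error is potentially heteroskedastic. The NLS estimator has first-order conditions

$$N^{-1}\sum_{i}\left(y_i - \exp(\mathbf{x}_i'\beta)\right)\exp(\mathbf{x}_i'\beta)\mathbf{x}_i = \mathbf{0}, \tag{5.82}$$

so consistency of $\hat{\beta}_{NLS}$ requires only that the conditional mean be correctly specified with $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\beta_0)$. Here $\partial g/\partial\beta = \exp(\mathbf{x}'\beta)\mathbf{x}$, so the general NLS result (5.81) yields the heteroskedastic-robust estimate

$$\hat{\mathrm{V}}[\hat{\beta}_{NLS}] = \left(\sum_{i}e^{2\mathbf{x}_i'\hat{\beta}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i}\hat{u}_i^2e^{2\mathbf{x}_i'\hat{\beta}}\mathbf{x}_i\mathbf{x}_i'\left(\sum_{i}e^{2\mathbf{x}_i'\hat{\beta}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}, \tag{5.83}$$

where $\hat{u}_i = y_i - \exp(\mathbf{x}_i'\hat{\beta}_{NLS})$.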
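
To make the exponential example concrete, here is a hedged sketch that solves the first-order conditions (5.82) by Gauss-Newton iteration with step halving and then evaluates (5.83). The iteration scheme, the starting values, the function name `nls_exponential`, and the simulated data are assumptions made for illustration; the text does not prescribe a particular algorithm.

```python
import numpy as np

def nls_exponential(y, X, max_iter=200, tol=1e-9):
    """NLS for E[y|x] = exp(x'b): Gauss-Newton with step halving, then (5.83)."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())            # start value; assumes column 0 of X is the intercept

    def ssr(b_):
        return np.sum((y - np.exp(X @ b_)) ** 2)

    for _ in range(max_iter):
        mu = np.exp(X @ b)
        D = mu[:, None] * X            # i-th row is exp(x_i'b) x_i'
        step = np.linalg.solve(D.T @ D, D.T @ (y - mu))
        t = 1.0
        while t > 1e-8 and ssr(b + t * step) > ssr(b):
            t /= 2.0                   # shrink the step until the SSR does not rise
        b = b + t * step
        if np.max(np.abs(t * step)) < tol:
            break

    mu = np.exp(X @ b)
    D = mu[:, None] * X
    u = y - mu
    bread = np.linalg.inv(D.T @ D)
    V = bread @ (D.T @ (D * (u ** 2)[:, None])) @ bread   # equation (5.83)
    return b, V

# Hypothetical data: mean exp(x'b) with a multiplicative mean-one error, hence heteroskedastic.
rng = np.random.default_rng(42)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = np.exp(X @ beta_true) * rng.exponential(size=n)
b_hat, V_hat = nls_exponential(y, X)
se_hat = np.sqrt(np.diag(V_hat))
```

The error here is $u = \exp(\mathbf{x}'\beta)(e - 1)$ with $e$ a unit exponential draw, so $\mathrm{E}[u|\mathbf{x}] = 0$ holds while the error variance depends on $\mathbf{x}$.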


5.8.6.  Weighted  NLS  and  LGNLS 

For cross-section data the errors are often heteroskedastic. Then feasible generalized NLS that controls for the heteroskedasticity is more efficient than NLS.

Feasible generalized nonlinear least squares (FGNLS) is still generally less efficient than ML. The notable exception is that FGNLS is asymptotically equivalent to the MLE when the conditional density for $y$ is an LEF density. A special case is that FGLS is asymptotically equivalent to the MLE in the linear regression under normality.

Table 5.6. Nonlinear Least-Squares Estimators and Their Asymptotic Variance^a

Estimator   Objective Function                                                      Estimated Asymptotic Variance
NLS         $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\mathbf{u}$                               $(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Omega}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}$
FGNLS       $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\Omega(\hat{\gamma})^{-1}\mathbf{u}$      $(\hat{\mathbf{D}}'\hat{\Omega}^{-1}\hat{\mathbf{D}})^{-1}$
WNLS        $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\hat{\Sigma}^{-1}\mathbf{u}$              $(\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\mathbf{D}})^{-1}$

a Functions are for a nonlinear regression model with error $\mathbf{u} = \mathbf{y} - \mathbf{g}$ defined in (5.70) and error conditional variance matrix $\Omega$. $\hat{\mathbf{D}}$ is the derivative of the conditional mean vector with respect to $\beta'$ evaluated at $\hat{\beta}$. For FGNLS it is assumed that $\hat{\Omega}$ is consistent for $\Omega$. For NLS and WNLS the heteroskedastic robust variance matrix uses $\hat{\Omega}$ equal to a diagonal matrix with squared residuals on the diagonals, an estimate that need not be consistent for $\Omega$.


If heteroskedasticity is incorrectly modeled then the FGNLS estimator retains consistency but one should then obtain standard errors that are robust to misspecification of the model for heteroskedasticity. The analysis is very similar to that for the linear model given in Section 4.5.

Feasible Generalized Nonlinear Least Squares

The feasible generalized nonlinear least-squares estimator $\hat{\beta}_{FGNLS}$ maximizes

$$Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'\Omega(\hat{\gamma})^{-1}(\mathbf{y} - \mathbf{g}), \tag{5.84}$$

where it is assumed that $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \Omega(\gamma_0)$ and $\hat{\gamma}$ is a consistent estimate of $\gamma_0$.

If the assumptions made for the NLS estimator are satisfied and in fact $\Omega_0 = \Omega(\gamma_0)$, then the FGNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6. The variance matrix estimate is similar to that for linear FGLS, $[\mathbf{X}'\Omega(\hat{\gamma})^{-1}\mathbf{X}]^{-1}$, except that $\mathbf{X}$ is replaced by $\hat{\mathbf{D}} = \partial\mathbf{g}/\partial\beta'|_{\hat{\beta}}$.

The FGNLS estimator is the most efficient consistent estimator that minimizes quadratic loss functions of the form $(\mathbf{y} - \mathbf{g})'\mathbf{V}(\mathbf{y} - \mathbf{g})$, where $\mathbf{V}$ is a weighting matrix.

In general, implementation of FGNLS requires inversion of the $N \times N$ matrix $\Omega(\hat{\gamma})$. This may be computationally impossible for large $N$, but in practice $\Omega(\hat{\gamma})$ usually has a structure, such as diagonality, that leads to an analytical solution for the inverse.


Weighted  NLS 

The FGNLS approach is fully efficient but leads to invalid standard error estimates if the model for $\Omega_0$ is misspecified. Here we consider an approach between NLS and FGNLS that specifies a model for the variance matrix of the errors but then obtains robust standard errors. The discussion mirrors that in Section 4.5.2.

The weighted nonlinear least squares (WNLS) estimator $\hat{\beta}_{WNLS}$ maximizes

$$Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'\hat{\Sigma}^{-1}(\mathbf{y} - \mathbf{g}), \tag{5.85}$$

where $\Sigma = \Sigma(\gamma)$ is a working error variance matrix, $\hat{\Sigma} = \Sigma(\hat{\gamma})$, where $\hat{\gamma}$ is an estimate of $\gamma$, and, in a departure from FGNLS, $\Sigma \neq \Omega_0$.

Under assumptions similar to those for the NLS estimator and assuming that $\Sigma_0 = \operatorname{plim}\hat{\Sigma}$, the WNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6.

This estimator is called WNLS to distinguish it from FGNLS, which assumed that $\Sigma = \Omega_0$. The WNLS estimator hopefully lies between NLS and FGNLS in terms of efficiency, though it may be less efficient than NLS if a poor model of the error variance matrix is chosen. The NLS and OLS estimators are special cases of WNLS with $\Sigma = \sigma^2\mathbf{I}$.


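Heteroskedastic Errors

An obvious working model for heteroskedasticity is $\sigma_i^2 = \mathrm{E}[u_i^2|\mathbf{x}_i] = \exp(\mathbf{z}_i'\gamma_0)$, where the vector $\mathbf{z}$ is a specified function of $\mathbf{x}$ (such as selected subcomponents of $\mathbf{x}$) and using the exponential ensures a positive variance.

Then $\Sigma = \operatorname{Diag}[\exp(\mathbf{z}_i'\gamma)]$ and $\hat{\Sigma} = \operatorname{Diag}[\exp(\mathbf{z}_i'\hat{\gamma})]$, where $\hat{\gamma}$ can be obtained by nonlinear regression of squared NLS residuals $(y_i - g(\mathbf{x}_i, \hat{\beta}_{NLS}))^2$ on $\exp(\mathbf{z}_i'\gamma)$. Since $\hat{\Sigma}$ is diagonal, $\hat{\Sigma}^{-1} = \operatorname{Diag}[1/\hat{\sigma}_i^2]$. Then (5.85) simplifies and the WNLS estimator maximizes

$$Q_N(\beta) = -\frac{1}{2N}\sum_{i=1}^{N}\frac{(y_i - g(\mathbf{x}_i, \beta))^2}{\hat{\sigma}_i^2}. \tag{5.86}$$

The variance matrix of the WNLS estimator given in Table 5.6 yields

$$\hat{\mathrm{V}}[\hat{\beta}_{WNLS}] = \left(\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{\mathbf{d}}_i\hat{\mathbf{d}}_i'\right)^{-1}\left(\sum_{i=1}^{N}\frac{\hat{u}_i^2}{\hat{\sigma}_i^4}\hat{\mathbf{d}}_i\hat{\mathbf{d}}_i'\right)\left(\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{\mathbf{d}}_i\hat{\mathbf{d}}_i'\right)^{-1}, \tag{5.87}$$

where $\hat{\mathbf{d}}_i = \partial g(\mathbf{x}_i, \beta)/\partial\beta|_{\hat{\beta}}$ and $\hat{u}_i = y_i - g(\mathbf{x}_i, \hat{\beta}_{WNLS})$ is the residual. In practice a degrees of freedom correction may be used, so that the right-hand side of (5.87) is multiplied by $N/(N - K)$. If the stronger assumption is made that $\Sigma = \Omega_0$, then WNLS becomes FGNLS and

$$\hat{\mathrm{V}}[\hat{\beta}_{FGNLS}] = \left(\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{\mathbf{d}}_i\hat{\mathbf{d}}_i'\right)^{-1}. \tag{5.88}$$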

The WNLS and FGNLS estimators can be implemented using an NLS program. First, do NLS regression of $y_i$ on $g(\mathbf{x}_i, \beta)$. Second, obtain $\hat{\gamma}$ by, for example, NLS regression of $(y_i - g(\mathbf{x}_i, \hat{\beta}_{NLS}))^2$ on $\exp(\mathbf{z}_i'\gamma)$ if $\sigma_i^2 = \exp(\mathbf{z}_i'\gamma)$. Third, perform an NLS regression of $y_i/\hat{\sigma}_i$ on $g(\mathbf{x}_i, \beta)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2 = \exp(\mathbf{z}_i'\hat{\gamma})$. This is equivalent to maximizing (5.86). White robust sandwich standard errors from this transformed regression give robust WNLS standard errors based on (5.87). The usual nonrobust standard errors from this transformed regression give FGNLS standard errors based on (5.88).
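
The three-step procedure can be sketched as follows, assuming an exponential conditional mean $g(\mathbf{x}, \beta) = \exp(\mathbf{x}'\beta)$ so that one routine serves for both the mean and the variance regressions. The helper `nls_exp`, its Gauss-Newton iteration, the choice $\mathbf{z}_i = \mathbf{x}_i$, and the simulated data are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def nls_exp(y, X, w=None, max_iter=200, tol=1e-9):
    """Minimize sum_i w_i (y_i - exp(x_i'b))^2 by Gauss-Newton with step halving."""
    w = np.ones(len(y)) if w is None else w
    b = np.zeros(X.shape[1])
    b[0] = np.log(np.average(y, weights=w))   # assumes column 0 is the intercept

    def obj(b_):
        return np.sum(w * (y - np.exp(X @ b_)) ** 2)

    for _ in range(max_iter):
        mu = np.exp(X @ b)
        D = mu[:, None] * X
        step = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * (y - mu)))
        t = 1.0
        while t > 1e-8 and obj(b + t * step) > obj(b):
            t /= 2.0
        b = b + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return b

# Hypothetical dgp: exponential mean, error variance exp(z'gamma0).
rng = np.random.default_rng(1)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = X                                        # variance regressors z_i (assumed equal to x_i)
beta0 = np.array([1.0, 0.5])
gamma0 = np.array([-1.0, 1.0])
y = np.exp(X @ beta0) + np.sqrt(np.exp(Z @ gamma0)) * rng.normal(size=n)

# Step 1: NLS of y on g(x, b).
b_nls = nls_exp(y, X)
# Step 2: NLS of squared NLS residuals on exp(z'gamma).
u2 = (y - np.exp(X @ b_nls)) ** 2
gamma_hat = nls_exp(u2, Z)
sigma2 = np.exp(Z @ gamma_hat)
# Step 3: weighted NLS, equivalent to NLS of y/sigma on g/sigma, i.e. maximizing (5.86).
b_wnls = nls_exp(y, X, w=1.0 / sigma2)

mu = np.exp(X @ b_wnls)
D = mu[:, None] * X
u = y - mu
A = D.T @ (D / sigma2[:, None])
B = D.T @ (D * (u ** 2 / sigma2 ** 2)[:, None])
A_inv = np.linalg.inv(A)
V_wnls = A_inv @ B @ A_inv     # robust variance, as in (5.87)
V_fgnls = A_inv                # nonrobust variance, as in (5.88); needs a correct variance model
```

Passing weights $1/\hat{\sigma}_i^2$ to the least-squares objective is algebraically the same as dividing both $y_i$ and $g(\mathbf{x}_i, \beta)$ by $\hat{\sigma}_i$ before an unweighted NLS regression.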

With heteroskedastic errors it is very tempting to go one step further and attempt FGNLS using $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$. This will give inconsistent parameter estimates of $\beta_0$, however, as FGNLS regression of $y_i$ on $g(\mathbf{x}_i, \beta)$ then reduces to NLS regression of $y_i/|\hat{u}_i|$ on $g(\mathbf{x}_i, \beta)/|\hat{u}_i|$. The technique suffers from the fundamental problem of correlation between regressors and error term. Alternative semiparametric methods that enable an estimator as efficient as feasible GLS, without specifying a functional form for $\Omega_0$, are presented in Section 9.7.6.


Generalized  Linear  Models 

Implementation of the weighted NLS approach requires a reasonable specification for the working matrix. A somewhat ad hoc approach, already presented, is to let $\sigma_i^2 = \exp(\mathbf{z}_i'\gamma)$, where $\mathbf{z}$ is often a subset of $\mathbf{x}$. For example, in regression of earnings on schooling and other control variables we might model heteroskedasticity more simply as being a function of just a few of the regressors, most notably schooling.

Some types of cross-section data provide a natural model for heteroskedasticity that is very parsimonious. For example, for count data the Poisson density specifies that the variance equals the mean, so $\sigma_i^2 = g(\mathbf{x}_i, \beta)$. This provides a working model for heteroskedasticity that introduces no further parameters than those already used in modeling the conditional mean.

This approach of letting the working model for the variance be a function of the mean arises naturally for generalized linear models, introduced in Sections 5.7.3 and 5.7.4. From (5.63) the first-order conditions for the quasi-MLE based on an LEF density are of the form

$$\sum_{i=1}^{N}\frac{y_i - g(\mathbf{x}_i, \beta)}{\sigma_i^2}\times\frac{\partial g(\mathbf{x}_i, \beta)}{\partial\beta} = \mathbf{0},$$

where $\sigma_i^2 = [c'(g(\mathbf{x}_i, \beta))]^{-1}$ is the assumed variance function corresponding to the particular GLM (see (5.60)). For example, for Poisson, Bernoulli, and exponential distributions $\sigma_i^2$ equals, respectively, $g_i$, $g_i(1 - g_i)$, and $g_i^2$, where $g_i = g(\mathbf{x}_i, \beta)$ is the conditional mean.

These first-order conditions can be solved for $\beta$ in one step that allows for dependence of $\sigma_i^2$ on $\beta$. In a simpler two-step method one computes $\hat{\sigma}_i^2 = [c'(g(\mathbf{x}_i, \hat{\beta}))]^{-1}$ given an initial NLS estimate of $\beta$ and then does a weighted NLS regression of $y_i/\hat{\sigma}_i$ on $g(\mathbf{x}_i, \beta)/\hat{\sigma}_i$. The resulting estimator of $\beta$ is asymptotically equivalent to the quasi-MLE that directly solves (5.63) (see Gourieroux, Monfort, and Trognon, 1984a, or Cameron and Trivedi, 1986). Thus FGNLS is asymptotically equivalent to ML estimation when the density is an LEF density. To guard against misspecification of $\sigma_i^2$, inference is based on robust sandwich standard errors, or one lets $\hat{\sigma}_i^2 = \hat{\alpha}[c'(g(\mathbf{x}_i, \hat{\beta}))]^{-1}$, where the estimate $\hat{\alpha}$ is given in Section 5.7.4.
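
As an illustration of the one-step approach, the sketch below solves the quasi-MLE first-order conditions for the Poisson case, where $g_i = \sigma_i^2 = \exp(\mathbf{x}_i'\beta)$, by Newton iteration and then forms robust sandwich standard errors. The Newton scheme, the function name `poisson_quasi_mle`, and the simulated counts are assumptions for illustration only.

```python
import numpy as np

def poisson_quasi_mle(y, X, max_iter=50, tol=1e-10):
    """Solve sum_i (y_i - g_i)/sigma_i^2 * dg_i/db = 0 with g_i = sigma_i^2 = exp(x_i'b)."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())                  # assumes column 0 is the intercept
    for _ in range(max_iter):
        mu = np.exp(X @ b)
        score = X.T @ (y - mu)               # (y - g)/g * (g x) collapses to (y - g) x
        info = X.T @ (mu[:, None] * X)       # minus the derivative of the score
        step = np.linalg.solve(info, score)  # Newton step
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ b)
    A = X.T @ (mu[:, None] * X)
    B = X.T @ (((y - mu) ** 2)[:, None] * X)
    A_inv = np.linalg.inv(A)
    return b, A_inv @ B @ A_inv              # robust sandwich variance

# Hypothetical count data with a Poisson dgp.
rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.4])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)
b_qmle, V_qmle = poisson_quasi_mle(y, X)
se_qmle = np.sqrt(np.diag(V_qmle))
```

The sandwich variance stays valid if the variance is not in fact equal to the mean, provided the conditional mean is correctly specified.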


5.8.7. Time Series

The general NLS result in Proposition 5.6 applies to all types of data, including time-series data. The subsequent results on variance matrix estimation focused on the cross-section case of heteroskedastic errors, but they are easily adapted to the case of time-series data with serially correlated errors. Indeed, results on robust variance matrix estimation using spectral methods for the time-series case preceded those for the cross-section case.

The time-series nonlinear regression model is

$$y_t = g(\mathbf{x}_t, \beta) + u_t, \quad t = 1, \ldots, T.$$

If the error $u_t$ is serially correlated it is common to use the autoregressive moving average or ARMA($p$, $q$) model

$$u_t = \rho_1u_{t-1} + \cdots + \rho_pu_{t-p} + \varepsilon_t + \alpha_1\varepsilon_{t-1} + \cdots + \alpha_q\varepsilon_{t-q},$$

where $\varepsilon_t$ is iid with mean 0 and variance $\sigma^2$, and restrictions may be placed on ARMA model parameters to ensure stationarity and invertibility. The ARMA error model implies a particular structure to the error variance matrix $\Omega_0 = \Omega(\rho, \alpha)$.

The ARMA model provides a good model for $\Omega_0$ in the time-series case. In contrast, in the cross-section case, it is more difficult to correctly model heteroskedasticity, leading to greater emphasis on robust inference that does not require specification of a model for $\Omega_0$.

What if errors are both heteroskedastic and serially correlated? The NLS estimator is consistent though inefficient if errors are serially correlated, provided $\mathbf{x}_t$ does not include lagged dependent variables, in which case it becomes inconsistent. White and Domowitz (1984) generalized (5.79) to obtain a robust estimate of the variance matrix of the NLS estimator given heteroskedasticity and serial correlation of unknown functional form, assuming serial correlation of no more than, say, $l$ lags. In practice a minor refinement due to Newey and West (1987b) is used. This refinement is a rescaling that ensures that the variance matrix estimate is positive semidefinite. Several other refinements have also been proposed and the assumption of fixed lag length has been relaxed so that it is possible for $l \to \infty$ at a sufficiently slower rate than $N \to \infty$. This permits an AR component for the error.
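
A hedged sketch of a heteroskedasticity- and autocorrelation-robust estimate of $\mathbf{B}_0$ follows. It uses the Bartlett weights $1 - j/(l + 1)$ commonly associated with the Newey-West refinement; the text itself describes the refinement only as a rescaling, so the weighting scheme, the function name `newey_west_B`, and the AR(1) test data are assumptions here.

```python
import numpy as np

def newey_west_B(D, resid, lags):
    """Estimate of B allowing serial correlation up to `lags`, with Bartlett weights."""
    s = D * resid[:, None]                 # t-th row: d_t' u_t
    T = s.shape[0]
    B = s.T @ s / T                        # lag-0 term, the same form as (5.79)
    for j in range(1, lags + 1):
        G = s[j:].T @ s[:-j] / T           # sample cross-product at lag j
        B += (1.0 - j / (lags + 1.0)) * (G + G.T)
    return B

# Hypothetical example: AR(1) errors with coefficient 0.6.
rng = np.random.default_rng(3)
T = 500
D = np.column_stack([np.ones(T), rng.normal(size=T)])
e = rng.normal(size=T)
u = np.empty(T)
u[0] = e[0]
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + e[t]

B_white = newey_west_B(D, u, lags=0)       # heteroskedasticity-robust only
B_hac = newey_west_B(D, u, lags=4)         # also allows serial correlation up to 4 lags
```

The declining weights are what keep the estimate positive semidefinite; unweighted sums of the lagged cross-products carry no such guarantee.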


5.9.  Example:  ML  and  NLS  Estimation 

Maximum  likelihood  and  NLS  estimation,  standard  error  calculation,  and  coefficient 
interpretation  are  illustrated  using  simulation  data. 

5.9.1.  Model  and  Estimators 

The exponential distribution is used for continuous positive data, notably duration data studied in Chapter 17. The exponential density is

$$f(y) = \lambda e^{-\lambda y}, \quad y > 0, \quad \lambda > 0,$$

with mean $1/\lambda$ and variance $1/\lambda^2$. We introduce regressors into this model by setting

$$\lambda = \exp(\mathbf{x}'\beta),$$

which ensures $\lambda > 0$. Note that this implies that

$$\mathrm{E}[y|\mathbf{x}] = \exp(-\mathbf{x}'\beta).$$

An alternative parameterization instead specifies $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\beta)$, so that $\lambda = \exp(-\mathbf{x}'\beta)$. Note that the exponential is used in two different ways: for the density and for the conditional mean.

The  OLS  estimator  from  regression  of  y  on  x  is  inconsistent,  since  it  fits  a  straight 
line  when  the  regression  function  is  in  fact  an  exponential  curve. 

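The MLE is easily obtained. The log-density is $\ln f(y|\mathbf{x}) = \mathbf{x}'\beta - y\exp(\mathbf{x}'\beta)$, leading to ML first-order conditions $N^{-1}\sum_i(1 - y_i\exp(\mathbf{x}_i'\beta))\mathbf{x}_i = \mathbf{0}$, or

$$N^{-1}\sum_{i}\frac{y_i - \exp(-\mathbf{x}_i'\beta)}{\exp(-\mathbf{x}_i'\beta)}\mathbf{x}_i = \mathbf{0}.$$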

To perform NLS regression, note that the model can also be expressed as a nonlinear regression with

$$y = \exp(-\mathbf{x}'\beta) + u,$$

where the error term $u$ has $\mathrm{E}[u|\mathbf{x}] = 0$, though it is heteroskedastic. The first-order conditions for an exponential conditional mean for this model, aside from a sign reversal, have already been given in (5.82) and clearly lead to an estimator that differs from the MLE.

As an example of weighted NLS we suppose that the error variance is proportional to the mean. Then the working variance is $\mathrm{V}[y] = \mathrm{E}[y]$ and weighted least squares can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on $\exp(-\mathbf{x}_i'\beta)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2 = \exp(-\mathbf{x}_i'\hat{\beta}_{NLS})$. This estimator is less efficient than the MLE and may or may not be more efficient than NLS.

Feasible generalized NLS can be implemented here, since we know the dgp. Since $\mathrm{V}[y] = 1/\lambda^2$ for the exponential density, so the variance equals the mean squared, it follows that $\mathrm{V}[u|\mathbf{x}] = [\exp(-\mathbf{x}'\beta)]^2$. The FGNLS estimator estimates $\sigma_i^2$ by $\hat{\sigma}_i^2 = [\exp(-\mathbf{x}_i'\hat{\beta}_{NLS})]^2$ and can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on $\exp(-\mathbf{x}_i'\beta)/\hat{\sigma}_i$. In general FGNLS is less efficient than the MLE. In this example it is actually fully efficient as the exponential density is an LEF density (see the discussion at the end of Section 5.8.6).


5.9.2.  Simulation  and  Results 

For simplicity we consider regression on an intercept and a regressor. The data-generating process is

$$y|x \sim \text{exponential}[\lambda],$$

$$\lambda = \exp(\beta_1 + \beta_2x),$$

where $x \sim \mathcal{N}[1, 1^2]$ and $(\beta_1, \beta_2) = (2, -1)$. A large sample of size 10,000 was drawn to minimize differences in estimates, particularly standard errors, arising from sampling variability. For the particular sample of 10,000 drawn here the sample mean of $y$ is 0.62 and the sample standard deviation of $y$ is 1.29.
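
A sample from this dgp can be drawn in a few lines. The seed below is arbitrary and is not the one behind the reported sample, so the resulting mean and standard deviation will differ somewhat from 0.62 and 1.29.

```python
import numpy as np

rng = np.random.default_rng(12345)         # arbitrary seed (an assumption, not the authors')
n = 10_000
beta1, beta2 = 2.0, -1.0

x = rng.normal(loc=1.0, scale=1.0, size=n) # x ~ N[1, 1]
lam = np.exp(beta1 + beta2 * x)            # lambda = exp(beta1 + beta2 x)
y = rng.exponential(scale=1.0 / lam)       # NumPy parameterizes by the mean, 1/lambda

sample_mean = y.mean()
sample_sd = y.std(ddof=1)
```

Under this dgp the population mean of $y$ is $\mathrm{E}[\exp(-2 + x)] = \exp(-0.5) \approx 0.61$, which is in line with the reported sample mean.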

Table 5.7 presents OLS, ML, NLS, WNLS, and FGNLS estimates. Up to three different standard error estimates are also given. The default regression output yields nonrobust standard errors, given in parentheses. For OLS and NLS estimators these assume iid errors, an erroneous assumption here, and for the MLE these impose the IM equality, a valid assumption here since the assumed density is the dgp. The robust standard errors, given in square brackets, use the robust sandwich variance estimate $N^{-1}\hat{\mathbf{A}}_H^{-1}\hat{\mathbf{B}}_{OP}\hat{\mathbf{A}}_H^{-1}$, where $\hat{\mathbf{B}}_{OP}$ is the outer product estimate given in (5.38). These estimates are heteroskedastic consistent. For standard errors of the NLS estimator an alternative better estimate is given in braces (and is explained in the next section). The standard error estimates presented here use numerical rather than analytical derivatives in computing $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$.

Table 5.7. Exponential Example: Least-Squares and ML Estimates^a

                                  Estimator
Variable     OLS         ML          NLS         WNLS        FGNLS
Constant    -0.0093      1.9829      1.8876      1.9906      1.9840
            (0.0161)    (0.0141)    (0.0307)    (0.0225)    (0.0148)
            [0.0172]    [0.0144]    [0.1421]    [0.0359]    [0.0146]
                                    {0.2110}
x            0.6198     -0.9896     -0.9575     -0.9961     -0.9907
            (0.0113)    (0.0099)    (0.0097)    (0.0098)    (0.0100)
            [0.0254]    [0.0099]    [0.0612]    [0.0224]    [0.0101]
                                    {0.0880}
lnL          -          -208.71     -232.98     -208.93     -208.72
R2           0.2326      0.3906      0.3913      0.3902      0.3906

a All estimators are consistent, aside from OLS. Up to three alternative standard error estimates are given: nonrobust in parentheses, robust outer product in square brackets, and an alternative robust estimate for NLS in braces. The conditional dgp is an exponential distribution with intercept 2 and slope parameter -1. Sample size N = 10,000.


5.9.3.  Comparison  of  Estimates  and  Standard  Errors 

The OLS estimator is inconsistent, yielding estimates unrelated to $(\beta_1, \beta_2)$ in the exponential dgp.

The remaining estimators are consistent, and the ML, NLS, WNLS, and FGNLS estimators are within two standard errors of the true parameter values of $(2, -1)$, where the robust standard errors need to be used for NLS. The FGNLS estimates are quite close to the ML estimates, a consequence of using a dgp in the LEF.

For  the  MLE  the  nonrobust  and  robust  ML  standard  errors  are  quite  similar.  This  is 
expected  as  they  are  asymptotically  equivalent  (since  the  information  matrix  equality 
holds  if  the  MLE  is  based  on  the  true  density)  and  the  sample  size  here  is  large. 

For NLS the nonrobust standard errors are invalid, because the dgp has heteroskedastic errors, and greatly overstate the precision of the NLS estimates. The formula for the robust variance matrix estimate for NLS is given in (5.81), where $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$. An alternative that uses $\hat{\Omega} = \operatorname{Diag}[\hat{\mathrm{E}}[u_i^2]]$, where $\hat{\mathrm{E}}[u_i^2] = [\exp(-\mathbf{x}_i'\hat{\beta})]^2$, is given in braces. The two estimates differ: 0.0612 compared to 0.0880 for the slope coefficient. The difference arises because $\hat{u}_i^2 = (y_i - \exp(-\mathbf{x}_i'\hat{\beta}))^2$ differs from $[\exp(-\mathbf{x}_i'\hat{\beta})]^2$. More generally standard errors estimated using the outer product (see Section 5.5.2) can be biased even in quite large samples. NLS is considerably less efficient than MLE, with standard errors many times those of the MLE using the preferred estimates in braces.

The  WNLS  estimator  does  not  use  the  correct  model  for  heteroskedasticity,  so  the 
nonrobust  and  robust  standard  errors  again  differ.  Using  the  robust  standard  errors  the 
WNLS  estimator  is  more  efficient  than  NLS  and  less  efficient  than  the  MLE. 

In this example the FGNLS estimator is as efficient as the MLE, a consequence of the known dgp being in the LEF. The results indicate this, with coefficients and standard errors very close to those for the MLE. The robust and nonrobust standard errors for the FGNLS estimator are essentially the same, as expected since here the model for heteroskedasticity is correctly specified.

Table 5.7 also reports the estimated log-likelihood, $\ln L = \sum_i[\mathbf{x}_i'\hat{\beta} - \exp(\mathbf{x}_i'\hat{\beta})y_i]$, and an R-squared measure, $R^2 = 1 - \sum_i(y_i - \hat{y}_i)^2/\sum_i(y_i - \bar{y})^2$, where $\hat{y}_i = \exp(-\mathbf{x}_i'\hat{\beta})$, evaluated at the ML, NLS, WNLS, and FGNLS estimates. The $R^2$ differs little across models and is highest for the NLS estimator, as expected since NLS minimizes $\sum_i(y_i - \hat{y}_i)^2$. The log-likelihood is maximized by the MLE, as expected, and is considerably lower for the NLS estimator.
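
Both fit measures are simple functions of the fitted index. A minimal sketch, in which the function name `fit_measures` and the tiny input arrays are hypothetical and chosen only so that the arithmetic is easy to follow:

```python
import numpy as np

def fit_measures(y, X, b):
    """Log-likelihood and R-squared for the exponential model with lambda_i = exp(x_i'b)."""
    xb = X @ b
    lnL = np.sum(xb - np.exp(xb) * y)      # sum of log-densities x'b - y exp(x'b)
    y_hat = np.exp(-xb)                    # fitted conditional mean exp(-x'b)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return lnL, r2

# Tiny illustrative input (made up): two observations, coefficients set to zero.
y_demo = np.array([1.0, 2.0])
X_demo = np.array([[1.0, 0.0], [1.0, 0.0]])
lnL_demo, r2_demo = fit_measures(y_demo, X_demo, np.zeros(2))
```

This $R^2$ is not bounded below by zero for estimators other than NLS with an intercept-like term, so negative values are possible for poorly fitting coefficients.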

5.9.4.  Coefficient  Interpretation 

Interest lies in changes in $\mathrm{E}[y|x]$ when $x$ changes. We consider the ML estimate $\hat{\beta}_2 = -0.99$ given in Table 5.7.

The conditional mean $\exp(-\beta_1 - \beta_2x)$ is of single-index form, so that if an additional regressor $z$ with coefficient $\beta_3$ were included, then the marginal effect of a one-unit change in $z$ would be $\hat{\beta}_3/\hat{\beta}_2$ times that of a one-unit change in $x$ (see Section 5.2.4).

The conditional mean is monotonically decreasing in the index $\mathbf{x}'\beta$, so the sign of $\hat{\beta}_2$ is the reverse of the marginal effect (see Section 5.2.4). Here the marginal effect of an increase in $x$ is an increase in the conditional mean, since $\hat{\beta}_2$ is negative.

We now consider the magnitude of the marginal effect of changes in $x$ using calculus methods. Here $\partial\mathrm{E}[y|x]/\partial x = -\beta_2\exp(-\mathbf{x}'\beta)$ varies with the evaluation point $x$ and ranges from 0.01 to 19.09 in the sample. The sample-average response is $0.99N^{-1}\sum_i\exp(-\mathbf{x}_i'\hat{\beta}) = 0.61$. The response evaluated at the sample mean of $x$, $0.99\exp(-\bar{\mathbf{x}}'\hat{\beta}) = 0.37$, is considerably smaller. Since $\partial\mathrm{E}[y|x]/\partial x = -\beta_2\mathrm{E}[y|x]$, yet another estimate of the marginal effect is $0.99\bar{y} = 0.61$.

Finite-difference methods lead to a different estimated marginal effect. For $\Delta x = 1$ we obtain $\Delta\mathrm{E}[y|x] = (e^{-\beta_2} - 1)\exp(-\mathbf{x}'\beta)$ (see Section 5.2.4). This yields an average response over the sample of 1.04, rather than 0.61. The finite-difference and calculus methods coincide, however, if $\Delta x$ is small.

The preceding marginal effects are additive. For the exponential conditional mean we can also consider multiplicative or proportionate marginal effects (see Section 5.2.4). For example, a 0.1-unit change in $x$ is predicted to lead to a proportionate increase in $\mathrm{E}[y|x]$ of $0.1 \times 0.99$ or a 9.9% increase. Again a finite-difference approach will yield a different estimate.
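
The calculus and finite-difference measures can be computed side by side. The sketch below assumes the conditional mean $\exp(-\beta_1 - \beta_2x)$ and uses the ML estimates from Table 5.7 as inputs; the function name `marginal_effects`, the returned dictionary keys, and the freshly simulated regressor are illustrative assumptions, so the printed magnitudes need not match those quoted in the text.

```python
import numpy as np

def marginal_effects(x, b1, b2, dx=1.0):
    """Marginal effects of x on E[y|x] = exp(-b1 - b2 x)."""
    mean = np.exp(-(b1 + b2 * x))
    return {
        # Calculus method averaged over the sample: mean of -b2 * E[y|x_i].
        "avg_calculus": np.mean(-b2 * mean),
        # Calculus method evaluated at the sample mean of x.
        "at_mean_x": -b2 * np.exp(-(b1 + b2 * np.mean(x))),
        # Finite difference for a dx-unit change, averaged over the sample.
        "avg_finite_diff": np.mean(np.exp(-(b1 + b2 * (x + dx))) - mean),
    }

# Regressor drawn as in the dgp x ~ N[1, 1]; the seed is arbitrary.
rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=1.0, size=10_000)
effects = marginal_effects(x, b1=1.9829, b2=-0.9896)
```

Because the exponential function is convex, the sample-average calculus effect is never smaller than the effect evaluated at the sample mean of $x$, and the one-unit finite difference exceeds the average calculus effect by the factor $(e^{-\beta_2} - 1)/(-\beta_2)$.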




Which of these measures is most useful? The restriction to single-index form is very useful as the relative impact of regressors can be immediately calculated. For the magnitude of the response it is most accurate to compute the average response across the sample, using noncalculus methods, of a $c$-unit change in the regressor, where the magnitude of $c$ is a meaningful amount such as a one standard deviation change in $x$.

Similar calculations can be done for the NLS, WNLS, and FGNLS estimates, with similar results. For the OLS estimator, note that the coefficient of $x$ can be interpreted as giving the sample-average marginal effect of a change in $x$ (see Section 4.7.2). Here the OLS estimate $\hat{\beta}_2 = 0.61$ equals to two decimal places the sample-average response computed earlier using the exponential MLE. Here OLS provides a good estimate of the sample-average marginal response, even though it can provide a very poor estimate of the marginal response for any particular value of $x$.


5.10.  Practical  Considerations 

Most econometrics packages provide simple commands to obtain the maximum likelihood estimators for the standard models introduced in Section 5.6.1. For other densities many packages provide an ML routine to which the user provides the equation for the density and possibly first derivatives or even second derivatives. Similarly, for NLS one provides the equation for the conditional mean to an NLS routine. For some nonlinear models and data sets the ML and NLS routines provided in packages can encounter computational difficulties in obtaining estimates. In such circumstances it may be necessary to use more robust optimization routines provided as add-on modules to Gauss, Matlab, and OX. Gauss, Matlab, and OX are better tools for nonlinear modeling, but require a higher initial learning investment.

For cross-section data it is becoming standard to use standard errors based on the sandwich form of the variance matrix. These are often provided as a command option. For LS estimators this gives heteroskedastic-consistent standard errors. For maximum likelihood one should be aware that misspecification of the density can lead to inconsistency in addition to requiring the use of sandwich errors.

The parameters of nonlinear models are usually not directly interpretable, and it is good practice to additionally compute the implied marginal effects caused by changes in regressors (see Section 5.2.4). Some packages do this automatically; for others several lines of postestimation code using saved regression coefficients may be needed.

5.11.  Bibliographic  Notes 

A brief history of the development of asymptotic theory results for extremum estimators is given in Newey and McFadden (1994, p. 2115). A major econometrics advance was made by Amemiya (1973), who developed quite general theorems that were applied to the Tobit model MLE. Useful book-length treatments include those by Gallant (1987), Gallant and White (1987), Bierens (1993), and White (1994, 2001a). Statistical foundations are given in many books, including Amemiya (1985, Chapter 3), Davidson and MacKinnon (1993, Chapter 4), Greene (2003, appendix D), Davidson (1994), and Zaman (1996).

5.3  The  presentation  of  general  extremum  estimation  results  draws  heavily  on  Amemiya  (1985, 
Chapter  4),  and  to  a  lesser  extent  on  Newey  and  McFadden  (1994).  The  latter  reference  is 
very  comprehensive. 

5.4  The  estimating  equations  approach  is  used  in  the  generalized  linear  models  literature  (see 
McCullagh  and  Nelder,  1989).  Econometricians  subsume  this  in  generalized  method  of 
moments  (see  Chapter  6). 

5.5  Statistical  inference  is  presented  in  detail  in  Chapter  7. 

5.6  See  the  pioneering  article  by  Fisher  (1922)  for  general  results  for  ML  estimation,  including 
efficiency,  and  for  comparison  of  the  likelihood  approach  with  the  inverse-probability  or 
Bayesian  approach  and  with  method  of  moments  estimation. 

5.7 Modern applications frequently use the quasi-ML framework and sandwich estimates of the variance matrix (see White, 1982, 1994). In statistics the approach is called generalized linear models, with McCullagh and Nelder (1989) a standard reference.

5.8 Similarly for NLS estimation, sandwich estimates of the variance matrix are used that require relatively weak assumptions on the error process. The papers by White (1980a, c) had a big impact on statistical inference in econometrics. Generalization and a detailed review of the asymptotic theory are given in White and Domowitz (1984). Amemiya (1983) has extensively surveyed methods for nonlinear regression.


- Exercises - 

5-1 Suppose we obtain model estimates that yield predicted conditional mean $\mathrm{E}[y|x] = \exp(1 + 0.01x)/[1 + \exp(1 + 0.01x)]$. Suppose the sample is of size 100 and $x$ takes integer values 1, 2, ..., 100. Obtain the following estimates of the estimated marginal effect $\partial\mathrm{E}[y|x]/\partial x$.

(a)  The  average  marginal  effect  over  all  observations. 

(b)  The  marginal  effect  of  the  average  observation. 

(c)  The  marginal  effect  when  x  =  90. 

(d)  The  marginal  effect  of  a  one-unit  change  when  x=  90,  computed  using  the 
finite-difference  method. 

5-2 Consider the following special one-parameter case of the gamma distribution,
$f(y) = (y/\lambda^2)\exp(-y/\lambda)$, $y > 0$, $\lambda > 0$. For this distribution it can be shown that
$E[y] = 2\lambda$ and $V[y] = 2\lambda^2$. Here we introduce regressors and suppose that in the
true model the parameter $\lambda$ depends on regressors according to $\lambda_i = \exp(x_i'\beta)/2$.
Thus $E[y_i|x_i] = \exp(x_i'\beta)$ and $V[y_i|x_i] = [\exp(x_i'\beta)]^2/2$. Assume the data are independent
over $i$ and $x_i$ is nonstochastic and $\beta = \beta_0$ in the dgp.

(a) Show that the log-likelihood function (scaled by $N^{-1}$) for this gamma model
is $Q_N(\beta) = N^{-1}\sum_i \{\ln y_i - 2x_i'\beta + 2\ln 2 - 2y_i\exp(-x_i'\beta)\}$.

(b) Obtain plim $Q_N(\beta)$. You can assume that assumptions for any LLN used are
satisfied. [Hint: $E[\ln y_i]$ depends on $\beta_0$ but not $\beta$.]

(c) Prove that $\hat{\beta}$ that is the local maximum of $Q_N(\beta)$ is consistent for $\beta_0$. State
any assumptions made.

(d)  Now  state  what  LLN  you  would  use  to  verify  part  (b)  and  what  additional 
information,  if  any,  is  needed  to  apply  this  law.  A  brief  answer  will  do.  There 
is no need for a formal proof.

5-3 Continue with the gamma model of Exercise 5-2.

(a) Show that $\partial Q_N(\beta)/\partial\beta = N^{-1}\sum_i 2[(y_i - \exp(x_i'\beta))/\exp(x_i'\beta)]x_i$.

(b) What essential condition indicated by the first-order conditions needs to be
satisfied for $\hat{\beta}$ to be consistent?

(c) Apply a central limit theorem to obtain the limit distribution of $\sqrt{N}\,\partial Q_N/\partial\beta|_{\beta_0}$.
Here you can assume that the assumptions necessary for a CLT are satisfied.

(d) State what CLT you would use to verify part (c) and what additional information,
if any, is needed to apply this law. A brief answer will do. There is no
need for a formal proof.

(e) Obtain the probability limit of $\partial^2 Q_N/\partial\beta\partial\beta'|_{\beta_0}$.

(f) Combine the previous results to obtain the limit distribution of $\sqrt{N}(\hat{\beta} - \beta_0)$.

(g) Given part (f), state how to test $H_0: \beta_{0j} \geq \beta_j^*$ against $H_a: \beta_{0j} < \beta_j^*$ at level
0.05, where $\beta_j$ is the $j$th component of $\beta$.

5-4 A nonnegative integer variable $y$ that is geometric distributed has density (or
more formally probability mass function) $f(y) = (y + 1)(2\lambda)^y(1 + 2\lambda)^{-(y+0.5)}$,
$y = 0, 1, 2, \ldots$, $\lambda > 0$. Then $E[y] = \lambda$ and $V[y] = \lambda(1 + 2\lambda)$. Introduce regressors and
suppose $\lambda_i = \exp(x_i'\beta)$. Assume the data are independent over $i$ and $x_i$ is nonstochastic
and $\beta = \beta_0$ in the dgp.

(a)  Repeat  Exercise  5-2  for  this  model. 

(b)  Repeat  Exercise  5-3  for  this  model. 

5-5 Suppose a sample yields estimates $\hat{\theta}_1 = 5$, $\hat{\theta}_2 = 3$, se$[\hat{\theta}_1] = 2$, and se$[\hat{\theta}_2] = 1$ and
the correlation coefficient between $\hat{\theta}_1$ and $\hat{\theta}_2$ equals 0.5. Perform the following
tests at level 0.05, assuming asymptotic normality of the parameter estimates.

(a) Test $H_0: \theta_1 = 0$ against $H_a: \theta_1 \neq 0$.

(b) Test $H_0: \theta_1 = 2\theta_2$ against $H_a: \theta_1 \neq 2\theta_2$.

(c) Test $H_0: \theta_1 = 0, \theta_2 = 0$ against $H_a$: at least one of $\theta_1, \theta_2 \neq 0$.

5-6 Consider the nonlinear regression model $y = \exp(x'\beta)/[1 + \exp(x'\beta)] + u$, where
the error term is possibly heteroskedastic.

(a)  Within  what  range  does  this  restrict  E[y|x]  to  lie? 

(b)  Give  the  first-order  conditions  for  the  NLS  estimator. 

(c)  Obtain  the  asymptotic  distribution  of  the  NLS  estimator  using  result  (5.77). 

5-7 This question presumes access to software that allows NLS and ML estimation.
Consider the gamma regression model of Exercise 5-2. An appropriate gamma
variate can be generated using $y = -\lambda\ln r_1 - \lambda\ln r_2$, where $\lambda = \exp(x'\beta)/2$ and
$r_1$ and $r_2$ are random draws from Uniform[0, 1]. Let $x'\beta = \beta_1 + \beta_2 x$. Generate a
sample of size 1,000 when $\beta_1 = -1.0$ and $\beta_2 = 1$ and $x \sim \mathcal{N}[0, 1]$.

(a) Obtain estimates of $\beta_1$ and $\beta_2$ from NLS regression of $y$ on $\exp(\beta_1 + \beta_2 x)$.

(b) Should sandwich standard errors be used here?

(c) Obtain ML estimates of $\beta_1$ and $\beta_2$ for the gamma regression model with
conditional mean $\exp(\beta_1 + \beta_2 x)$.

(d) Should sandwich standard errors be used here?

CHAPTER 6


Generalized  Method  of  Moments 
and  Systems  Estimation 


6.1.  Introduction 

The  previous  chapter  focused  on  m-estimation,  including  ML  and  NLS  estimation. 
Now  we  consider  a  much  broader  class  of  extremum  estimators,  those  based  on  method 
of  moments  (MM)  and  generalized  method  of  moments  (GMM). 

The basis of MM and GMM is specification of a set of population moment conditions
involving data and unknown parameters. The MM estimator solves the sample
moment conditions that correspond to the population moment conditions. For example,
the sample mean is the MM estimator of the population mean. In some cases there
may  be  no  explicit  analytical  solution  for  the  MM  estimator,  but  numerical  solution 
may  still  be  possible.  Then  the  estimator  is  an  example  of  the  estimating  equations 
estimator  introduced  briefly  in  Section  5.4. 

In  some  situations,  however,  MM  estimation  may  be  infeasible  because  there  are 
more  moment  conditions  and  hence  equations  to  solve  than  there  are  parameters.  A 
leading  example  is  IV  estimation  in  an  overidentified  model.  The  GMM  estimator,  due 
to  Hansen  (1982),  extends  the  MM  approach  to  accommodate  this  case. 

The  GMM  estimator  defines  a  class  of  estimators,  with  different  GMM  estimators 
obtained  by  using  different  population  moment  conditions,  just  as  different  specified 
densities  lead  to  different  ML  estimators.  We  emphasize  this  moment-based  approach 
to  estimation,  even  in  cases  where  alternative  presentations  are  possible,  as  it  provides 
a  unified  approach  to  estimation  and  can  provide  an  obvious  way  to  extend  methods 
from  linear  to  nonlinear  models. 

The  basics  of  GMM  estimation  are  given  in  Sections  6.2  and  6.3,  which  present, 
respectively,  expository  examples  and  asymptotic  results  for  statistical  inference.  The 
remainder  of  the  chapter  details  more  specialized  estimators.  Instrumental  variables 
estimators  are  presented  in  Sections  6.4  and  6.5.  For  linear  models  the  treatment  in 
Sections  4.8  and  4.9  may  be  sufficient,  but  extension  to  nonlinear  models  uses  the 
GMM  approach.  Section  6.6  covers  methods  to  compute  standard  errors  of  sequential 
two-step  m-estimators.  Sections  6.7  and  6.8  present  the  minimum  distance  estimator, 
a variant of GMM, and the empirical likelihood estimator, an alternative estimator to
GMM. Systems estimation methods, used in a relatively small fraction of microeconometrics
studies, are discussed in Sections 6.9 and 6.10.

This chapter reviews many estimation methods from a GMM perspective. Applications
of these methods to actual data include a linear IV application in Section 4.9.6
and  a  linear  panel  GMM  application  in  Section  22.3. 


6.2.  Examples 

GMM  estimators  are  based  on  the  analogy  principle  (see  Section  5.4.2)  that  population 
moment  conditions  lead  to  sample  moment  conditions  that  can  be  used  to  estimate 
parameters.  This  section  provides  several  leading  applications  of  this  principle,  with 
properties  of  the  resulting  estimator  deferred  to  Section  6.3. 


6.2.1.  Linear  Regression 

A  classic  example  of  method  of  moments  is  estimation  of  the  population  mean  when 
$y$ is iid with mean $\mu$. In the population

$$E[y - \mu] = 0.$$

Replacing the expectations operator $E[\cdot]$ for the population by the average operator
$N^{-1}\sum_{i=1}^N(\cdot)$ for the sample yields the corresponding sample moment

$$\frac{1}{N}\sum_{i=1}^N (y_i - \mu) = 0.$$

Solving for $\mu$ leads to the estimator $\hat{\mu}_{MM} = N^{-1}\sum_i y_i = \bar{y}$. The MM estimate of the
population mean is the sample mean.

This approach can be extended to the linear regression model $y = x'\beta + u$, where
$x$ and $\beta$ are $K \times 1$ vectors. Suppose the error term $u$ has zero mean conditional on
regressors. The single conditional moment restriction $E[u|x] = 0$ leads to $K$ unconditional
moment conditions $E[xu] = 0$, since

$$E[xu] = E_x[E[xu|x]] = E_x[xE[u|x]] = E_x[x \cdot 0] = 0, \tag{6.1}$$

using the law of iterated expectations (see Section A.8) and the assumption that
$E[u|x] = 0$. Thus

$$E[x(y - x'\beta)] = 0,$$

if the error has conditional mean zero. The MM estimator is the solution to the corresponding
sample moment condition

$$\frac{1}{N}\sum_{i=1}^N x_i(y_i - x_i'\beta) = 0.$$

This yields $\hat{\beta}_{MM} = (\sum_i x_i x_i')^{-1}\sum_i x_i y_i$.

The OLS estimator is therefore a special case of MM estimation. The MM derivation
of the OLS estimator, however, differs significantly from the usual one of minimization
of a sum of squared residuals.
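
The equivalence of the MM and OLS estimators can be checked numerically. The following is a minimal sketch on simulated data; the sample size, coefficient values, and variable names are assumptions made for this illustration and do not come from the text. It solves the sample moment condition directly and compares the result with a least-squares fit.

```python
import numpy as np

# Simulated data; N, K and beta_true are hypothetical choices for illustration.
rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

# Sample moment condition: (1/N) sum_i x_i (y_i - x_i'b) = 0.
# It is linear in b, so it reduces to (sum_i x_i x_i') b = sum_i x_i y_i.
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares fit (minimizes the sum of squared residuals) for comparison.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The K sample moments evaluated at the MM estimate should be numerically zero.
sample_moments = X.T @ (y - X @ beta_mm) / N
print(beta_mm, beta_ols, sample_moments)
```

The two estimates agree up to floating-point error, even though one is obtained by solving moment equations and the other by minimizing a sum of squares.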


6.2.2.  Nonlinear  Regression 

For nonlinear regression the method of moments approach reduces to NLS if regression
errors are additive. For more general nonlinear regression with nonadditive errors
(defined in the following) method of moments yields a consistent estimator whereas
NLS is inconsistent.

From Section 5.8.3 the nonlinear regression model with additive error is a model
that specifies

$$y = g(x, \beta) + u.$$

A moment approach similar to that for the linear model yields that $E[u|x] = 0$ implies
that $E[h(x)(y - g(x, \beta))] = 0$, where $h(x)$ is any function of $x$. The particular choice
$h(x) = \partial g(x, \beta)/\partial\beta$, motivated in Section 6.3.7, leads to a corresponding sample moment
condition that equals the first-order conditions for the NLS estimator given in
Section 5.8.2.

The  more  general  nonlinear  regression  model  with  nonadditive  error  specifies 

$$u = r(y, x, \beta),$$

where again $E[u|x] = 0$ but now $y$ is no longer restricted to being an additive function
of $u$. For example, in Poisson regression one may define the standardized error
$u = [y - \exp(x'\beta)]/[\exp(x'\beta)]^{1/2}$ that has $E[u|x] = 0$ and $V[u|x] = 1$ since $y$ has
conditional mean and variance equal to $\exp(x'\beta)$.

The NLS estimator is inconsistent given nonadditive error. Minimizing
$N^{-1}\sum_i u_i^2 = N^{-1}\sum_i r(y_i, x_i, \beta)^2$ leads to first-order conditions

$$\frac{1}{N}\sum_{i=1}^N \frac{\partial r(y_i, x_i, \beta)}{\partial\beta}\, r(y_i, x_i, \beta) = 0.$$

Here $y_i$ appears in both terms in the product and there is no guarantee that this product
has expected value of zero even if $E[r(\cdot)|x] = 0$. This inconsistency did not arise
with additive errors $r(\cdot) = y - g(x, \beta)$, as then $\partial r(\cdot)/\partial\beta = -\partial g(x, \beta)/\partial\beta$, so only
the second term in the product depended on $y$.

A moment-based approach yields a consistent estimator. The assumption that
$E[u|x] = 0$ implies

$$E[h(x)r(y, x, \beta)] = 0,$$

where $h(x)$ is a function of $x$. If $\dim[h(x)] = K$ then the corresponding sample moment

$$\frac{1}{N}\sum_{i=1}^N h(x_i)r(y_i, x_i, \beta) = 0$$

yields a consistent estimate of $\beta$, where solution is by numerical methods.
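
As an illustration of the numerical solution, the sketch below takes the standardized Poisson error from the example above, sets $h(x) = x$, and solves the $K$ sample moment equations by Newton iterations. The simulated design, the choice $h(x) = x$, the starting value, and the function names are assumptions made for this example; the text does not prescribe a particular solver.

```python
import numpy as np

# Simulated Poisson data; N and beta_true are hypothetical values for illustration.
rng = np.random.default_rng(1)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

def sample_moment(beta):
    """(1/N) sum_i h(x_i) r(y_i, x_i, beta) with h(x) = x and the
    standardized Poisson error r = (y - exp(x'b)) / exp(x'b)^(1/2)."""
    eta = X @ beta
    r = (y - np.exp(eta)) / np.exp(eta / 2)
    return X.T @ r / N

def moment_jacobian(beta):
    """Derivative of the sample moment vector with respect to beta'."""
    eta = X @ beta
    dr = -0.5 * np.exp(-eta / 2) * (y + np.exp(eta))  # dr/d(x'b)
    return (X * dr[:, None]).T @ X / N

# Newton iterations on the K = 2 moment equations, starting from zero.
beta_hat = np.zeros(2)
for _ in range(50):
    step = np.linalg.solve(moment_jacobian(beta_hat), sample_moment(beta_hat))
    beta_hat = beta_hat - step
    if np.max(np.abs(step)) < 1e-12:
        break

print(beta_hat, sample_moment(beta_hat))
```

Because the number of moment conditions equals the number of parameters, the sample moments can be driven to zero exactly (up to numerical tolerance).
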

6.2.3. Maximum Likelihood

The Kullback-Leibler information criterion was defined in Section 5.7.2. From
this definition, a local maximum of KLIC occurs if $E[s(\theta)] = 0$, where
$s(\theta) = \partial\ln f(y|x, \theta)/\partial\theta$ and $f(y|x, \theta)$ is the conditional density.

Replacing population moments by sample moments yields an estimator $\hat{\theta}$ that
solves $N^{-1}\sum_i s_i(\theta) = 0$. These are the ML first-order conditions, so the MLE can
be motivated as an MM estimator.


6.2.4.  Additional  Moment  Restrictions 

Using additional moments can improve the efficiency of estimation but requires adaptation
of regular method of moments if there are more moment conditions than parameters
to estimate.

A simple example of an inefficient estimator is the sample mean. This is an inefficient
estimator of the population mean unless the data are a random sample from the
normal distribution or some other member of the exponential family of distributions.
One way to improve efficiency is to use alternative estimators. The sample median,
consistent for $\mu$ if the distribution is symmetric, may be more efficient. Obviously the
MLE  could  be  used  if  the  distribution  is  fully  specified,  but  here  we  instead  improve 
efficiency  by  using  additional  moment  restrictions. 

Consider estimation of $\beta$ in the linear regression model. The OLS estimator is inefficient
even assuming homoskedastic errors, unless errors are normally distributed.
From Section 6.2.1, the OLS estimator is an MM estimator based on $E[xu] = 0$. Now
make the additional moment assumption that errors are conditionally symmetric, so
that $E[u^3|x] = 0$ and hence $E[xu^3] = 0$. Then estimation of $\beta$ may be based on the
$2K$ moment conditions

$$\begin{bmatrix} E[x(y - x'\beta)] \\ E[x(y - x'\beta)^3] \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

The MM estimator would attempt to estimate $\beta$ as the solution to the corresponding
sample moment conditions $N^{-1}\sum_i x_i(y_i - x_i'\beta) = 0$ and $N^{-1}\sum_i x_i(y_i - x_i'\beta)^3 = 0$.
However, with $2K$ equations and only $K$ unknown parameters $\beta$, it is not possible for
all of these sample moment conditions to be satisfied.

The GMM estimator instead sets the sample moments as close to zero as possible
using quadratic loss. Then $\hat{\beta}_{GMM}$ minimizes

$$Q_N(\beta) = \begin{bmatrix} \frac{1}{N}\sum_i x_i u_i \\ \frac{1}{N}\sum_i x_i u_i^3 \end{bmatrix}' W_N \begin{bmatrix} \frac{1}{N}\sum_i x_i u_i \\ \frac{1}{N}\sum_i x_i u_i^3 \end{bmatrix}, \tag{6.2}$$

where $u_i = y_i - x_i'\beta$ and $W_N$ is a $2K \times 2K$ weighting matrix. For some choices
of $W_N$ this estimator is more efficient than OLS. This example is analyzed in Section 6.3.6.

6.2.5. Instrumental Variables Regression

Instrumental variables estimation is a leading example of generalized method of moments
estimation.

Consider the linear regression model $y = x'\beta + u$, with the complication that some
components of $x$ are correlated with the error term so that OLS is inconsistent for $\beta$.
Assume the existence of instruments $z$ (introduced in Section 4.8) that are correlated
with $x$ but satisfy $E[u|z] = 0$. Then $E[y - x'\beta|z] = 0$. Using algebra similar to that
used to obtain (6.1) for the OLS example, we multiply by $z$ to get the $K$ unconditional
population moment conditions

$$E[z(y - x'\beta)] = 0. \tag{6.3}$$

The method of moments estimator solves the corresponding sample moment condition

$$\frac{1}{N}\sum_{i=1}^N z_i(y_i - x_i'\beta) = 0.$$

If $\dim(z) = K$ this yields $\hat{\beta}_{MM} = (\sum_i z_i x_i')^{-1}\sum_i z_i y_i$, which is the linear IV estimator
introduced in Section 4.8.6.

No unique solution exists if there are more potential instruments than regressors,
since then $\dim(z) > K$ and there are more equations than unknowns. One possibility
is to use just $K$ instruments, but there is then an efficiency loss. The GMM estimator
instead chooses $\hat{\beta}$ to make the vector $N^{-1}\sum_i z_i(y_i - x_i'\hat{\beta})$ as small as possible using
quadratic loss, so that $\hat{\beta}_{GMM}$ minimizes

$$Q_N(\beta) = \left[\frac{1}{N}\sum_{i=1}^N z_i(y_i - x_i'\beta)\right]' W_N \left[\frac{1}{N}\sum_{i=1}^N z_i(y_i - x_i'\beta)\right], \tag{6.4}$$

where $W_N$ is a $\dim(z) \times \dim(z)$ weighting matrix. The 2SLS estimator (see Section 4.8.6)
corresponds to a particular choice of $W_N$.

Instrumental variables methods for linear models are presented in considerable detail
in Section 6.4. An advantage of the GMM approach is that it provides a way to
specify the optimal choice of weighting matrix $W_N$, leading to an estimator more efficient
than 2SLS.
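
For the linear model the quadratic objective can be minimized in closed form. The sketch below is an illustration on simulated data, with all numeric values and names chosen for the example. It relies on two standard results that are not stated in this passage: the first-order conditions of the quadratic form give $\hat{\beta} = (X'ZW_NZ'X)^{-1}X'ZW_NZ'y$, and 2SLS corresponds to $W_N = (N^{-1}Z'Z)^{-1}$.

```python
import numpy as np

# Simulated overidentified IV design; all numeric values are hypothetical.
rng = np.random.default_rng(2)
N = 2000
z = rng.normal(size=(N, 2))                   # two excluded instruments
v = rng.normal(size=N)
u = 0.8 * v + rng.normal(size=N)              # error correlated with x through v
x = z @ np.array([1.0, 0.5]) + v              # endogenous regressor
X = np.column_stack([np.ones(N), x])          # K = 2 regressors
Z = np.column_stack([np.ones(N), z])          # dim(z) = 3 instruments
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + u

def linear_gmm(y, X, Z, W):
    """Minimizer of [Z'(y - Xb)/N]' W [Z'(y - Xb)/N], available in closed form."""
    ZX = Z.T @ X / len(y)
    Zy = Z.T @ y / len(y)
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

beta_gmm_identity = linear_gmm(y, X, Z, np.eye(Z.shape[1]))
beta_gmm_2sls = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z / N))

# Two-stage computation of 2SLS for comparison.
X_fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls, *_ = np.linalg.lstsq(X_fitted, y, rcond=None)

# OLS ignores the endogeneity and is shown only as a benchmark.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gmm_identity, beta_gmm_2sls, beta_2sls, beta_ols)
```

Different weighting matrices give different estimates in this overidentified design, while the GMM estimate with the 2SLS weighting matrix reproduces the two-stage computation.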

Section 6.5 covers IV methods for nonlinear models. One advantage of the GMM
approach is that generalization to nonlinear regression is straightforward. Then we
simply replace $y - x'\beta$ in the preceding expression for $Q_N(\beta)$ by the nonlinear model
error $u = y - g(x, \beta)$ or $u = r(y, x, \beta)$.


6.2.6.  Panel  Data 

Another  leading  application  of  GMM  and  related  estimation  methods  is  to  panel  data 
regression. 

As an example, suppose $y_{it} = x_{it}'\beta + u_{it}$, where $i$ denotes individual and $t$ denotes
time. From Section 6.2.1, pooled OLS regression of $y_{it}$ on $x_{it}$ is an MM estimator
based on the condition $E[x_{it}u_{it}] = 0$. Suppose it is additionally assumed that the error
$u_{it}$ is uncorrelated with regressors in periods other than the current period. Then
$E[x_{is}u_{it}] = 0$ for $s \neq t$ provides additional moment conditions that can be used to obtain
more efficient estimators.

Chapters  22  and  23  provide  many  applications  of  GMM  methods  to  panel  data. 


6.2.7.  Moment  Conditions  from  Economic  Theory 

Economic  theory  can  generate  moment  conditions  that  can  be  used  as  the  basis  for 
estimation. 

Begin  with  the  model 

$$y_t = E[y_t|x_t, \beta] + u_t,$$

where the first term on the right-hand side measures the "anticipated" component of
$y$ conditional on $x$ and the second component measures the "unanticipated" component.
As examples, $y$ may denote return on an asset or the rate of inflation. Under the
twin assumptions of rational expectations and market clearing or market efficiency,
we may obtain the result that the unanticipated component is unpredictable using any
information that was available at time $t$ for determining $E[y|x]$. Then

$$E[(y_t - E[y_t|x_t, \beta])|\mathcal{I}_t] = 0,$$

where $\mathcal{I}_t$ denotes information available at time $t$.

By the law of iterated expectations, $E[z_t(y_t - E[y_t|x_t, \beta])] = 0$, where $z_t$ is formed
from any subset of $\mathcal{I}_t$. Since any part of the information set can be used as an instrument,
this provides many moment conditions that can be the basis of estimation. If
time-series data are available then GMM minimizes the quadratic form

$$Q_T(\beta) = \left[\frac{1}{T}\sum_{t=1}^T z_t u_t\right]' W_T \left[\frac{1}{T}\sum_{t=1}^T z_t u_t\right],$$

where $u_t = y_t - E[y_t|x_t, \beta]$. If cross-section data are available at a single time point $t$
then GMM minimizes the quadratic form

$$Q_N(\beta) = \left[\frac{1}{N}\sum_{i=1}^N z_i u_i\right]' W_N \left[\frac{1}{N}\sum_{i=1}^N z_i u_i\right],$$

where $u_i = y_i - E[y_i|x_i, \beta]$ and the subscript $t$ can be dropped as only one time period
is analyzed.

This approach is not restricted to the additive structure used in motivation. All
that is needed is an error $u_t$ with the property that $E[u_t|\mathcal{I}_t] = 0$. Such conditions
arise from the Euler conditions from intertemporal models of decision making under
certainty. For example, Hansen and Singleton (1982) present a model of maximization
of expected lifetime utility that leads to the Euler condition $E[u_t|\mathcal{I}_t] = 0$,
where $u_t = \beta g_{t+1}^{\alpha} r_{t+1} - 1$, $g_{t+1} = c_{t+1}/c_t$ is the ratio of consumption in two periods,
and $r_{t+1}$ is asset return. The parameters $\beta$ and $\alpha$, the intertemporal discount rate and
the coefficient of relative risk aversion, respectively, can be estimated by GMM using
either time-series or cross-section data as was done previously, with this new definition
of $u_t$. Hansen (1982) and Hansen and Singleton (1982) consider time-series data;
MaCurdy (1983) modeled both consumption and labor supply using panel data.

Table 6.1. Generalized Method of Moments: Examples

Moment Function h(·)                            Estimation Method
$y - \mu$                                       Method of moments for population mean
$x(y - x'\beta)$                                Ordinary least-squares regression
$z(y - x'\beta)$                                Instrumental variables regression
$\partial\ln f(y|x, \theta)/\partial\theta$     Maximum likelihood estimation

6.3.  Generalized  Method  of  Moments 

This  section  presents  the  general  theory  of  GMM  estimation.  Generalized  method  of 
moments defines a class of estimators. Different choices of moment condition and
weighting matrix lead to different GMM estimators, just as different choices of distribution
lead to different ML estimators. We address these issues, in addition to presenting
the usual properties of consistency and asymptotic normality and methods to
estimate  the  variance  matrix  of  the  GMM  estimator. 

6.3.1.  Method  of  Moments  Estimator 

The starting point is to assume the existence of $r$ moment conditions for $q$ parameters,

$$E[h(w_i, \theta_0)] = 0, \tag{6.5}$$

where $\theta$ is a $q \times 1$ vector, $h(\cdot)$ is an $r \times 1$ vector function with $r \geq q$, and $\theta_0$ denotes
the value of $\theta$ in the dgp. The vector $w$ includes all observables including, where
relevant, a dependent variable $y$, potentially endogenous regressors $x$, and instrumental
variables $z$. The dependent variable $y$ may be a vector, so that applications with systems
of equations or with panel data are subsumed. The expectation is with respect to all
stochastic components of $w$ and hence $y$, $x$, and $z$.

The choice of functional form for $h(\cdot)$ is qualitatively similar to the choice of model
and will vary with application. Table 6.1 summarizes some single-equation examples
of $h(w) = h(y, x, z, \theta)$ already presented in Section 6.2.

If $r = q$ then method of moments can be applied. Equality to zero of the population
moment is replaced by equality to zero of the corresponding sample moment, and the
method of moments estimator $\hat{\theta}_{MM}$ is defined to be the solution to

$$\frac{1}{N}\sum_{i=1}^N h(w_i, \hat{\theta}) = 0. \tag{6.6}$$

This is an estimating equations estimator that equivalently minimizes

$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^N h(w_i, \theta)\right]' \left[\frac{1}{N}\sum_{i=1}^N h(w_i, \theta)\right],$$

with asymptotic distribution presented in Section 5.4 and reproduced in (6.13) in Section 6.3.3.

6.3.2. GMM Estimator


The GMM estimator is based on $r$ independent moment conditions (6.5) while $q$ parameters
are estimated.

If $r = q$ the model is said to be just-identified and the MM estimator in (6.6) can be
used. More formally $r = q$ is only a necessary condition for just-identification and we
additionally require that $G_0$ in Proposition 5.1 is of rank $q$. Identification is addressed
in Section 6.3.9.

If $r > q$ the model is said to be overidentified and (6.6) has no solution for $\hat{\theta}$ as
there are more equations ($r$) than unknowns ($q$). Instead, $\hat{\theta}$ is chosen so that a quadratic
form in $N^{-1}\sum_i h(w_i, \hat{\theta})$ is as close to zero as possible. Specifically, the generalized
method of moments estimator $\hat{\theta}_{GMM}$ minimizes the objective function

$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^N h(w_i, \theta)\right]' W_N \left[\frac{1}{N}\sum_{i=1}^N h(w_i, \theta)\right], \tag{6.7}$$

where the $r \times r$ weighting matrix $W_N$ is symmetric positive definite, possibly stochastic
with finite probability limit, and does not depend on $\theta$. The subscript $N$ on $W_N$ is
used to indicate that its value may depend on the sample. The dimension $r$ of $W_N$,
however, is fixed as $N \to \infty$. The objective function can also be expressed in matrix
notation as $Q_N(\theta) = N^{-1}l'H(\theta) \times W_N \times N^{-1}H(\theta)'l$, where $l$ is an $N \times 1$ vector of
ones and $H(\theta)$ is an $N \times r$ matrix with $i$th row $h(y_i, x_i, \theta)'$.

Different choices of weighting matrix $W_N$ lead to different estimators that, although
consistent, have different variances if $r > q$. A simple choice, though often a poor
choice, is to let $W_N$ be the identity matrix. Then $Q_N(\theta) = \bar{h}_1^2 + \bar{h}_2^2 + \cdots + \bar{h}_r^2$ is the
sum of $r$ squared sample averages, where $\bar{h}_j = N^{-1}\sum_i h_j(w_i, \theta)$ and $h_j(\cdot)$ is the $j$th
component of $h(\cdot)$. The optimal choice of $W_N$ is given in Section 6.3.5.
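
The objective function is simple to evaluate once the $N \times r$ matrix of moment contributions is available. The sketch below uses randomly generated moment contributions purely as placeholders (the function name `gmm_objective` and all numeric values are assumptions for this illustration) and checks the two equivalent expressions given above.

```python
import numpy as np

def gmm_objective(H, W):
    """Q_N = hbar' W hbar, where hbar is the r-vector of sample averages of
    the N x r matrix H whose ith row is h(w_i, theta)'."""
    hbar = H.mean(axis=0)
    return float(hbar @ W @ hbar)

# Hypothetical moment contributions: N = 200 observations, r = 3 moments.
rng = np.random.default_rng(3)
H = rng.normal(loc=0.1, size=(200, 3))
N, r = H.shape

# Identity weighting: Q_N is the sum of the r squared sample averages.
q_identity = gmm_objective(H, np.eye(r))
sum_of_squares = float(np.sum(H.mean(axis=0) ** 2))

# Matrix notation: Q_N = (l'H/N) W (H'l/N), with l an N x 1 vector of ones.
ones = np.ones(N)
W = np.diag([1.0, 2.0, 0.5])  # an arbitrary symmetric positive definite choice
q_matrix_form = float((ones @ H / N) @ W @ (H.T @ ones / N))
print(q_identity, sum_of_squares, gmm_objective(H, W), q_matrix_form)
```

In an application `H` would be recomputed at each trial value of $\theta$ inside a numerical optimizer.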

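Differentiating $Q_N(\theta)$ in (6.7) with respect to $\theta$ yields the GMM first-order
conditions

$$\left[\frac{1}{N}\sum_{i=1}^N \frac{\partial h_i(\theta)'}{\partial\theta}\bigg|_{\hat{\theta}}\right] \times W_N \times \left[\frac{1}{N}\sum_{i=1}^N h_i(\hat{\theta})\right] = 0, \tag{6.8}$$

where $h_i(\theta) = h(w_i, \theta)$ and we have multiplied by the scaling factor 1/2. These equations
will generally be nonlinear in $\hat{\theta}$ and can be quite complicated to solve as $\hat{\theta}$ may
appear in both the first and third terms. Numerical solution methods are presented in
Chapter 10.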


6.3.3.  Distribution  of  GMM  Estimator 

The  asymptotic  distribution  of  the  GMM  estimator  is  given  in  the  following  proposi¬ 
tion,  derived  in  Section  6.3.9. 

Proposition 6.1 (Distribution of GMM Estimator): Make the following assumptions:

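(i) The dgp imposes the moment condition (6.5); that is, $E[h(w_i, \theta_0)] = 0$.

(ii) The $r \times 1$ vector function $h(\cdot)$ satisfies $h(w, \theta^{(1)}) = h(w, \theta^{(2)})$ iff $\theta^{(1)} = \theta^{(2)}$.

(iii) The following $r \times q$ matrix exists and is finite with rank $q$:

$$G_0 = \text{plim}\, \frac{1}{N}\sum_{i=1}^N \frac{\partial h_i}{\partial\theta'}\bigg|_{\theta_0}. \tag{6.9}$$

(iv) $W_N \xrightarrow{p} W_0$, where $W_0$ is finite symmetric positive definite.

(v) $N^{-1/2}\sum_{i=1}^N h_i|_{\theta_0} \xrightarrow{d} \mathcal{N}[0, S_0]$, where

$$S_0 = \text{plim}\, N^{-1}\sum_{i=1}^N\sum_{j=1}^N E[h_i h_j'|_{\theta_0}]. \tag{6.10}$$

Then the GMM estimator $\hat{\theta}_{GMM}$, defined to be a root of the first-order conditions
$\partial Q_N(\theta)/\partial\theta = 0$ given in (6.8), is consistent for $\theta_0$ and

$$\sqrt{N}(\hat{\theta}_{GMM} - \theta_0) \xrightarrow{d} \mathcal{N}[0, (G_0'W_0G_0)^{-1}(G_0'W_0S_0W_0G_0)(G_0'W_0G_0)^{-1}]. \tag{6.11}$$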

Some  leading  specializations  are  the  following. 

First, in microeconometric analysis data are usually assumed to be independent over
$i$, so (6.10) simplifies to

$$S_0 = \text{plim}\, \frac{1}{N}\sum_{i=1}^N E[h_i h_i'|_{\theta_0}]. \tag{6.12}$$

If additionally the data are assumed to be identically distributed then (6.9) and
(6.10) simplify to $G_0 = E[\partial h/\partial\theta'|_{\theta_0}]$ and $S_0 = E[hh'|_{\theta_0}]$, a notation used by many
authors.

Second, in the just-identified case that $r = q$, the situation for many estimators
including ML and LS, the results simplify to those already presented in Section 5.4 for
the estimating equations estimator. To see this note that when $r = q$ the matrices $G_0$,
$W_0$, and $S_0$ are square matrices that are invertible, so $(G_0'W_0G_0)^{-1} = G_0^{-1}W_0^{-1}(G_0')^{-1}$
and the variance matrix in (6.11) simplifies. It follows that, for the MM estimator in
(6.6),

$$\sqrt{N}(\hat{\theta}_{MM} - \theta_0) \xrightarrow{d} \mathcal{N}[0, G_0^{-1}S_0(G_0')^{-1}]. \tag{6.13}$$

An MM estimator can always be computed as a GMM estimator and will be invariant
to the choice of full rank weighting matrix.

Third, the best choice of matrix $W_N$ is one such that $W_0 = S_0^{-1}$. Then the variance
matrix in (6.11) simplifies to $(G_0'S_0^{-1}G_0)^{-1}$. This is expanded on in Section 6.3.5.
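
The simplification in the third case can be verified by direct substitution of $W_0 = S_0^{-1}$ into the variance matrix in (6.11). The following display spells out the intermediate steps, using only the symbols defined in Proposition 6.1:

```latex
\begin{aligned}
(G_0'W_0G_0)^{-1}(G_0'W_0S_0W_0G_0)(G_0'W_0G_0)^{-1}\Big|_{W_0 = S_0^{-1}}
&= (G_0'S_0^{-1}G_0)^{-1}(G_0'S_0^{-1}S_0S_0^{-1}G_0)(G_0'S_0^{-1}G_0)^{-1} \\
&= (G_0'S_0^{-1}G_0)^{-1}(G_0'S_0^{-1}G_0)(G_0'S_0^{-1}G_0)^{-1} \\
&= (G_0'S_0^{-1}G_0)^{-1}.
\end{aligned}
```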

6.3.4.  Variance  Matrix  Estimation 

Statistical inference for the GMM estimator is possible given consistent estimates $\hat{G}$
of $G_0$, $W_N$ of $W_0$, and $\hat{S}$ of $S_0$ in (6.11). Consistent estimates are easily obtained under
relatively weak distributional assumptions.

For $G_0$ the obvious estimator is

$$\hat{G} = \frac{1}{N}\sum_{i=1}^N \frac{\partial h_i}{\partial\theta'}\bigg|_{\hat{\theta}}. \tag{6.14}$$

For $W_0$ the sample weighting matrix $W_N$ is used. The estimator for the $r \times r$ matrix $S_0$
varies with the stochastic assumptions made about the dgp. Microeconometric analysis
usually assumes independence over $i$, so that $S_0$ is of the simpler form (6.12). An
obvious estimator is then

$$\hat{S} = \frac{1}{N}\sum_{i=1}^N h_i(\hat{\theta})h_i(\hat{\theta})'. \tag{6.15}$$

Since $h(\cdot)$ is $r \times 1$, there are at most a finite number of $r(r+1)/2$ unique entries in $S_0$
to be estimated. So $\hat{S}$ is consistent as $N \to \infty$ without need to parameterize the
variance $E[h_i h_i']$, assumed to exist, to depend on fewer parameters. All that is required
are some mild additional assumptions to ensure that plim $N^{-1}\sum_i \hat{h}_i\hat{h}_i'$ =
plim $N^{-1}\sum_i h_i h_i'$. For example, if $\hat{h}_i = x_i\hat{u}_i$, where $\hat{u}_i$ is the OLS residual, we know
from Section 4.4 that existence of fourth moments of the regressors needs to be
assumed.

Combining these results, we have that the GMM estimator is asymptotically normally
distributed with mean $\theta_0$ and estimated asymptotic variance

$$\hat{V}[\hat{\theta}_{GMM}] = \frac{1}{N}(\hat{G}'W_N\hat{G})^{-1}\hat{G}'W_N\hat{S}W_N\hat{G}(\hat{G}'W_N\hat{G})^{-1}. \tag{6.16}$$

This variance matrix estimator is a robust estimator that is an extension of the Eicker-White
heteroskedastic-consistent estimator for least-squares estimators.

One can also take expectations and use $\hat{G}_E = N^{-1}\sum_i E[\partial h_i/\partial\theta']|_{\hat{\theta}}$ for $G_0$ and
$\hat{S}_E = N^{-1}\sum_i E[h_i h_i']|_{\hat{\theta}}$ for $S_0$. However, this usually requires additional distributional
assumptions to take the expectation, and the variance matrix estimate will not
be as robust to distributional misspecification.
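
To make the link with the Eicker-White estimator concrete, the sketch below computes the estimated variance in (6.16) for OLS viewed as a just-identified GMM estimator with $h_i = x_i(y_i - x_i'\beta)$, so that $\partial h_i/\partial\beta' = -x_i x_i'$, and compares it with the usual heteroskedasticity-consistent formula. The simulated data and the function name `gmm_variance` are assumptions made for this illustration.

```python
import numpy as np

def gmm_variance(G, W, S, N):
    """Estimated asymptotic variance (1/N) (G'WG)^-1 G'W S W G (G'WG)^-1."""
    bread = np.linalg.inv(G.T @ W @ G)
    meat = G.T @ W @ S @ W @ G
    return bread @ meat @ bread / N

# Hypothetical heteroskedastic regression data for illustration.
rng = np.random.default_rng(4)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
u = rng.normal(size=N) * (1.0 + 0.5 * np.abs(X[:, 1]))
y = X @ np.array([1.0, 2.0]) + u

# OLS as just-identified GMM with h_i = x_i (y_i - x_i'b).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
H = X * resid[:, None]            # ith row is h_i(beta_hat)'
G_hat = -(X.T @ X) / N            # average of dh_i/db' = -x_i x_i'
S_hat = H.T @ H / N               # average of h_i h_i'
V_gmm = gmm_variance(G_hat, np.eye(2), S_hat, N)

# Eicker-White heteroskedasticity-consistent estimator for comparison.
XtX_inv = np.linalg.inv(X.T @ X)
V_white = XtX_inv @ (X.T @ (X * resid[:, None] ** 2)) @ XtX_inv
print(np.sqrt(np.diag(V_gmm)), np.sqrt(np.diag(V_white)))
```

In this just-identified case the weighting matrix drops out, so any full rank choice in place of the identity gives the same variance estimate.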

In the time-series case $h_t$ is subscripted by time $t$, and asymptotic theory is based
on the number of time periods $T \to \infty$. For time-series data, with $h_t$ a vector
MA($q$) process, the usual estimator of $V[\hat{\theta}_{GMM}]$ is one proposed by Newey and
West (1987b) that uses (6.16) with $\hat{S} = \hat{\Omega}_0 + \sum_{j=1}^q (1 - \frac{j}{q+1})(\hat{\Omega}_j + \hat{\Omega}_j')$, where
$\hat{\Omega}_j = T^{-1}\sum_{t=j+1}^T \hat{h}_t\hat{h}_{t-j}'$. This permits time-series correlation in $h_t$ in addition to contemporaneous
correlation. Further details on covariance matrix estimation, including improvements
in the time-series case, are given in Davidson and MacKinnon (1993, Section 17.5),
Hamilton (1994), and Haan and Levin (1997).
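
A short sketch of the Newey-West calculation follows. It assumes the standard form of the estimator, in which the lag-$j$ sample autocovariance of the fitted moments is given the weight $1 - j/(q+1)$; the function name `newey_west_S` and the simulated moving-average series are illustrative assumptions.

```python
import numpy as np

def newey_west_S(H, q):
    """Bartlett-weighted estimate of S from a T x r array H whose tth row is
    the fitted moment vector h_t', allowing autocorrelation up to lag q."""
    T = H.shape[0]
    S = H.T @ H / T                                # Omega_0
    for j in range(1, q + 1):
        omega_j = H[j:].T @ H[:-j] / T             # (1/T) sum_{t=j+1}^T h_t h_{t-j}'
        S += (1.0 - j / (q + 1)) * (omega_j + omega_j.T)
    return S

# Hypothetical MA(1) moment series for illustration: h_t = e_t + 0.5 e_{t-1}.
rng = np.random.default_rng(5)
e = rng.normal(size=(501, 2))
H = e[1:] + 0.5 * e[:-1]

S_iid = newey_west_S(H, 0)    # ignores serial correlation, as in (6.15)
S_nw = newey_west_S(H, 1)     # adds the weighted first-order autocovariances
print(S_iid, S_nw, sep="\n")
```

With `q = 0` the function reduces to the estimator used under independence, and the declining weights keep the estimate positive semidefinite.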


6.3.5.  Optimal  Weighting  Matrix 

Application of GMM requires specification of the moment function $h(\cdot)$ and the weighting matrix $W_N$ in (6.7).

The easy part is choosing $W_N$ to obtain the GMM estimator with the smallest asymptotic variance given a specified function $h(\cdot)$. This is often called optimal GMM even though it is a limited form of optimality, since a poor choice of $h(\cdot)$ could still lead to a very inefficient estimator.

For just-identified models the same estimator (the MM estimator) is obtained for any full-rank weighting matrix, so one might just as well set $W_N = I_q$.

For overidentified models with $r > q$, and $S_0$ known, the most efficient GMM estimator is obtained by choosing the weighting matrix $W_N = S_0^{-1}$. Then the variance matrix given in the proposition simplifies and

$$\sqrt{N}(\hat{\theta}_{GMM} - \theta_0) \xrightarrow{d} \mathcal{N}\left[0, (G_0' S_0^{-1} G_0)^{-1}\right], \tag{6.17}$$

a  result  due  to  Hansen  (1982). 
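The simplification can be checked in one line. Assuming the general limit variance has the sandwich form $(G_0'W_0G_0)^{-1}G_0'W_0S_0W_0G_0(G_0'W_0G_0)^{-1}$, the population analogue of (6.16), substituting $W_0 = S_0^{-1}$ gives:

```latex
\begin{aligned}
(G_0' S_0^{-1} G_0)^{-1} \, G_0' S_0^{-1} S_0 S_0^{-1} G_0 \, (G_0' S_0^{-1} G_0)^{-1}
  &= (G_0' S_0^{-1} G_0)^{-1} \, (G_0' S_0^{-1} G_0) \, (G_0' S_0^{-1} G_0)^{-1} \\
  &= (G_0' S_0^{-1} G_0)^{-1}.
\end{aligned}
```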

This result can be obtained using matrix arguments similar to those that establish that GLS is the most efficient WLS estimator in the linear model. Even more simply, one can work directly with the objective function. For LS estimators that minimize the quadratic form $u'Wu$ the most efficient estimator is GLS that sets $W = \Sigma^{-1} = V[u]^{-1}$. The GMM objective function in (6.7) is of this quadratic form with $u = N^{-1} \sum_i h_i(\theta)$ and so the optimal $W = (V[N^{-1} \sum_i h_i(\theta)])^{-1} = S_0^{-1}$. The optimal GMM estimator weights by the inverse of the variance matrix of the sample moment conditions.


Optimal  GMM 

In practice $S_0$ is unknown and we let $W_N = \hat{S}^{-1}$, where $\hat{S}$ is consistent for $S_0$. The optimal GMM estimator can be obtained using a two-step procedure. At the first step a GMM estimator is obtained using a suboptimal choice of $W_N$, such as $W_N = I_r$ for simplicity. From this first step, form the estimate $\hat{S}$ using (6.15). At the second step perform optimal GMM estimation with the optimal weighting matrix $W_N = \hat{S}^{-1}$.

Then the optimal GMM estimator or two-step GMM estimator $\hat{\theta}_{OGMM}$ based on $h_i(\theta)$ minimizes

$$Q_N(\theta) = \left[\frac{1}{N} \sum_{i=1}^{N} h_i(\theta)\right]' \hat{S}^{-1} \left[\frac{1}{N} \sum_{i=1}^{N} h_i(\theta)\right]. \tag{6.18}$$


The limit distribution is given in (6.17). The optimal GMM estimator is asymptotically normally distributed with mean $\theta_0$ and estimated asymptotic variance with the relatively simple formula

$$\hat{V}[\hat{\theta}_{OGMM}] = N^{-1} (\hat{G}' \tilde{S}^{-1} \hat{G})^{-1}. \tag{6.19}$$

Usually evaluation of $\hat{G}$ and $\tilde{S}$ is at $\hat{\theta}_{OGMM}$, so $\tilde{S}$ uses the same formula as $\hat{S}$ except that evaluation is at $\hat{\theta}_{OGMM}$. An alternative is to continue to evaluate (6.19) at the first-step estimator, as any consistent estimate of $\theta_0$ can be used.
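The two-step procedure can be sketched on a deliberately simple problem: estimating a scalar location parameter $\mu$ from the two moment conditions $E[y - \mu] = 0$ and $E[(y - \mu)^3] = 0$. The simulated Laplace data, the Gauss-Newton minimizer, and all function names are assumptions made for illustration; any numerical optimizer could replace the iteration.

```python
import numpy as np

def h_matrix(y, mu):
    """N x 2 array of moment contributions h_i(mu) = [u_i, u_i^3], u_i = y_i - mu."""
    u = y - mu
    return np.column_stack([u, u**3])

def G_matrix(y, mu):
    """2 x 1 sample Jacobian N^{-1} sum_i dh_i/dmu."""
    u = y - mu
    return np.array([[-1.0], [-3.0 * np.mean(u**2)]])

def minimize_gmm(y, W, mu_start, max_iter=100, tol=1e-10):
    """Gauss-Newton iterations on Q_N(mu) = g'Wg, g being the mean of h_i(mu)."""
    mu = mu_start
    for _ in range(max_iter):
        g = h_matrix(y, mu).mean(axis=0)
        G = G_matrix(y, mu)
        step = np.linalg.solve(G.T @ W @ G, G.T @ W @ g)[0]
        mu = mu - step
        if abs(step) < tol:
            break
    return mu

rng = np.random.default_rng(0)
y = rng.laplace(loc=1.0, scale=1.0, size=5000)    # simulated data, true mu = 1
N = len(y)

# Step 1: suboptimal GMM with W_N = I.
mu_1 = minimize_gmm(y, np.eye(2), mu_start=y.mean())
# Form S_hat = N^{-1} sum_i h_i h_i' at the first-step estimate.
h_1 = h_matrix(y, mu_1)
S_hat = h_1.T @ h_1 / N
# Step 2: optimal GMM with W_N = S_hat^{-1}.
mu_2 = minimize_gmm(y, np.linalg.inv(S_hat), mu_start=mu_1)

# Estimated variance as in (6.19), with G and S evaluated at the second step.
h_2 = h_matrix(y, mu_2)
S_tilde = h_2.T @ h_2 / N
G_hat = G_matrix(y, mu_2)
var_mu_2 = (np.linalg.inv(G_hat.T @ np.linalg.inv(S_tilde) @ G_hat) / N)[0, 0]
print(mu_1, mu_2, np.sqrt(var_mu_2))
```

The first-step estimate is only used to build the weighting matrix, so any consistent starting estimator would serve the same purpose.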

Remarkably, the optimal GMM estimator in (6.18) requires no additional stochastic assumptions beyond those needed to permit use of (6.16) to estimate the variance matrix of suboptimal GMM. In both cases $\hat{S}$ needs to be consistent for $S_0$ and from the discussion after (6.15) this requires few additional assumptions. This stands in stark contrast to the additional assumptions needed for GLS to be more efficient than OLS when errors are heteroskedastic. Heteroskedasticity in the errors will affect the optimal choice of $h_i(\theta)$, however (see Section 6.3.7).

Small-Sample Bias of Two-Step GMM

Theory suggests that for overidentified models it is best to use optimal GMM. In implementation, however, the theoretical optimal weighting matrix $W_N = S_0^{-1}$ needs to be replaced by a consistent estimate $\hat{S}^{-1}$. This replacement makes no difference asymptotically, but it will make a difference in finite samples. In particular, individual observations that increase $h_i(\theta)$ in (6.18) are likely to increase $\hat{S} = N^{-1} \sum_i \hat{h}_i \hat{h}_i'$ in (6.18), leading to correlation between $N^{-1} \sum_i h_i(\theta)$ and $\hat{S}$. Note that $S_0 = \text{plim}\, N^{-1} \sum_i h_i h_i'$ is not similarly affected because the probability limit is taken.

Altonji and Segal (1996) demonstrated this problem in estimation of covariance structure models using panel data (see Section 22.5). They used the related minimum distance estimator (see Section 6.7) but in the literature their results are interpreted as being relevant to GMM estimation with cross-section data or short panels. In simulations the optimal estimator was more efficient than a one-step estimator, as expected. However, the optimal estimator had finite-sample bias so large that its root mean-squared error was much larger than that for the one-step estimator.

Altonji and Segal (1996) also proposed a variant, an independently weighted optimal estimator that forms the weighting matrix using observations other than those used to construct the sample moments. They split the sample into $G$ groups, with $G = 2$ an obvious choice, and minimize

$$Q_N(\theta) = \frac{1}{G} \sum_{g} h_g(\theta)' \hat{S}_{(-g)}^{-1} h_g(\theta), \tag{6.20}$$

where $h_g(\theta)$ is computed for the $g$th group and $\hat{S}_{(-g)}$ is computed using all but the $g$th group. This estimator is less biased, since the weighting matrix $\hat{S}_{(-g)}^{-1}$ is by construction independent of $h_g(\theta)$. However, splitting the sample leads to efficiency loss. Horowitz (1998a) instead used the bootstrap (see Section 11.6.4).
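A minimal sketch of the split-sample objective in (6.20) follows, assuming the moment contributions at a candidate $\theta$ are stacked in an $N \times r$ array and that groups are formed from consecutive rows. Both choices, and the function name, are illustrative assumptions rather than details taken from Altonji and Segal.

```python
import numpy as np

def independently_weighted_objective(h, n_groups=2):
    """Average over groups of h_g' S_(-g)^{-1} h_g.

    h is an N x r array of h_i(theta) at a candidate theta. The sample
    moments h_g use group g only; the weights S_(-g) use all other rows.
    """
    N = h.shape[0]
    total = 0.0
    for idx in np.array_split(np.arange(N), n_groups):
        others = np.ones(N, dtype=bool)
        others[idx] = False
        h_g = h[idx].mean(axis=0)
        h_rest = h[others]
        S_minus_g = h_rest.T @ h_rest / h_rest.shape[0]
        total += h_g @ np.linalg.solve(S_minus_g, h_g)
    return total / n_groups

# Tiny hand-checkable input with r = 1: groups are rows {0, 1} and {2, 3}.
h = np.array([[1.0], [3.0], [2.0], [2.0]])
Q = independently_weighted_objective(h)   # (4/4 + 4/5) / 2
```

In an estimation routine this function would be evaluated repeatedly inside a minimizer over $\theta$, with `h` recomputed at each candidate value.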

In the Altonji and Segal (1996) example $h_i$ involves second moments, so $\hat{S}$ involves fourth moments. Finite-sample problems for the optimal estimator may not be as significant in other examples where $h_i$ involves only first moments. Nonetheless, Altonji and Segal's results do suggest caution in using optimal GMM and that differences between one-step GMM and optimal GMM estimates may indicate problems of finite-sample bias in optimal GMM.

Number  of  Moment  Restrictions 

In general adding further moment restrictions improves asymptotic efficiency, as it reduces the limit variance $(G_0' S_0^{-1} G_0)^{-1}$ of the optimal GMM estimator or at worst leaves it unchanged.

The benefits of adding further moment conditions vary with the application. For example, if the estimator is the MLE then there is no gain since the MLE is already fully efficient. The literature has focused on IV estimation where gains may be considerable because the variable being instrumented may be much more highly correlated with a combination of many instruments than with a single instrument.

There is a limit, however, as the number of moment restrictions cannot exceed the number of observations. Moreover, adding more moment conditions increases the likelihood of finite-sample bias and related problems similar to those of weak instruments in linear models (see Section 4.9). Stock et al. (2002) briefly consider weak instruments in nonlinear models.


6.3.6.  Regression  with  Symmetric  Error  Example 


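To demonstrate the GMM asymptotic results we return to the additional moment restrictions example introduced in Section 6.2.4. For this example the objective function for $\hat{\beta}_{GMM}$ has already been given in (6.2). All that is required is specification of $W_N$, such as $W_N = I$.

To obtain the distribution of this estimator we use the general notation of Section 6.3. The function $h(\cdot)$ in (6.5) specializes to

$$h(y, x, \beta) = \begin{bmatrix} x(y - x'\beta) \\ x(y - x'\beta)^3 \end{bmatrix} \quad\Rightarrow\quad \frac{\partial h(y, x, \beta)}{\partial \beta'} = \begin{bmatrix} -xx' \\ -3xx'(y - x'\beta)^2 \end{bmatrix}.$$

These expressions lead directly to expressions for $G_0$ and $S_0$ using (6.9) and (6.12), so that (6.14) and (6.15) then yield consistent estimates

$$\hat{G} = \begin{bmatrix} -\frac{1}{N} \sum_i x_i x_i' \\ -\frac{1}{N} \sum_i 3 \hat{u}_i^2 x_i x_i' \end{bmatrix} \tag{6.21}$$

and

$$\hat{S} = \begin{bmatrix} \frac{1}{N} \sum_i \hat{u}_i^2 x_i x_i' & \frac{1}{N} \sum_i \hat{u}_i^4 x_i x_i' \\ \frac{1}{N} \sum_i \hat{u}_i^4 x_i x_i' & \frac{1}{N} \sum_i \hat{u}_i^6 x_i x_i' \end{bmatrix}, \tag{6.22}$$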


where $\hat{u}_i = y_i - x_i'\hat{\beta}$. Alternative estimates can be obtained by first evaluating the expectations in $G_0$ and $S_0$, but this will require assumptions on $E[u^2|x]$, $E[u^4|x]$, and $E[u^6|x]$. Substituting $\hat{G}$, $\hat{S}$, and $W_N$ into (6.16) gives the estimated asymptotic variance matrix for $\hat{\beta}_{GMM}$.

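Now consider GMM with an optimal weighting matrix. This again minimizes (6.2), but from (6.18) now $W_N = \hat{S}^{-1}$, where $\hat{S}$ is defined in (6.22). Computation of $\hat{S}$ requires first-step consistent estimates $\hat{\beta}$. An obvious choice is GMM with $W_N = I$. In this example the OLS estimator is also consistent and could instead be used. Using (6.19) gives this two-step estimator an estimated asymptotic variance matrix $\hat{V}[\hat{\beta}_{OGMM}]$ equal to

$$\left( \begin{bmatrix} \sum_i x_i x_i' \\ \sum_i 3 \tilde{u}_i^2 x_i x_i' \end{bmatrix}' \begin{bmatrix} \sum_i \tilde{u}_i^2 x_i x_i' & \sum_i \tilde{u}_i^4 x_i x_i' \\ \sum_i \tilde{u}_i^4 x_i x_i' & \sum_i \tilde{u}_i^6 x_i x_i' \end{bmatrix}^{-1} \begin{bmatrix} \sum_i x_i x_i' \\ \sum_i 3 \tilde{u}_i^2 x_i x_i' \end{bmatrix} \right)^{-1},$$

where $\tilde{u}_i = y_i - x_i'\hat{\beta}_{OGMM}$ and the various divisions by $N$ have canceled out.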

Analytical results for the efficiency gain of optimal GMM in this example are easily obtained by specialization to the nonregression case where $y$ is iid with mean $\mu$. Furthermore, assume that $y$ is Laplace distributed with scale parameter equal to unity, in which case the density is $f(y) = (1/2) \times \exp\{-|y - \mu|\}$ with $E[y] = \mu$, $V[y] = 2$, and higher central moments $E[(y - \mu)^r]$ equal to zero for $r$ odd and equal to $r!$ for $r$ even. The sample median is fully efficient as it is the MLE, and it can be shown to have asymptotic variance $1/N$. The sample mean $\bar{y}$ is inefficient with variance $V[\bar{y}] = V[y]/N = 2/N$. The optimal GMM estimator $\hat{\mu}_{OGMM}$ based on the two moment conditions $E[(y - \mu)] = 0$ and $E[(y - \mu)^3] = 0$ has a weighting matrix that places much less weight on the second moment condition, because it has relatively high variance, and has negative off-diagonal entries. The optimal GMM estimator $\hat{\mu}_{OGMM}$ can be shown to have asymptotic variance $1.7143/N$ (see Exercise 6.3). It is therefore more efficient than the sample mean (variance $2/N$), though it is still considerably less efficient than the sample median.

For this example the identity matrix is an exceptionally poor choice of weighting matrix. It places too much weight on the second moment condition, yielding a suboptimal GMM estimator of $\mu$ with asymptotic variance $19.14/N$ that is many times greater than even $V[\bar{y}] = 2/N$. For details see Exercise 6.3.
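The two asymptotic variances quoted for the Laplace example can be reproduced from the population moments. The sketch below assumes only the stated facts that the even central moments equal $r!$ and that $h = [u, u^3]'$ with $u = y - \mu$; the variable names are illustrative.

```python
import numpy as np
from math import factorial

# Even central moments of the unit-scale Laplace distribution: E[u^r] = r!.
m2, m4, m6 = factorial(2), factorial(4), factorial(6)      # 2, 24, 720

G0 = np.array([[-1.0], [-3.0 * m2]])                       # E[dh/dmu]
S0 = np.array([[m2, m4], [m4, m6]], dtype=float)           # E[h h']
W_optimal = np.linalg.inv(S0)                              # negative off-diagonal entries

# Optimal weighting: (G0' S0^{-1} G0)^{-1}.
avar_optimal = float(np.linalg.inv(G0.T @ W_optimal @ G0)[0, 0])

# Identity weighting: (G0'G0)^{-1} G0' S0 G0 (G0'G0)^{-1}.
bread = np.linalg.inv(G0.T @ G0)
avar_identity = float((bread @ G0.T @ S0 @ G0 @ bread)[0, 0])

print(round(avar_optimal, 4), round(avar_identity, 3))     # 1.7143 19.145
```

The exact values are $12/7$ and $26210/1369$, which match the figures $1.7143$ and $19.14$ in the text up to rounding.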


6.3.7.  Optimal  Moment  Condition 


Section 6.3.5 gives the surprising result that optimal GMM requires essentially no more assumptions than does GMM without an optimal weighting matrix. However, this optimality is very limited as it is conditional on the choice of moment function $h(\cdot)$ in (6.5) or (6.18).

The GMM defines a class of estimators, with different choices of $h(\cdot)$ corresponding to different members of the class. Some choices of $h(\cdot)$ are better than others, depending on additional stochastic assumptions. For example, $h_i = x_i u_i$ yields the OLS estimator whereas $h_i = x_i u_i / V[u_i|x_i]$ yields the GLS estimator when errors are heteroskedastic. This multitude of potential choices for $h(\cdot)$ can make any particular GMM estimator appear ad hoc. However, qualitatively similar decisions have to be made in m-estimation in choosing, for example, to minimize the sum of squared errors rather than the weighted sum of squared errors or the sum of absolute deviations of errors.

If complete distributional assumptions are made the most efficient estimator is the MLE. Thus the optimal choice of $h(\cdot)$ in (6.5) is

$$h(w, \theta) = \frac{\partial \ln f(w, \theta)}{\partial \theta},$$

where $f(w, \theta)$ is the joint density of $w$. For regression with dependent variable(s) $y$ and regressors $x$ this is the unconditional MLE based on the unconditional joint density $f(y, x, \theta)$ of $y$ and $x$. In many applications $f(y, x, \theta) = f(y|x, \theta) g(x)$, where the (suppressed) parameters of the marginal density of $x$ do not depend on the parameters of interest $\theta$. Then it is just as efficient to use the conditional MLE based on the conditional density $f(y|x, \theta)$. This can be used as the basis for MM estimation, or GMM estimation with weighting matrix $W_N = I_q$, though any full-rank matrix $W_N$ will also give the MLE. This result is of limited practical use, however, as the purpose of GMM estimation is to avoid making a full set of distributional assumptions.

When incomplete distributional assumptions are made, a common starting point is specification of a conditional moment condition, where conditioning is on exogenous variables. This is usually a low-order moment condition for the model error such as $E[u|x] = 0$ or $E[u|z] = 0$. This conditional moment condition can lead to many unconditional moment conditions that might be the basis for GMM estimation, such as $E[zu] = 0$. Newey (1990a, 1993) obtained results on the optimal choice of unconditional moment condition for data independent over $i$.

Specifically, begin with $s$ conditional moment condition restrictions

$$E[r(y, x, \theta_0)|z] = 0, \tag{6.23}$$

where $r(\cdot)$ is a residual-type $s \times 1$ vector function introduced in Section 6.2.2. A scalar example is $E[y - x'\theta_0|z] = 0$. The instrumental variables notation is being used where $x$ are regressors, some potentially endogenous, and $z$ are instruments that include the exogenous components of $x$. In simpler models without endogeneity $z = x$.

GMM estimation of the $q$ parameters $\theta$ based on (6.23) is not possible, as typically there are only a few conditional moment restrictions, and often just one, so $s < q$. Instead, we introduce an $r \times s$ matrix function of the instruments $D(z)$, where $r \geq q$, and note that by the law of iterated expectations $E[D(z) r(y, x, \theta_0)] = 0$, which can be used as the basis for GMM estimation. The optimal instruments or optimal choice of matrix function $D(z)$ can be shown to be the $q \times s$ matrix

$$D^*(z, \theta_0) = E\left[\left.\frac{\partial r(y, x, \theta_0)'}{\partial \theta}\,\right|\, z\right] \left\{V[r(y, x, \theta_0)|z]\right\}^{-1}. \tag{6.24}$$

A derivation is given in, for example, Davidson and MacKinnon (1993, p. 604). The optimal instrument matrix $D^*(z)$ is a $q \times s$ matrix, so the unconditional moment condition $E[D^*(z) r(y, x, \theta_0)] = 0$ yields exactly as many moment conditions as parameters. The optimal GMM estimator simply solves the corresponding sample moment conditions

$$\frac{1}{N} \sum_{i=1}^{N} D^*(z_i, \hat{\theta})\, r(y_i, x_i, \hat{\theta}) = 0. \tag{6.25}$$


The optimal estimator requires additional assumptions, namely the expectations used in forming $D^*(z, \theta_0)$ in (6.24), and implementation requires replacing unknown parameters by known parameters so that generated regressors $\hat{D}$ are used.

For example, if $r(y, x, \theta) = y - \exp(x'\theta)$ then $\partial r / \partial \theta = -\exp(x'\theta)x$ and (6.24) requires specification of $E[\exp(x'\theta_0)x|z]$ and $V[y - \exp(x'\theta)|z]$. One possibility is to assume $E[\exp(x'\theta_0)x|z]$ is a low-order polynomial in $z$, in which case there will be more moment conditions than parameters and so estimation is by GMM rather than simply by solving (6.25), and to assume errors are homoskedastic. If these additional assumptions are wrong then the estimator is still consistent, provided (6.23) is valid, and consistent standard errors can be obtained using the robust form of the variance matrix in (6.16). It is common to more simply use $z$ rather than $D^*(z, \theta)$ as the instrument.


Optimal  Moment  Condition  for  Nonlinear  Regression  Example 

The result (6.24) is useful in some cases, especially those where $z = x$. Here we confirm that GLS is the most efficient GMM estimator based on $E[u|x] = 0$.

Consider the nonlinear regression model $y = g(x, \beta) + u$. If the starting point is the conditional moment restriction $E[u|x] = 0$, or $E[y - g(x, \beta)|x] = 0$, then $z = x$ in (6.23), and (6.24) yields

$$D^*(x, \beta) = E\left[\left.\frac{\partial}{\partial \beta}\left(y - g(x, \beta_0)\right)\,\right|\, x\right] \left\{V[y - g(x, \beta_0)|x]\right\}^{-1} = -\frac{\partial g(x, \beta_0)}{\partial \beta} \times \frac{1}{V[u|x]},$$

which requires only specification of $V[u|x]$. From (6.25) the optimal GMM estimator directly solves the corresponding sample moment conditions, which after multiplication by $-1$ are

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\partial g(x_i, \beta)}{\partial \beta} \times \frac{(y_i - g(x_i, \beta))}{\sigma_i^2} = 0,$$

where $\sigma_i^2 = V[u_i|x_i]$ is functionally independent of $\beta$. These are the first-order conditions for generalized NLS when the error is heteroskedastic. Implementation is possible using a consistent estimate $\hat{\sigma}_i^2$ of $\sigma_i^2$, in which case GMM estimation is the same as FGNLS. One can obtain standard errors robust to misspecification of $\sigma_i^2$ as detailed in Section 5.8.

Specializing to the linear model, $g(x, \beta) = x'\beta$ and the optimal GMM estimator based on $E[u|x] = 0$ is GLS, and specializing further to the case of homoskedastic errors, the optimal GMM estimator based on $E[u|x] = 0$ is OLS. As already seen in the example in Section 6.3.6, more efficient estimation may be possible if additional conditional moment conditions are used.
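For the linear model the optimal sample moment conditions $\sum_i x_i (y_i - x_i'\beta)/\sigma_i^2 = 0$ can be solved in closed form. The sketch below assumes the conditional variances $\sigma_i^2$ are known; the simulated data, the skedastic function, and the function name are hypothetical.

```python
import numpy as np

def solve_weighted_moments(y, X, sigma2):
    """Solve sum_i x_i (y_i - x_i'b) / sigma2_i = 0 for b (linear model, z = x)."""
    Xw = X / sigma2[:, None]                 # rows x_i' / sigma_i^2
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
sigma2 = np.exp(X[:, 1])                     # assumed skedastic function
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=N)

b_gls = solve_weighted_moments(y, X, sigma2)
b_ols = solve_weighted_moments(y, X, np.ones(N))   # equal weights reduce to OLS
```

The first call reproduces weighted least squares with weights $1/\sigma_i^2$, and the second, with constant variance, reproduces OLS.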


6.3.8.  Tests  of  Overidentifying  Restrictions 

Hypothesis tests on $\theta$ can be performed using the Wald test (see Section 5.5), or with other methods given in Section 7.5.

In addition there is a quite general model specification test that can be used for overidentified models with more moment conditions ($r$) than parameters ($q$). The test is one of the closeness of $N^{-1} \sum_i \hat{h}_i$ to 0, where $\hat{h}_i = h(w_i, \hat{\theta})$. This is an obvious test of $H_0$: $E[h(w, \theta_0)] = 0$, the initial population moment conditions. For just-identified models, estimation imposes $N^{-1} \sum_i \hat{h}_i = 0$ and the test is not possible. For overidentified models, however, the first-order conditions (6.8) set a $q \times r$ matrix times $N^{-1} \sum_i \hat{h}_i$ to zero, where $q < r$, so $\sum_i \hat{h}_i \neq 0$.

In the special case that $\theta$ is estimated by $\hat{\theta}_{OGMM}$ defined in (6.18), Hansen (1982) showed that the overidentifying restrictions (OIR) test statistic

$$\text{OIR} = N \left(N^{-1} \sum_i \hat{h}_i\right)' \hat{S}^{-1} \left(N^{-1} \sum_i \hat{h}_i\right) \tag{6.26}$$

is asymptotically distributed as $\chi^2(r - q)$ under $H_0$: $E[h(w, \theta_0)] = 0$. Note that OIR equals $N$ times the GMM objective function (6.18) evaluated at $\hat{\theta}_{OGMM}$. If OIR is large then the population moment conditions are rejected and the GMM estimator is inconsistent for $\theta$.

It is not obvious a priori that the particular quadratic form in $N^{-1} \sum_i \hat{h}_i$ given in (6.26) is $\chi^2(r - q)$ distributed under $H_0$. A formal derivation is given in the next section and an intuitive explanation in the case of linear IV estimation is provided in Section 8.4.4.
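Computing the statistic is mechanical once the fitted moment contributions are available. The sketch below uses the scaling $N\,\bar{g}'\hat{S}^{-1}\bar{g}$, with $\bar{g} = N^{-1}\sum_i \hat{h}_i$, under which the quadratic form has a chi-square limit. The function name is hypothetical and the toy array is not the output of an actual estimator.

```python
import numpy as np

def oir_statistic(h_hat, S_hat=None):
    """N * gbar' S^{-1} gbar for an N x r array of fitted moment contributions.

    If S_hat is omitted it is formed as N^{-1} sum_i h_i h_i' from h_hat.
    """
    N = h_hat.shape[0]
    g_bar = h_hat.mean(axis=0)
    if S_hat is None:
        S_hat = h_hat.T @ h_hat / N
    return float(N * (g_bar @ np.linalg.solve(S_hat, g_bar)))

# Hand-checkable toy input: gbar = [0, 0.5], S_hat = diag(0.5, 1).
h_hat = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
oir = oir_statistic(h_hat)     # 4 * 0.5**2 / 1
```

In practice `h_hat` would be evaluated at the optimal GMM estimate and the result compared with a $\chi^2(r - q)$ critical value.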

A  classic  application  is  to  life-cycle  models  of  consumption  (see  Section  6.2.7),  in 
which  case  the  orthogonality  conditions  are  Euler  conditions.  A  large  chi-square  test 
statistic  is  then  often  stated  to  mean  rejection  of  the  life-cycle  hypothesis.  However,  it 
should  instead  be  more  narrowly  interpreted  as  rejection  of  the  particular  specification 
of  utility  function  and  set  of  stochastic  assumptions  used  in  the  study. 


6.3.9.  Derivations  for  the  GMM  Estimator 

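The algebra is simplified by introducing a more compact notation. The GMM estimator minimizes

$$Q_N(\theta) = g_N(\theta)' W_N g_N(\theta), \tag{6.27}$$

where $g_N(\theta) = N^{-1} \sum_i h_i(\theta)$. Then the GMM first-order conditions (6.8) are

$$G_N(\hat{\theta})' W_N g_N(\hat{\theta}) = 0, \tag{6.28}$$

where $G_N(\theta) = \partial g_N(\theta) / \partial \theta' = N^{-1} \sum_i \partial h_i(\theta) / \partial \theta'$.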

For consistency we consider the informal condition that the probability limit of $\partial Q_N(\theta) / \partial \theta|_{\theta_0}$ equals zero. From (6.28) this will be the case as $G_N(\theta_0)$ and $W_N$ have finite probability limits, by assumptions (iii) and (iv) of Proposition 6.1, and $\text{plim}\, g_N(\theta_0) = 0$ as a consequence of assumption (v). More intuitively, $g_N(\theta_0) = N^{-1} \sum_i h_i(\theta_0)$ has probability limit zero if a law of large numbers can be applied and $E[h_i(\theta_0)] = 0$, which was assumed at the outset in (6.5).

The parameter $\theta_0$ is identified by the key assumption (ii) and additionally assumptions (iii) and (iv), which restrict the probability limits of $G_N(\theta_0)$ and $W_N$ to be full-rank matrices. The assumption that $G_0 = \text{plim}\, G_N(\theta_0)$ is a full-rank matrix is called the rank condition for identification. A weaker necessary condition for identification is the order condition that $r \geq q$.

For asymptotic normality, a more general theory is needed than that for an m-estimator based on an objective function $Q_N(\theta) = N^{-1} \sum_i q(w_i, \theta)$ that involves just one sum. We rescale (6.28) by multiplication by $\sqrt{N}$, so that

$$G_N(\hat{\theta})' W_N \sqrt{N} g_N(\hat{\theta}) = 0. \tag{6.29}$$

The approach of the general Theorem 5.3 is to take a Taylor series expansion around $\theta_0$ of the entire left-hand side of (6.28). Since $\hat{\theta}$ appears in both the first and third terms this is complicated and requires existence of first derivatives of $G_N(\theta)$ and hence second derivatives of $g_N(\theta)$. Since $G_N(\hat{\theta})$ and $W_N$ have finite probability limits it is sufficient to more simply take an exact Taylor series expansion of only $\sqrt{N} g_N(\hat{\theta})$. This yields an expression similar to that in the Chapter 5 discussion of m-estimation, with

$$\sqrt{N} g_N(\hat{\theta}) = \sqrt{N} g_N(\theta_0) + G_N(\theta^+) \sqrt{N} (\hat{\theta} - \theta_0), \tag{6.30}$$

recalling that $G_N(\theta) = \partial g_N(\theta) / \partial \theta'$, where $\theta^+$ is a point between $\theta_0$ and $\hat{\theta}$. Substituting (6.30) back into (6.29) yields

$$G_N(\hat{\theta})' W_N \left[\sqrt{N} g_N(\theta_0) + G_N(\theta^+) \sqrt{N} (\hat{\theta} - \theta_0)\right] = 0.$$

Solving for $\sqrt{N}(\hat{\theta} - \theta_0)$ yields

$$\sqrt{N}(\hat{\theta} - \theta_0) = -\left[G_N(\hat{\theta})' W_N G_N(\theta^+)\right]^{-1} G_N(\hat{\theta})' W_N \sqrt{N} g_N(\theta_0). \tag{6.31}$$

Equation (6.31) is the key result for obtaining the limit distribution of the GMM estimator. We obtain the probability limits of each of the first five terms using $\hat{\theta} \xrightarrow{p} \theta_0$, given consistency, in which case $\theta^+ \xrightarrow{p} \theta_0$. The last term on the right-hand side of (6.31) has a limit normal distribution by assumption (v). Thus

$$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} -(G_0' W_0 G_0)^{-1} G_0' W_0 \times \mathcal{N}[0, S_0],$$

where $G_0$, $W_0$, and $S_0$ have been defined in Proposition 6.1. Applying the limit normal product rule (Theorem A.17) yields (6.11).

This derivation treats the GMM first-order conditions as being $q$ linear combinations of the $r$ sample moments $g_N(\hat{\theta})$, since $G_N(\hat{\theta})' W_N$ is a $q \times r$ matrix. The MM estimator is the special case $q = r$, since then $G_N(\hat{\theta})' W_N$ is a full-rank square matrix, so $G_N(\hat{\theta})' W_N g_N(\hat{\theta}) = 0$ implies that $g_N(\hat{\theta}) = 0$.

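To derive the distribution of the OIR test statistic in (6.26), begin with a first-order Taylor series expansion of $\sqrt{N} g_N(\hat{\theta})$ around $\theta_0$ to obtain

$$\begin{aligned}
\sqrt{N} g_N(\hat{\theta}_{OGMM}) &= \sqrt{N} g_N(\theta_0) + G_N(\theta^+) \sqrt{N} (\hat{\theta}_{OGMM} - \theta_0) \\
&= \sqrt{N} g_N(\theta_0) - G_0 (G_0' S_0^{-1} G_0)^{-1} G_0' S_0^{-1} \sqrt{N} g_N(\theta_0) + o_p(1) \\
&= [I - M_0 S_0^{-1}] \sqrt{N} g_N(\theta_0) + o_p(1),
\end{aligned}$$

where the second equality uses (6.31) with $W_N$ consistent for $S_0^{-1}$, $M_0 = G_0 (G_0' S_0^{-1} G_0)^{-1} G_0'$, and $o_p(1)$ is defined in Definition A.22. It follows that

$$\begin{aligned}
S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM}) &= S_0^{-1/2} [I - M_0 S_0^{-1}] \sqrt{N} g_N(\theta_0) + o_p(1) \\
&= [I - S_0^{-1/2} M_0 S_0^{-1/2}] S_0^{-1/2} \sqrt{N} g_N(\theta_0) + o_p(1).
\end{aligned} \tag{6.32}$$

Now $[I - S_0^{-1/2} M_0 S_0^{-1/2}] = [I - S_0^{-1/2} G_0 (G_0' S_0^{-1} G_0)^{-1} G_0' S_0^{-1/2}]$ is an idempotent matrix of rank $(r - q)$, and $S_0^{-1/2} \sqrt{N} g_N(\theta_0) \xrightarrow{d} \mathcal{N}[0, I]$ given $\sqrt{N} g_N(\theta_0) \xrightarrow{d} \mathcal{N}[0, S_0]$. From standard results for quadratic forms of normal variables it follows that the inner product

$$\tau_N = \left(S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM})\right)' \left(S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM})\right)$$

converges to the $\chi^2(r - q)$ distribution.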
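The idempotency and rank claims used in this derivation can be checked numerically. The sketch below draws an arbitrary full-rank $G_0$ and positive definite $S_0$ (both made up for illustration) and forms $I - S_0^{-1/2} M_0 S_0^{-1/2}$; the helper name is hypothetical.

```python
import numpy as np

def inv_sqrt_spd(S):
    """Symmetric inverse square root S^{-1/2} of a positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

rng = np.random.default_rng(2)
r, q = 4, 2
G0 = rng.normal(size=(r, q))                  # arbitrary r x q Jacobian
A = rng.normal(size=(r, r))
S0 = A @ A.T + r * np.eye(r)                  # arbitrary positive definite S0

S_inv_half = inv_sqrt_spd(S0)
M0 = G0 @ np.linalg.inv(G0.T @ np.linalg.inv(S0) @ G0) @ G0.T
P = np.eye(r) - S_inv_half @ M0 @ S_inv_half

is_idempotent = np.allclose(P @ P, P)
trace_P = float(np.trace(P))                  # trace of an idempotent matrix equals its rank
```

Because the trace of an idempotent matrix equals its rank, a trace of $r - q$ is consistent with the degrees of freedom of the limiting chi-square distribution.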


6.4.  Linear  Instrumental  Variables 

Correlation of regressors with the error term leads to inconsistency of least-squares methods. Examples of such failure include omitted variables, simultaneity, measurement error in the regressors, and sample selection bias. Instrumental variables methods provide a general approach that can handle any of these problems, provided suitable instruments exist.

Instrumental  variables  methods  fall  naturally  into  the  GMM  framework  as  a  surplus 
of  instruments  leads  to  an  excess  of  moment  conditions  that  can  be  used  for  estimation. 
Many  IV  results  are  most  easily  obtained  using  the  GMM  framework. 

Linear IV is important enough to appear in many places in this book. An introduction was given in Sections 4.8 and 4.9. This section presents single-equation linear IV as a particular application of GMM. For completeness the section also presents the earlier literature on a special case, the two-stage least-squares estimator. Systems linear IV estimation is summarized in Section 6.9.5. Tests of endogeneity and tests of overidentifying restrictions for linear models are detailed in Section 8.4. Chapter 22 presents linear IV estimation with panel data.

6.4.1.  Linear  GMM  with  Instruments 

Consider the linear regression model

$$y_i = x_i'\beta + u_i, \tag{6.33}$$

where each component of $x$ is viewed as being an exogenous regressor if it is uncorrelated with the error in model (6.33) or an endogenous regressor if it is correlated. If all regressors are exogenous then LS estimators can be used, but if any components of $x$ are endogenous then LS estimators are inconsistent for $\beta$.

From Section 4.8, consistent estimates can be obtained by IV estimation. The key assumption is the existence of an $r \times 1$ vector of instruments $z$ that satisfies

$$E[u_i|z_i] = 0. \tag{6.34}$$

Exogenous regressors can be instrumented by themselves. As there must be at least as many instruments as regressors, the challenge is to find additional instruments that at least equal the number of endogenous variables in the model. Some examples of such instruments have been given in Section 4.8.2.


Linear  GMM  Estimator 


From Section 6.2.5, the conditional moment restriction (6.34) and model (6.33) imply the unconditional moment restriction

$$E[z_i(y_i - x_i'\beta)] = 0, \tag{6.35}$$

where for notational simplicity the following analysis uses $\beta$ rather than the more formal $\beta_0$ to denote the true parameter value. A quadratic form in the corresponding sample moments leads to the GMM objective function $Q_N(\beta)$ given in (6.4).

In matrix notation define $y = X\beta + u$ as usual and let $Z$ denote the $N \times r$ matrix of instruments with $i$th row $z_i'$. Then $\sum_i z_i(y_i - x_i'\beta) = Z'u$ and (6.4) becomes

$$Q_N(\beta) = \left[\frac{1}{N}(y - X\beta)'Z\right] W_N \left[\frac{1}{N} Z'(y - X\beta)\right], \tag{6.36}$$

where $W_N$ is an $r \times r$ full-rank symmetric weighting matrix with leading examples given at the end of this section. The first-order conditions

$$\frac{\partial Q_N(\beta)}{\partial \beta} = -2\left[\frac{1}{N} X'Z\right] W_N \left[\frac{1}{N} Z'(y - X\beta)\right] = 0$$

can actually be solved for $\beta$ in this special case of GMM, leading to the GMM estimator in the linear IV model

$$\hat{\beta}_{GMM} = [X'Z W_N Z'X]^{-1} X'Z W_N Z'y, \tag{6.37}$$

where the divisions by $N$ have canceled out.
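The closed form (6.37) translates directly into matrix code. The following is a minimal sketch on simulated data with one endogenous regressor and two excluded instruments; the data-generating process, parameter values, and function name are assumptions made for illustration.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Linear GMM estimator (6.37): [X'Z W Z'X]^{-1} X'Z W Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

rng = np.random.default_rng(3)
N = 1000
z = rng.normal(size=(N, 2))                    # two excluded instruments
v = rng.normal(size=N)
x = z @ np.array([1.0, 0.5]) + v               # endogenous regressor
u = 0.8 * v + rng.normal(size=N)               # error correlated with x through v
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(N), x])           # K = 2 regressors
Z = np.column_stack([np.ones(N), z])           # r = 3 instruments

b_gmm = linear_gmm(y, X, Z, np.eye(3))                     # W_N = I
b_2sls = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z / N))   # W_N = [N^{-1} Z'Z]^{-1}
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]               # inconsistent in this design
print(b_gmm, b_2sls, b_ols)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience only; the estimator is the same.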


Distribution  of  Linear  GMM  Estimator 

The general results of Section 6.3 can be used to derive the asymptotic distribution. Alternatively, since an explicit solution for $\hat{\beta}_{GMM}$ exists the analysis for OLS given in Section 4.4 can be adapted. Substituting $y = X\beta + u$ into (6.37) yields

$$\hat{\beta}_{GMM} = \beta + \left[(N^{-1} X'Z) W_N (N^{-1} Z'X)\right]^{-1} (N^{-1} X'Z) W_N (N^{-1} Z'u). \tag{6.38}$$

From the last term, consistency of the GMM estimator essentially requires that $\text{plim}\, N^{-1} Z'u = 0$. Under pure random sampling this requires that (6.35) holds, whereas under other common sampling schemes (see Section 24.3) the stronger assumption (6.34) is needed.

Additionally, the rank condition for identification of $\beta$ that $\text{plim}\, N^{-1} Z'X$ is of rank $K$ ensures that the inverse in the right-hand side exists, provided $W_N$ is of full rank. A weaker order condition is that $r \geq K$.

The limit distribution is based on the expression for $\sqrt{N}(\hat{\beta}_{GMM} - \beta)$ obtained by simple manipulation of (6.38). This yields an asymptotic normal distribution for $\hat{\beta}_{GMM}$ with mean $\beta$ and estimated asymptotic variance

$$\hat{V}[\hat{\beta}_{GMM}] = N [X'Z W_N Z'X]^{-1} [X'Z W_N \hat{S} W_N Z'X] [X'Z W_N Z'X]^{-1}, \tag{6.39}$$

where $\hat{S}$ is a consistent estimate of

$$S = \lim \frac{1}{N} \sum_{i=1}^{N} E[u_i^2 z_i z_i'],$$

given the usual cross-section assumption of independence over $i$. The essential additional assumption needed for (6.39) is that $N^{-1/2} Z'u \xrightarrow{d} \mathcal{N}[0, S]$. Result (6.39) also follows from Proposition 6.1 with $h(\cdot) = z(y - x'\beta)$ and hence $\partial h / \partial \beta' = -zx'$.

For cross-section data with heteroskedastic errors, $S$ is consistently estimated by

$\hat{S} = \frac{1}{N} \sum_{i=1}^{N} \hat{u}_i^2 z_i z_i' = Z'DZ/N$,  (6.40)

where $\hat{u}_i = y_i - x_i'\hat{\beta}$ is the GMM residual and $D$ is an $N \times N$ diagonal matrix with entries $\hat{u}_i^2$. A commonly used small-sample adjustment is to divide by $N - K$ rather than $N$ in the formula for $\hat{S}$. In the more restrictive case of homoskedastic errors, $E[u_i^2|z_i] = \sigma^2$ and so $S = \lim N^{-1} \sum_i \sigma^2 E[z_i z_i']$, leading to estimate

$\hat{S} = s^2 Z'Z/N$,  (6.41)

where $s^2 = (N - K)^{-1} \sum_{i=1}^{N} \hat{u}_i^2$ is consistent for $\sigma^2$. These results mimic similar results for OLS presented in Section 4.4.5.


Table 6.2. GMM Estimators in Linear IV Model and Their Asymptotic Variance^a

Estimator                         Definition and Asymptotic Variance

GMM                               $\hat{\beta}_{GMM} = [X'ZW_NZ'X]^{-1}X'ZW_NZ'y$
(general $W_N$)                   $\hat{V}[\hat{\beta}] = N[X'ZW_NZ'X]^{-1}[X'ZW_N\hat{S}W_NZ'X][X'ZW_NZ'X]^{-1}$

Optimal GMM                       $\hat{\beta}_{OGMM} = [X'Z\hat{S}^{-1}Z'X]^{-1}X'Z\hat{S}^{-1}Z'y$
($W_N = \hat{S}^{-1}$)            $\hat{V}[\hat{\beta}] = N[X'Z\hat{S}^{-1}Z'X]^{-1}$

2SLS                              $\hat{\beta}_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y$
($W_N = [N^{-1}Z'Z]^{-1}$)        $\hat{V}[\hat{\beta}] = N[X'Z(Z'Z)^{-1}Z'X]^{-1}[X'Z(Z'Z)^{-1}\hat{S}(Z'Z)^{-1}Z'X][X'Z(Z'Z)^{-1}Z'X]^{-1}$
                                  $\hat{V}[\hat{\beta}] = s^2[X'Z(Z'Z)^{-1}Z'X]^{-1}$ if homoskedastic errors

IV                                $\hat{\beta}_{IV} = [Z'X]^{-1}Z'y$
(just-identified)                 $\hat{V}[\hat{\beta}] = N(Z'X)^{-1}\hat{S}(X'Z)^{-1}$

a Equations are based on a linear regression model with dependent variable $y$, regressors $X$, and instruments $Z$. $\hat{S}$ is defined in (6.40) and $s^2$ is defined after (6.41). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the 2SLS estimator. Optimal GMM uses the optimal weighting matrix.
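
The variance formula (6.39) with $\hat{S}$ from (6.40) can be sketched numerically as follows. This is only an illustration under assumptions that are not in the text: the function name `linear_gmm`, the simulated design, and the choice $W_N = (N^{-1}Z'Z)^{-1}$ are all hypothetical.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Linear GMM estimate with the robust variance estimate in (6.39)."""
    N = len(y)
    XZ = X.T @ Z                                  # K x r matrix X'Z
    A = XZ @ W @ XZ.T                             # X'Z W Z'X
    beta = np.linalg.solve(A, XZ @ W @ (Z.T @ y))
    u_hat = y - X @ beta                          # GMM residuals
    S_hat = (Z * u_hat[:, None] ** 2).T @ Z / N   # (6.40): Z'DZ/N
    A_inv = np.linalg.inv(A)
    V = N * A_inv @ (XZ @ W @ S_hat @ W @ XZ.T) @ A_inv
    return beta, V

# Simulated data (hypothetical design): one endogenous regressor, two
# excluded instruments, heteroskedastic errors.
rng = np.random.default_rng(0)
N = 5000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
v = rng.normal(size=N)
u = 0.5 * v + (1 + 0.5 * np.abs(Z[:, 1])) * rng.normal(size=N)
X = np.column_stack([np.ones(N), Z[:, 1] + Z[:, 2] + v])
y = X @ np.array([1.0, 2.0]) + u

W_2sls = np.linalg.inv(Z.T @ Z / N)               # one common choice of W_N
beta_hat, V_hat = linear_gmm(y, X, Z, W_2sls)
std_err = np.sqrt(np.diag(V_hat))                 # robust standard errors
print(beta_hat, std_err)
```

Forming $Z'DZ$ by scaling the rows of $Z$ avoids building the $N \times N$ diagonal matrix $D$ explicitly.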


6.4.2.  Different  Linear  GMM  Estimators 

Implementation of the results of Section 6.4.1 requires specification of the weighting matrix $W_N$. For just-identified models all choices of $W_N$ lead to the same estimator. For overidentified models there are two common choices of $W_N$, given in the following.

Table 6.2 summarizes these estimators and gives the appropriate specialization of the estimated variance matrix formula given in (6.39), assuming independent heteroskedastic errors.


Instrumental  Variables  Estimator 

In the just-identified case $r = K$ and $X'Z$ is a square matrix that is invertible. Then $[X'ZW_NZ'X]^{-1} = (Z'X)^{-1}W_N^{-1}(X'Z)^{-1}$ and (6.37) simplifies to the instrumental variables estimator

$\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$,  (6.42)

introduced in Section 4.8.6. For just-identified models the GMM estimator for any choice of $W_N$ equals the IV estimator.
The simple IV estimator can also be used in overidentified models, by discarding some of the instruments so that the model is just-identified, but this results in an efficiency loss compared to using all the instruments.
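
The invariance of the just-identified estimator to the weighting matrix can be checked numerically. The sketch below uses a hypothetical simulated design with $r = K = 2$ and compares the IV estimator (6.42) with GMM under an identity and under an arbitrary positive definite weighting matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 500, 2
Z = np.column_stack([np.ones(N), rng.normal(size=N)])    # r = K = 2 instruments
v = rng.normal(size=N)
X = np.column_stack([np.ones(N), Z[:, 1] + v])           # second regressor endogenous
y = X @ np.array([1.0, 2.0]) + 0.5 * v + rng.normal(size=N)

def gmm(W):
    """Linear GMM estimator [X'ZWZ'X]^{-1} X'ZWZ'y for a given W."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)              # (6.42)

M = rng.normal(size=(K, K))
W_random = M @ M.T + np.eye(K)                           # arbitrary positive definite W
beta_identity = gmm(np.eye(K))
beta_random = gmm(W_random)
print(beta_iv, beta_identity, beta_random)
```

All three estimates coincide up to floating-point error because $X'Z$ is square and invertible, so $W_N$ cancels.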


Optimal-Weighted GMM

From Section 6.3.5, for overidentified models the most efficient GMM estimator, meaning GMM with optimal choice of weighting matrix, sets $W_N = \hat{S}^{-1}$ in (6.37).

The optimal GMM estimator or two-step GMM estimator in the linear IV model is

$\hat{\beta}_{OGMM} = [(X'Z)\hat{S}^{-1}(Z'X)]^{-1}(X'Z)\hat{S}^{-1}(Z'y)$.  (6.43)

For heteroskedastic errors, $\hat{S}$ is computed using (6.40) based on a consistent first-step estimate $\hat{\beta}$ such as the 2SLS estimator defined in (6.44). White (1982) called this estimator a two-stage IV estimator, since both steps entail IV estimation.

The estimated asymptotic variance matrix for optimal GMM given in Table 6.2 is of relatively simple form as (6.39) simplifies when $W_N = \hat{S}^{-1}$. In computing the estimated variance one can use $\hat{S}$ as presented in Table 6.2, but it is more common to instead use an estimator $\tilde{S}$, say, that is also computed using (6.40) but evaluates the residual at the optimal GMM estimator rather than the first-step estimate used to form $\hat{S}$ in (6.43).
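
The two steps can be sketched as follows. The function name `two_step_gmm` and the simulated data are hypothetical; the first step is taken to be 2SLS, and the variance uses $\tilde{S}$ evaluated at the second-step residuals, as described above.

```python
import numpy as np

def two_step_gmm(y, X, Z):
    """Two-step optimal GMM for the linear IV model, a sketch of (6.43)."""
    N = len(y)
    XZ, Zy = X.T @ Z, Z.T @ y

    def estimate(W):
        return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ Zy)

    def S_of(beta):
        u = y - X @ beta
        return (Z * u[:, None] ** 2).T @ Z / N           # (6.40)

    beta_2sls = estimate(np.linalg.inv(Z.T @ Z / N))     # step 1: 2SLS
    S_hat = S_of(beta_2sls)                              # uses first-step residuals
    beta_ogmm = estimate(np.linalg.inv(S_hat))           # step 2: W_N = S_hat^{-1}
    S_tilde = S_of(beta_ogmm)                            # uses second-step residuals
    V = N * np.linalg.inv(XZ @ np.linalg.inv(S_tilde) @ XZ.T)
    return beta_2sls, beta_ogmm, V

# Hypothetical overidentified design with heteroskedastic errors.
rng = np.random.default_rng(2)
N = 5000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
v = rng.normal(size=N)
u = 0.5 * v + (1 + 0.5 * np.abs(Z[:, 1])) * rng.normal(size=N)
X = np.column_stack([np.ones(N), Z[:, 1] + Z[:, 2] + v])
y = X @ np.array([1.0, 2.0]) + u

beta_2sls, beta_ogmm, V_ogmm = two_step_gmm(y, X, Z)
print(beta_2sls, beta_ogmm, np.sqrt(np.diag(V_ogmm)))
```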


Two-Stage  Least  Squares 

If errors are homoskedastic rather than heteroskedastic, $\hat{S}^{-1} = [s^2N^{-1}Z'Z]^{-1}$ from (6.41). Then $W_N = (N^{-1}Z'Z)^{-1}$ in (6.37), leading to the two-stage least-squares estimator, introduced in Section 4.8.7, that can be expressed compactly as

$\hat{\beta}_{2SLS} = [X'P_ZX]^{-1}[X'P_Zy]$,  (6.44)

where $P_Z = Z(Z'Z)^{-1}Z'$. The basis of the term two-stage least-squares is presented in the next section. The 2SLS estimator is also called the generalized instrumental variables (GIV) estimator as it generalizes the IV estimator to the overidentified case of more instruments than regressors. It is also called the one-step GMM because (6.44) can be calculated in one step, whereas optimal GMM requires two steps.

The 2SLS estimator is asymptotically normally distributed with estimated asymptotic variance given in Table 6.2. The general form should be used if one wishes to guard against heteroskedastic errors whereas the simpler form, presented in many introductory textbooks, is consistent only if errors are indeed homoskedastic.


Optimal  GMM  versus  2SLS 

Both the optimal GMM and the 2SLS estimator lead to efficiency gains in overidentified models. Optimal GMM has the advantage of being more efficient than 2SLS if errors are heteroskedastic, though the efficiency gain need not be great. Some of the GMM testing procedures given in Section 7.5 and Chapter 8 assume estimation using the optimal weighting matrix. Optimal GMM has the disadvantage of requiring additional computation compared to 2SLS. Moreover, as discussed in Section 6.3.5, asymptotic theory may provide a poor small-sample approximation to the distribution of the optimal GMM estimator.

In  cross-section  applications  it  is  common  to  use  the  less  efficient  2SLS,  though 
with  inference  based  on  heteroskedastic  robust  standard  errors. 


Even  More  Efficient  GMM  Estimation 

The estimator $\hat{\beta}_{OGMM}$ is the most efficient estimator based on the unconditional moment condition $E[z_iu_i] = 0$, where $u_i = y_i - x_i'\beta$. However, this is not the best moment condition to use if the starting point is the conditional moment condition $E[u_i|z_i] = 0$ and errors are heteroskedastic, meaning $V[u_i|z_i]$ varies with $z_i$.

Applying the general results of Section 6.3.7, we can write the optimal moment condition for GMM estimation based on $E[u_i|z_i] = 0$ as

$E\left[E[x_i|z_i]\,u_i/V[u_i|z_i]\right] = 0$.  (6.45)

As with the LS regression example in Section 6.3.7, one should divide by the error variance $V[u|z]$. Implementation is more difficult than in the LS case, however, as a model for $E[x|z]$ needs to be specified in addition to one for $V[u|z]$. This may be possible with additional structure. In particular, for a linear simultaneous equations system $E[x_i|z_i]$ is linear in $z$ so that estimation is based on $E[x_iu_i/V[u_i|z_i]] = 0$.

For linear models the GMM estimator is usually based on the simpler condition $E[z_iu_i] = 0$. Given this condition, the optimal GMM estimator defined in (6.43) is the most efficient GMM estimator.


6.4.3.  Alternative  Derivations  of  Two-Stage  Least  Squares 

The  2SLS  estimator,  the  standard  IV  estimator  for  overidentified  models,  was  derived 
in  Section  6.4.2  as  a  GMM  estimator. 

Here we present three other derivations of the 2SLS estimator. One of these derivations, due to Theil, provided the original motivation for 2SLS, which predates GMM. Theil's interpretation is emphasized in introductory treatments. However, it does not generalize to nonlinear models, whereas the GMM interpretation does.

We consider the linear model

$y = X\beta + u$,  (6.46)

with $E[u|Z] = 0$ and additionally $V[u|Z] = \sigma^2I$.


GLS  in  a  Transformed  Model 

Premultiplication of (6.46) by the instruments $Z'$ yields the transformed model

$Z'y = Z'X\beta + Z'u$.  (6.47)

This transformed model is often used as motivation for the IV estimator when $r = K$: ignoring $Z'u$, because $N^{-1}Z'u \xrightarrow{p} 0$, and solving yields $\hat{\beta} = (Z'X)^{-1}Z'y$.

Here instead we consider the overidentified case. Conditional on $Z$ the error $Z'u$ has mean zero and variance $\sigma^2Z'Z$ given the assumptions after (6.46). The efficient GLS estimator of $\beta$ in the transformed model (6.47) is then

$\hat{\beta} = [X'Z(\sigma^2Z'Z)^{-1}Z'X]^{-1}X'Z(\sigma^2Z'Z)^{-1}Z'y$,  (6.48)

which equals the 2SLS estimator in (6.44) since the multipliers $\sigma^2$ cancel out. More generally, note that if the transformed model (6.47) is instead estimated by WLS with weighting matrix $W_N$ then the more general estimator (6.37) is obtained.


Theil’s  Interpretation 

Theil (1953) proposed estimation by OLS regression of the original model (6.46), except that the regressors $X$ are replaced by a prediction $\hat{X}$ that is asymptotically uncorrelated with the error term.

Suppose that in the reduced form model the regressors $X$ are a linear combination of the instruments plus some error, so that

$X = Z\Pi + v$,  (6.49)

where $\Pi$ is an $r \times K$ matrix. Multivariate OLS regression of $X$ on $Z$ yields estimator $\hat{\Pi} = (Z'Z)^{-1}Z'X$ and OLS predictions $\hat{X} = Z\hat{\Pi}$ or

$\hat{X} = P_ZX$,

where $P_Z = Z(Z'Z)^{-1}Z'$. OLS regression of $y$ on $\hat{X}$ rather than $y$ on $X$ yields estimator

$\hat{\beta}_{Theil} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.  (6.50)

Theil's interpretation permits computation by two OLS regressions, with the first-stage OLS giving $\hat{X}$ and the second-stage OLS giving $\hat{\beta}$, leading to the term two-stage least-squares estimator.

To establish consistency of this estimator reexpress the linear model (6.46) as

$y = \hat{X}\beta + (X - \hat{X})\beta + u$.

The second-stage OLS regression of $y$ on $\hat{X}$ yields a consistent estimator of $\beta$ if the regressor $\hat{X}$ is asymptotically uncorrelated with the composite error term $(X - \hat{X})\beta + u$. If $\hat{X}$ were any proxy variable there is no reason for this to hold; however, here $\hat{X}$ is uncorrelated with $(X - \hat{X})$ as an OLS prediction is orthogonal to the OLS residual. Thus plim $N^{-1}\hat{X}'(X - \hat{X})\beta = 0$. Also,

$N^{-1}\hat{X}'u = N^{-1}X'P_Zu = N^{-1}X'Z(N^{-1}Z'Z)^{-1}N^{-1}Z'u$.

Then $\hat{X}$ is asymptotically uncorrelated with $u$ provided $Z$ is a valid instrument so that plim $N^{-1}Z'u = 0$. This consistency result for $\hat{\beta}_{Theil}$ depends heavily on the linearity of the model and does not generalize to nonlinear models.
Theil's estimator in (6.50) equals the 2SLS estimator defined earlier in (6.44). We have

$\hat{\beta}_{Theil} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$
$\quad = (X'P_Z'P_ZX)^{-1}X'P_Zy$
$\quad = (X'P_ZX)^{-1}X'P_Zy$,

the 2SLS estimator, using $P_Z'P_Z = P_Z$ in the final equality.

Care is needed in implementing 2SLS using Theil's method. The second-stage OLS will give the wrong standard errors, even if errors are homoskedastic, as it will estimate $\sigma^2$ using the second-stage OLS regression residuals $(y - \hat{X}\hat{\beta})$ rather than the actual residuals $(y - X\hat{\beta})$. In practice one may also make adjustment for heteroskedastic errors. It is much easier to use a program that offers 2SLS as an option and directly computes (6.44) and the associated variance matrix given in Table 6.2.
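
Both points can be illustrated in a small simulation: the two-stage computation reproduces (6.44), while the error variance estimated from second-stage residuals differs from the one based on the actual residuals. The design below is hypothetical, with homoskedastic errors of variance one.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
v = rng.normal(size=N)
u = 0.8 * v + 0.6 * rng.normal(size=N)            # V[u] = 1, correlated with v
X = np.column_stack([np.ones(N), Z[:, 1] + Z[:, 2] + v])
y = X @ np.array([1.0, 2.0]) + u
K = X.shape[1]

# First stage: OLS of X on Z gives predictions X_hat = P_Z X.
Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ Pi_hat

# Second stage: OLS of y on X_hat, as in (6.50).
beta_theil = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Direct 2SLS (6.44), using X'P_Z = X_hat' to avoid the N x N matrix P_Z.
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Error variance from second-stage residuals versus actual residuals.
s2_second_stage = np.sum((y - X_hat @ beta_theil) ** 2) / (N - K)
s2_actual = np.sum((y - X @ beta_2sls) ** 2) / (N - K)
print(beta_theil, beta_2sls, s2_second_stage, s2_actual)
```

In this design the second-stage residuals also contain $(X - \hat{X})\hat{\beta}$, so `s2_second_stage` is far from the true error variance while `s2_actual` is close to it.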

The  2SLS  interpretation  does  not  always  carry  over  to  nonlinear  models,  as  detailed 
in  Section  6.5.4.  The  GMM  interpretation  does,  and  for  this  reason  it  is  emphasized 
here  more  than  Theil’s  original  derivation  of  linear  2SLS. 

Theil actually considered a model where only some of the regressors $X$ are endogenous and the remaining are exogenous. The preceding analysis still applies, provided all the exogenous components of $X$ are included in the instruments $Z$. Then the first-stage OLS regression of the exogenous regressors on the instruments fits perfectly and the predictions of the exogenous regressors equal their actual values. So in practice at the first stage just the endogenous variables are regressed on the instruments, and the second-stage regression is of $y$ on the exogenous regressors and the first-stage predictions of the endogenous regressors.


Basmann’s  Interpretation 

Basmann (1957) proposed using as instruments the OLS reduced form predictions $\hat{X} = P_ZX$ for the simple IV estimator in the just-identified case, since there are then exactly as many instruments $\hat{X}$ as regressors $X$. This yields

$\hat{\beta}_{Basmann} = (\hat{X}'X)^{-1}\hat{X}'y$.  (6.51)

This is consistent since plim $N^{-1}\hat{X}'u = 0$, as already shown for Theil's estimator.

The estimator (6.51) actually equals the 2SLS estimator defined in (6.44), since $\hat{X}' = X'P_Z$.

This IV approach will lead to correct standard errors and can be extended to nonlinear settings.


6.4.4.  Alternatives  to  Standard  IV  Estimators 

The IV-based optimal GMM and 2SLS estimators presented in Section 6.4.2 are the standard estimators used when regressors are endogenous. Chernozhukov and Hansen (2005) present an IV estimator for quantile regression.

Here we briefly discuss leading alternative estimators that have received renewed interest given the poor finite-sample properties of 2SLS with weak instruments detailed in Section 4.9. We focus on single-equation linear models. At this stage there is no method that is relatively efficient yet has small bias in small samples.


Limited-Information  Maximum  Likelihood 

The limited-information maximum likelihood (LIML) estimator is obtained by joint ML estimation of the single equation (6.46) plus the reduced form for the endogenous regressors in the right-hand side of (6.46) assuming homoskedastic normal errors. For details see Greene (2003, p. 402) or Davidson and MacKinnon (1993, pp. 644-651). More generally the k-class of estimators (see, for example, Greene, 2003, p. 403) includes LIML, 2SLS, and OLS.

The LIML estimator due to Anderson and Rubin (1949) predates the 2SLS estimator. Unlike 2SLS, the LIML estimator is invariant to the normalization used in a simultaneous equations system. Moreover, LIML and 2SLS are asymptotically equivalent given homoskedastic errors. Yet LIML is rarely used as it is more difficult to implement and harder to explain than 2SLS. Bekker (1994) presents small-sample results for LIML and a generalization of LIML. See also Hahn and Hausman (2002).


Split-Sample IV

Begin with Basmann's interpretation of 2SLS as an IV estimator given in (6.51). Substituting for $y$ from (6.46) yields

$\hat{\beta} = \beta + (\hat{X}'X)^{-1}\hat{X}'u$.

By assumption plim $N^{-1}Z'u = 0$ so plim $N^{-1}\hat{X}'u = 0$ and $\hat{\beta}$ is consistent. However, correlation between $X$ and $u$, the reason for IV estimation, means that $\hat{X} = P_ZX$ is correlated with $u$. Thus $E[\hat{X}'u] \neq 0$, which leads to bias in the IV estimator. This bias arises from using $\hat{X} = Z\hat{\Pi}$ rather than $Z\Pi$ as the instrument.

An alternative is to instead use as instrument predictions $\tilde{X}$, which have the property that $E[\tilde{X}'u] = 0$ in addition to plim $N^{-1}\tilde{X}'u = 0$, and use estimator

$\tilde{\beta} = (\tilde{X}'X)^{-1}\tilde{X}'y$.

Since $E[\tilde{X}'u] = 0$ does not imply $E[(\tilde{X}'X)^{-1}\tilde{X}'u] = 0$, this estimator will still be biased, but the bias may be reduced.

Angrist and Krueger (1995) proposed obtaining such instruments by splitting the sample into two subsamples $(y_1, X_1, Z_1)$ and $(y_2, X_2, Z_2)$. The first sample is used to obtain estimate $\hat{\Pi}_1$ from regression of $X_1$ on $Z_1$. The second sample is used to obtain the IV estimator where the instrument $\tilde{X}_2 = Z_2\hat{\Pi}_1$ uses $\hat{\Pi}_1$ obtained from the separate first sample. Angrist and Krueger (1995) define the unbiased split-sample IV estimator as

$\hat{\beta}_{USSIV} = (\tilde{X}_2'X_2)^{-1}\tilde{X}_2'y_2$.

The split-sample IV estimator $\hat{\beta}_{SSIV} = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y_2$ is a variant based on Theil's interpretation of 2SLS. These estimators have finite-sample bias toward zero, unlike 2SLS, which is biased toward OLS. However, considerable efficiency loss occurs because only half the sample is used at the final stage.
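
A minimal sketch of the two split-sample estimators follows, assuming a hypothetical design with a single endogenous regressor, five instruments, mean-zero variables (so no intercept is included), and an even split of the sample.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
Z = rng.normal(size=(N, 5))
v = rng.normal(size=N)
x = Z @ np.full(5, 0.3) + v
u = 0.8 * v + 0.6 * rng.normal(size=N)      # correlated with x through v
X = x[:, None]                              # N x 1 regressor matrix
y = 2.0 * x + u                             # true coefficient is 2

half = N // 2
X1, Z1 = X[:half], Z[:half]                 # sample 1: first stage only
y2, X2, Z2 = y[half:], X[half:], Z[half:]   # sample 2: final-stage estimation

Pi1_hat = np.linalg.solve(Z1.T @ Z1, Z1.T @ X1)   # regression of X_1 on Z_1
X2_tilde = Z2 @ Pi1_hat                           # instrument for sample 2

# Unbiased split-sample IV: IV with instrument X2_tilde for X2.
beta_ussiv = np.linalg.solve(X2_tilde.T @ X2, X2_tilde.T @ y2)
# Split-sample IV: OLS of y_2 on X2_tilde (Theil-type second stage).
beta_ssiv = np.linalg.solve(X2_tilde.T @ X2_tilde, X2_tilde.T @ y2)
print(beta_ussiv, beta_ssiv)
```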

Jackknife  IV 

A  more  efficient  variant  of  this  estimator  implements  a  similar  procedure  but  generates 
instruments  observation  by  observation. 

Let the subscript $(-i)$ denote the leave-one-out operation that drops the $i$th observation. Then for the $i$th observation we obtain estimate $\hat{\Pi}_i$ from regression of $X_{(-i)}$ on $Z_{(-i)}$ and use as instrument $\tilde{x}_i' = z_i'\hat{\Pi}_i$. Repeating $N$ times gives an instrument matrix denoted $\tilde{X}_{(-i)}$ with $i$th row $\tilde{x}_i'$. This leads to the jackknife IV estimator

$\hat{\beta}_{JIV} = (\tilde{X}_{(-i)}'X)^{-1}\tilde{X}_{(-i)}'y$.

This estimator was originally proposed by Phillips and Hale (1977). Angrist, Imbens and Krueger (1999) and Blomquist and Dahlberg (1999) called it a jackknife estimator since the jackknife (see Section 11.5.5) is a leave-one-out method for bias reduction. The computational burden of obtaining the $N$ jackknife predicted values $\tilde{x}_i'$ is modest by use of the recursive formula given in Section 11.5.5. The Monte Carlo evidence given in the two recent papers is mixed, however, indicating a potential for bias reduction but also an increase in the variance. So the jackknife version may not be better than the conventional version in terms of mean-square error. The earlier paper by Phillips and Hale (1977) presents analytical results that the finite-sample bias of the JIV estimator is smaller than that of 2SLS only for appreciably overidentified models with $r > 2(K + 1)$. See also Hahn, Hausman and Kuersteiner (2001).
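
The leave-one-out instruments can be computed by brute force or with a shortcut. The sketch below does both on hypothetical simulated data. The shortcut uses the standard OLS leave-one-out identity $z_i'\hat{\Pi}_i = (z_i'\hat{\Pi} - h_ix_i')/(1 - h_i)$ with leverage $h_i = z_i'(Z'Z)^{-1}z_i$; this identity is an assumption brought in here and is not necessarily the recursive formula of Section 11.5.5.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 300
Z = rng.normal(size=(N, 4))
v = rng.normal(size=N)
x = Z @ np.full(4, 0.5) + v
X = x[:, None]
y = 2.0 * x + 0.8 * v + 0.6 * rng.normal(size=N)   # true coefficient is 2

# Brute force: rerun the first stage N times, dropping observation i each time.
X_loo = np.empty_like(X)
for i in range(N):
    keep = np.arange(N) != i
    Pi_i = np.linalg.solve(Z[keep].T @ Z[keep], Z[keep].T @ X[keep])
    X_loo[i] = Z[i] @ Pi_i

# Shortcut: one full-sample first stage plus the leverage correction.
ZtZ_inv = np.linalg.inv(Z.T @ Z)
h = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)        # h_i = z_i'(Z'Z)^{-1} z_i
X_hat = Z @ (ZtZ_inv @ (Z.T @ X))
X_fast = (X_hat - h[:, None] * X) / (1.0 - h[:, None])

# Jackknife IV estimator using the leave-one-out instruments.
beta_jiv = np.linalg.solve(X_loo.T @ X, X_loo.T @ y)
print(beta_jiv, np.max(np.abs(X_loo - X_fast)))
```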

Independently  Weighted  2SLS 

A related method to split-sample IV is the independently weighted GMM estimator of Altonji and Segal (1996) given in Section 6.3.5. Splitting the sample into $G$ groups and specializing to linear IV yields the independently weighted IV estimator

$\hat{\beta}_{IWIV} = \frac{1}{G}\sum_{g=1}^{G}\left[X_g'Z_g\hat{S}_{(-g)}^{-1}Z_g'X_g\right]^{-1}X_g'Z_g\hat{S}_{(-g)}^{-1}Z_g'y_g$,

where $\hat{S}_{(-g)}$ is computed using $\hat{S}$ defined in (6.40) except that observations from the $g$th group are excluded. In a panel application Ziliak (1997) found that the independently weighted IV estimator performed much better than the unbiased split-sample IV estimator.


6.5.  Nonlinear  Instrumental  Variables 

Nonlinear IV methods, notably nonlinear 2SLS proposed by Amemiya (1974), permit consistent estimates of nonlinear regression models in situations where the NLS estimator is inconsistent because regressors are correlated with the error term. We present these methods as a straightforward extension of the GMM approach for linear models.

Unlike the linear case the estimators have no explicit formula, but the asymptotic distribution can be obtained as a special case of the Section 6.3 results. This section presents single-equation results, with systems results given in Section 6.10.4. A fundamentally important result is that a natural extension of Theil's 2SLS method for linear models to nonlinear models can lead to inconsistent parameter estimates (see Section 6.5.4). Instead, the GMM approach should be used.

An alternative nonlinearity can arise when the model for the dependent variable is a linear model, but the reduced form for the endogenous regressor(s) is a nonlinear model owing to special features of the dependent variable. For example, the endogenous regressor may be a count or a binary outcome. In that case the linear methods of the previous section still apply. One approach is to ignore the special nature of the endogenous regressor and just do regular linear 2SLS or optimal GMM. Alternatively, obtain fitted values for the endogenous regressor by appropriate nonlinear regression, such as Poisson regression on all the instruments if the endogenous regressor is a count, and then do regular linear IV using this fitted value as the instrument for the count, following Basmann's approach. Both estimators are consistent, though they have different asymptotic distributions. The first, simpler approach is the usual procedure.


6.5.1.  Nonlinear  GMM  with  Instruments 

Consider the quite general nonlinear regression model where the error term may be additive or nonadditive (see Section 6.2.2). Thus

$u_i = r(y_i, x_i, \beta)$,  (6.52)

where the nonlinear model with additive error is the special case

$u_i = y_i - g(x_i, \beta)$,  (6.53)

where $g(\cdot)$ is a specified function. The estimators given in Section 6.2.2 are inconsistent if $E[u_i|x_i] \neq 0$.

Assume the existence of $r$ instruments $z$, where $r \geq K$, that satisfy

$E[u_i|z_i] = 0$.  (6.54)

This is the same conditional moment condition as in the linear case, except that $u_i = r(y_i, x_i, \beta)$ rather than $u_i = y_i - x_i'\beta$.

Nonlinear  GMM  Estimator 

By the law of iterated expectations, (6.54) leads to

$E[z_iu_i] = 0$.  (6.55)

The GMM estimator minimizes the quadratic form in the corresponding sample moment condition.

In matrix notation let $u$ denote the $N \times 1$ error vector with $i$th entry $u_i$ given in (6.52) and let $Z$ be an $N \times r$ matrix of instruments with $i$th row $z_i'$. Then $\sum_i z_iu_i = Z'u$ and the GMM estimator in the nonlinear IV model $\hat{\beta}_{GMM}$ minimizes

$Q_N(\beta) = \left(\frac{1}{N}u'Z\right)W_N\left(\frac{1}{N}Z'u\right)$,  (6.56)

where $W_N$ is an $r \times r$ weighting matrix. Unlike linear GMM, the first-order conditions do not lead to a closed-form solution for $\hat{\beta}_{GMM}$.


Distribution  of  Nonlinear  GMM  Estimator 

The GMM estimator is consistent for $\beta$ given (6.54) and asymptotically normally distributed with estimated asymptotic variance

$\hat{V}[\hat{\beta}_{GMM}] = N[\hat{D}'ZW_NZ'\hat{D}]^{-1}[\hat{D}'ZW_N\hat{S}W_NZ'\hat{D}][\hat{D}'ZW_NZ'\hat{D}]^{-1}$  (6.57)

using the results from Section 6.3.3 with $h(\cdot) = zu$, where $\hat{S}$ is given in the following and $\hat{D}$ is an $N \times K$ matrix of derivatives of the error term

$\hat{D} = \left.\partial u/\partial\beta'\right|_{\hat{\beta}_{GMM}}$.  (6.58)

With nonadditive errors, $\hat{D}$ has $i$th row $\partial r(y_i, x_i, \beta)/\partial\beta'|_{\hat{\beta}}$. With additive errors, $\hat{D}$ has $i$th row $\partial g(x_i, \beta)/\partial\beta'|_{\hat{\beta}}$, ignoring the minus sign that cancels out in (6.57).

For independent heteroskedastic errors,

$\hat{S} = \frac{1}{N}\sum_i \hat{u}_i^2 z_i z_i'$,  (6.59)

similar to the linear case except now $\hat{u}_i = r(y_i, x_i, \hat{\beta})$ or $\hat{u}_i = y_i - g(x_i, \hat{\beta})$.

The asymptotic variance of the GMM estimator in the nonlinear model is therefore the same as that in the linear case given in (6.39), with the change that the regressor matrix $X$ is replaced by the derivative $\partial u/\partial\beta'|_{\hat{\beta}}$. This is exactly the same change as observed in Section 5.8 in going from linear to nonlinear least squares. By analogy with linear IV, the rank condition for identification is that plim $N^{-1}Z'\,\partial u/\partial\beta'|_{\beta_0}$ is of rank $K$ and the weaker order condition is that $r \geq K$.


6.5.2.  Different  Nonlinear  GMM  Estimators. 

Two leading specializations of the GMM estimator, which differ in the choice of weighting matrix, are optimal GMM that sets $W_N = \hat{S}^{-1}$ and nonlinear two-stage least squares (NL2SLS) that sets $W_N = (Z'Z)^{-1}$. Table 6.3 summarizes these estimators and their associated variance matrices, assuming independent heteroskedastic errors, and gives results for general $W_N$ and results for nonlinear IV in the just-identified model.
Table 6.3. GMM Estimators in Nonlinear IV Model and Their Asymptotic Variance^a

Estimator                         Definition and Asymptotic Variance

GMM                               $Q_{GMM}(\beta) = u'ZW_NZ'u$
(general $W_N$)                   $\hat{V}[\hat{\beta}] = N[\hat{D}'ZW_NZ'\hat{D}]^{-1}[\hat{D}'ZW_N\hat{S}W_NZ'\hat{D}][\hat{D}'ZW_NZ'\hat{D}]^{-1}$

Optimal GMM                       $Q_{OGMM}(\beta) = u'Z\hat{S}^{-1}Z'u$
($W_N = \hat{S}^{-1}$)            $\hat{V}[\hat{\beta}] = N[\hat{D}'Z\hat{S}^{-1}Z'\hat{D}]^{-1}$

NL2SLS                            $Q_{NL2SLS}(\beta) = u'Z(Z'Z)^{-1}Z'u$
($W_N = [N^{-1}Z'Z]^{-1}$)        $\hat{V}[\hat{\beta}] = N[\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}[\hat{D}'Z(Z'Z)^{-1}\hat{S}(Z'Z)^{-1}Z'\hat{D}][\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}$
                                  $\hat{V}[\hat{\beta}] = s^2[\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}$ if homoskedastic errors

NLIV                              $\hat{\beta}_{NLIV}$ solves $Z'u = 0$
(just-identified)                 $\hat{V}[\hat{\beta}] = N(Z'\hat{D})^{-1}\hat{S}(\hat{D}'Z)^{-1}$

a Equations are for a nonlinear regression model with error $u$ defined in (6.53) or (6.52) and instruments $Z$. $\hat{D}$ is the derivative of the error vector with respect to $\beta'$ evaluated at $\hat{\beta}$ and simplifies for models with additive error to the derivative of the conditional mean function with respect to $\beta'$ evaluated at $\hat{\beta}$. $\hat{S}$ is defined in (6.59). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the NL2SLS estimator.


Nonlinear  Instrumental  Variables 

In the just-identified case one can directly use the sample moment conditions corresponding to (6.55). This yields the method of moments estimator in the nonlinear IV model $\hat{\beta}_{NLIV}$ that solves

$\frac{1}{N}\sum_{i=1}^{N} z_iu_i = 0$,  (6.60)

or equivalently $Z'u = 0$, with asymptotic variance matrix given in Table 6.3.

Nonlinear estimators are often computed using iterative methods that obtain an optimum to an objective function rather than solve nonlinear systems of estimating equations. For the just-identified case $\hat{\beta}_{NLIV}$ can be computed as a GMM estimator minimizing (6.56) with any choice of weighting matrix, most simply $W_N = I$, leading to the same estimate.


Optimal  Nonlinear  GMM 

For overidentified models the optimal GMM estimator uses weighting matrix $W_N = \hat{S}^{-1}$. The optimal GMM estimator in the nonlinear IV model $\hat{\beta}_{OGMM}$ therefore minimizes

$Q_N(\beta) = \left(\frac{1}{N}u'Z\right)\hat{S}^{-1}\left(\frac{1}{N}Z'u\right)$.  (6.61)

The estimated asymptotic variance matrix given in Table 6.3 is of relatively simple form as (6.57) simplifies when $W_N = \hat{S}^{-1}$.



As in the linear case the optimal GMM estimator is a two-step estimator when errors are heteroskedastic. In computing the estimated variance one can use $\hat{S}$ as presented in Table 6.3, but it is more common to instead use an estimator $\tilde{S}$, say, that is also computed using (6.59) but evaluates the residual at the optimal GMM estimator rather than the first-step estimate used to form $\hat{S}$ in (6.61).

Nonlinear  2SLS 

A special case of the GMM estimator with instruments sets $W_N = (N^{-1}Z'Z)^{-1}$ in (6.56). This gives the nonlinear two-stage least-squares estimator $\hat{\beta}_{NL2SLS}$ that minimizes

$Q_N(\beta) = \frac{1}{N}u'Z(Z'Z)^{-1}Z'u$.  (6.62)

This estimator has the attraction of being the optimal GMM estimator if errors are homoskedastic, as then $\hat{S} = s^2Z'Z/N$, where $s^2$ is a consistent estimate of the constant $V[u|z]$, so $\hat{S}^{-1}$ is a multiple of $(Z'Z)^{-1}$.

With homoskedastic error this estimator has the simpler estimated asymptotic variance given in Table 6.3, a result often given in textbooks. However, in microeconometrics applications it is common to permit heteroskedastic errors and use the more complicated robust estimate also given in Table 6.3.

The NL2SLS estimator, proposed by Amemiya (1974), was an important precursor to GMM. The estimator can be motivated along similar lines to the first motivation for linear 2SLS given in Section 6.4.3. Thus premultiply the model error $u$ by the instruments $Z'$ to obtain $Z'u$, where $E[Z'u] = 0$ since $E[u|Z] = 0$. Then do nonlinear GLS regression. Assuming homoskedastic errors this minimizes

$Q_N(\beta) = u'Z[\sigma^2Z'Z]^{-1}Z'u$,

as $V[u|Z] = \sigma^2I$ implies $V[Z'u|Z] = \sigma^2Z'Z$. This objective function is just a scalar multiple of (6.62).

The Theil two-stage interpretation of linear 2SLS does not always carry over to nonlinear models (see Section 6.5.4). Moreover, NL2SLS is clearly a one-step estimator. Amemiya chose the name NL2SLS because, as in the linear case, it permits consistent estimation using instrumental variables. The name should not be taken literally, and clearer terms are nonlinear IV or nonlinear generalized IV estimation.

Instrument  Choice  in  Nonlinear  Models 

The preceding estimators presume the existence of instruments such that $E[u|z] = 0$ and that estimation is best if based on the unconditional moment condition $E[zu] = 0$.

Consider the nonlinear model with additive error so that $u = y - g(x, \beta)$. To be relevant the instrument must be correlated with the regressors $x$; yet to be valid it cannot be a direct causal variable for $y$. From the variance matrix given in (6.57) it is actually correlation of $z$ with $\partial g/\partial\beta$ rather than just $x$ that matters, to ensure that $\hat{D}'Z$ should be large. Weak instruments concerns are just as relevant here as in the linear case studied in Section 4.9.
Given likely heteroskedasticity the optimal moment condition on which to base estimation, given $E[u|z] = 0$, is not $E[zu] = 0$. From Section 6.3.7, however, the optimal moment condition requires additional moment assumptions that are difficult to make, so it is standard to use $E[zu] = 0$ as has been done here.

An alternative way to control for heteroskedasticity is to base GMM estimation on an error term defined to be close to homoskedastic. For example, with count data rather than use $u = y - \exp(x'\beta)$, work with the standardized error $u^* = u/\sqrt{\exp(x'\beta)}$ (see Section 6.2.2). Note, however, that $E[u^*|z] = 0$ and $E[u|z] = 0$ are different assumptions.

Often just one component of $x$ is correlated with $u$. Then, as in the linear case, the exogenous components can be used as instruments for themselves and the challenge is to find an additional instrument that is uncorrelated with $u$. There are some nonlinear applications that arise from formal economic models as in Section 6.2.7, in which case the many subcomponents of the information set are available as instruments.


6.5.3.  Poisson  IV  Example 

The  Poisson  regression  model  with  exogenous  regressors  specifies  E[v |x]  =  expix'/d). 
This  can  be  viewed  as  a  model  with  additive  error  u  =  y  —  exp(x'/3).  If  regressors 
are  endogenous  then  E[n|x]  /  0  and  the  Poisson  MLE  will  then  be  inconsistent.  Con¬ 
sistent  estimation  assumes  the  existence  of  instruments  z  that  satisfy  E[m|z]  =  0  or, 
equivalently, 

E[y  —  exp(x'/3)|z]  =  0. 


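The preceding results can be directly applied. The objective function is

$$Q_N(\boldsymbol{\beta}) = \Big[N^{-1}\sum\nolimits_i \mathbf{z}_i u_i\Big]' \mathbf{W}_N \Big[N^{-1}\sum\nolimits_i \mathbf{z}_i u_i\Big],$$

where $u_i = y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})$. The first-order conditions are then

$$\Big[\sum\nolimits_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i\mathbf{z}_i'\Big] \mathbf{W}_N \Big[\sum\nolimits_i \mathbf{z}_i\big(y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\big)\Big] = \mathbf{0}.$$

The asymptotic distribution is given in Table 6.3, with
$\hat{\mathbf{D}}'\mathbf{Z} = \sum_i e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{z}_i'$ since
$\partial g/\partial\boldsymbol{\beta} = \exp(\mathbf{x}'\boldsymbol{\beta})\mathbf{x}$ and $\hat{\mathbf{S}}$ defined in (6.39)
with $\hat{u}_i = y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$. The optimal GMM and NL2SLS estimators
differ in whether the weighting matrix is $\hat{\mathbf{S}}^{-1}$ or
$(N^{-1}\mathbf{Z}'\mathbf{Z})^{-1}$, where $\mathbf{Z}'\mathbf{Z} = \sum_i \mathbf{z}_i\mathbf{z}_i'$.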
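
The objective function and first-order conditions for the Poisson IV problem can be sketched numerically. The following is a minimal illustration only, not a full estimation routine: the function names, the simulated design, and the parameter values are assumptions made here for illustration, and the NL2SLS weighting matrix $(N^{-1}\mathbf{Z}'\mathbf{Z})^{-1}$ is used.

```python
import numpy as np

def moments(beta, y, X, Z):
    """Sample moment vector N^{-1} sum_i z_i u_i, with u_i = y_i - exp(x_i'beta)."""
    u = y - np.exp(X @ beta)
    return Z.T @ u / len(y)

def gmm_objective(beta, y, X, Z, W):
    """Quadratic form Q_N(beta) = g' W g in the sample moments g."""
    g = moments(beta, y, X, Z)
    return float(g @ W @ g)

def gmm_gradient(beta, y, X, Z, W):
    """Analytical gradient of Q_N; setting it to zero gives the first-order conditions."""
    n = len(y)
    mu = np.exp(X @ beta)
    D = (X * mu[:, None]).T @ Z / n          # N^{-1} sum_i exp(x_i'beta) x_i z_i'
    return -2.0 * D @ W @ moments(beta, y, X, Z)

# Hypothetical design: one endogenous regressor x and two excluded instruments.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
x = 0.5 * z[:, 0] + 0.5 * z[:, 1] + v
u = 0.5 * v + rng.normal(size=n)             # error is correlated with x through v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
beta0 = np.array([0.2, 0.3])                 # illustrative true values
y = np.exp(X @ beta0) + u                    # additive-error model, E[u|z] = 0

W_nl2sls = np.linalg.inv(Z.T @ Z / n)        # NL2SLS weighting matrix
q0 = gmm_objective(beta0, y, X, Z, W_nl2sls)
grad0 = gmm_gradient(beta0, y, X, Z, W_nl2sls)
```

In practice `gmm_objective` and `gmm_gradient` would be passed to a numerical optimizer; replacing `W_nl2sls` by the inverse of an estimate of $\mathbf{S}$ would give the optimal GMM version.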

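An alternative consistent estimator follows the Basmann approach. First, estimate
by OLS the reduced form $\mathbf{x}_i = \boldsymbol{\Pi}\mathbf{z}_i + \mathbf{v}_i$, giving $K$ predictions
$\hat{\mathbf{x}}_i = \hat{\boldsymbol{\Pi}}\mathbf{z}_i$. Second, estimate by nonlinear IV as in (6.60) with
instruments $\hat{\mathbf{x}}_i$ rather than $\mathbf{z}_i$. Given the OLS
formula for $\hat{\boldsymbol{\Pi}}$ this estimator solves

$$\Big[\sum\nolimits_i \mathbf{x}_i\mathbf{z}_i'\Big]\Big[\sum\nolimits_i \mathbf{z}_i\mathbf{z}_i'\Big]^{-1}\Big[\sum\nolimits_i \big(y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\big)\mathbf{z}_i\Big] = \mathbf{0}.$$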

This estimator differs from the NL2SLS estimator because the first term in the
left-hand side differs. Potential problems with instead generalizing Theil's method for
linear models are detailed in the next section.

Similar issues arise in nonlinear models other than Poisson regression, such as
models for binary data.

6.5.4.  Two-Stage  Estimation  in  Nonlinear  Models 

The usual interpretation of linear 2SLS can fail in nonlinear models. Thus suppose $y$
has mean $g(\mathbf{x}, \boldsymbol{\beta})$ and there are instruments $\mathbf{z}$ for the regressors $\mathbf{x}$. Then OLS regression
of $\mathbf{x}$ on instruments $\mathbf{z}$ to get fitted values $\hat{\mathbf{x}}$ followed by NLS regression of $y$ on
$g(\hat{\mathbf{x}}, \boldsymbol{\beta})$ can lead to inconsistent parameter estimates of $\boldsymbol{\beta}$, as we now demonstrate. Instead, one
needs to use the NL2SLS estimator presented in the previous section.

Consider the following simple model, based on one presented in Amemiya (1984),
that is nonlinear in variables though still linear in parameters. Let

$$\begin{aligned} y &= \beta x^2 + u, \\ x &= \pi z + v, \end{aligned} \tag{6.63}$$

where the zero-mean errors $u$ and $v$ are correlated. The regressor $x^2$ is endogenous,
since $x$ is a function of $v$ and by assumption $u$ and $v$ are correlated. As a result the
OLS estimator of $\beta$ is inconsistent. If $z$ is generated independently of the other random
variables in the model it is a valid instrument as it is clearly then independent of $u$ but
correlated with $x$.

The IV estimator is $\hat{\beta}_{IV} = (\sum_i z_i x_i^2)^{-1}\sum_i z_i y_i$. This can be implemented by a
regular IV regression of $y$ on $x^2$ with instrument $z$. Some algebra shows that, as expected,
$\hat{\beta}_{IV}$ equals the nonlinear IV estimator defined in (6.60).

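Suppose instead we perform the following two-stage least-squares estimation.
First, regress $x$ on $z$ to get $\hat{x} = \hat{\pi}z$ and then regress $y$ on $\hat{x}^2$. Then
$\hat{\beta}_{2SLS} = (\sum_i \hat{x}_i^2\hat{x}_i^2)^{-1}\sum_i \hat{x}_i^2 y_i$, where $\hat{x}_i^2$ is the square of the
prediction $\hat{x}_i$ obtained from OLS regression of $x$ on $z$. This yields an inconsistent
estimate. Adapting the proof for the linear case in Section 6.4.3 we have

$$\begin{aligned} y_i &= \beta x_i^2 + u_i \\ &= \beta\hat{x}_i^2 + w_i, \end{aligned}$$

where $w_i = \beta(x_i^2 - \hat{x}_i^2) + u_i$. An OLS regression of $y_i$ on $\hat{x}_i^2$ is inconsistent for $\beta$
because the regressor $\hat{x}_i^2$ is asymptotically correlated with the composite error term $w_i$.
Formally, $(x_i^2 - \hat{x}_i^2) = (\pi z_i + v_i)^2 - (\hat{\pi}z_i)^2 = \pi^2 z_i^2 + 2\pi z_i v_i + v_i^2 - \hat{\pi}^2 z_i^2$ implies,
using $\mathrm{plim}\,\hat{\pi} = \pi$ and some algebra, that
$\mathrm{plim}\,N^{-1}\sum_i \hat{x}_i^2(x_i^2 - \hat{x}_i^2) = \mathrm{plim}\,N^{-1}\sum_i \pi^2 z_i^2 v_i^2 \neq 0$
even if $z_i$ and $v_i$ are independent. Hence
$\mathrm{plim}\,N^{-1}\sum_i \hat{x}_i^2 w_i = \mathrm{plim}\,N^{-1}\sum_i \hat{x}_i^2\beta(x_i^2 - \hat{x}_i^2) \neq 0$.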

A variation that is consistent, however, is to regress $x^2$ rather than $x$ on $z$ at the first
stage and use the prediction $\widehat{x^2} \neq (\hat{x})^2$ at the second stage. It can be shown that this
equals $\hat{\beta}_{IV}$. The instrument for $x^2$ needs to be the fitted value for $x^2$ rather than the
square of the fitted value for $x$.
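
The equality with the IV estimator can be verified in a few lines. The following sketch assumes, as in (6.63), that neither stage includes an intercept; the symbol $\hat{\delta}$ for the first-stage coefficient is introduced here for illustration only.

```latex
\hat{\delta} = \frac{\sum_i z_i x_i^2}{\sum_i z_i^2},
\qquad
\widehat{x_i^2} = \hat{\delta} z_i ,
\qquad
\hat{\beta}
= \frac{\sum_i \widehat{x_i^2}\, y_i}{\sum_i \big(\widehat{x_i^2}\big)^2}
= \frac{\hat{\delta} \sum_i z_i y_i}{\hat{\delta}^2 \sum_i z_i^2}
= \frac{\sum_i z_i y_i}{\sum_i z_i x_i^2}
= \hat{\beta}_{IV} .
```

The last step substitutes $\hat{\delta}\sum_i z_i^2 = \sum_i z_i x_i^2$. No such cancellation occurs when the second-stage regressor is $(\hat{x}_i)^2 = \hat{\pi}^2 z_i^2$.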

This example generalizes to other nonlinear models where the nonlinearity is in
regressors only, so that

$$y = \mathbf{g}(\mathbf{x})'\boldsymbol{\beta} + u,$$

where $\mathbf{g}(\mathbf{x})$ is a nonlinear function of $\mathbf{x}$. Common examples are use of powers and
natural logarithm. Suppose $\mathrm{E}[u|\mathbf{z}] = 0$. Inconsistent estimates are obtained by regressing
$\mathbf{x}$ on $\mathbf{z}$ to get predictions $\hat{\mathbf{x}}$, and then regressing $y$ on $\mathbf{g}(\hat{\mathbf{x}})$. Consistent estimates can be
obtained by instead regressing $\mathbf{g}(\mathbf{x})$ on $\mathbf{z}$ to get predictions $\widehat{\mathbf{g}(\mathbf{x})}$, and then regressing $y$
on $\widehat{\mathbf{g}(\mathbf{x})}$ at the second stage. We use $\widehat{\mathbf{g}(\mathbf{x})}$ rather than $\mathbf{g}(\hat{\mathbf{x}})$ as instrument for $\mathbf{g}(\mathbf{x})$. Even
then the second-stage regression gives invalid standard errors as OLS output will use
residuals $\hat{u} = y - \widehat{\mathbf{g}(\mathbf{x})}'\hat{\boldsymbol{\beta}}$ rather than $\hat{u} = y - \mathbf{g}(\mathbf{x})'\hat{\boldsymbol{\beta}}$. It is best to directly use a GMM
or NL2SLS command.

Table 6.4. Nonlinear Two-Stage Least-Squares Example (a)

                        Estimator
Variable     OLS        NL2SLS     Two-Stage
x^2          1.189      0.960      1.642
             (0.025)    (0.046)    (0.172)
R^2          0.88       0.85       0.80

(a) The dgp given in the text has true coefficient equal to one. The sample
size is N = 200.

More generally models may be nonlinear in both variables and parameters. Consider
a single-index model with additive error, so that

$$y = g(\mathbf{x}'\boldsymbol{\beta}) + u.$$

Inconsistent estimates may be obtained by OLS of $\mathbf{x}$ on $\mathbf{z}$ to get predictions $\hat{\mathbf{x}}$, and then
NLS regression of $y$ on $g(\hat{\mathbf{x}}'\boldsymbol{\beta})$. Either GMM or NL2SLS needs to be used. Essentially,
for consistency we want $\widehat{g(\mathbf{x}'\boldsymbol{\beta})}$, not $g(\hat{\mathbf{x}}'\boldsymbol{\beta})$.


NL2SLS  Example 

We  consider  NL2SLS  estimation  in  a  model  with  a  simple  nonlinearity  resulting  from 
the  square  of  an  endogenous  variable  appearing  as  a  regressor,  as  in  the  previous 
section. 

The dgp is (6.63), so $y = \beta x^2 + u$ and $x = \pi z + v$, where $\beta = 1$, $\pi = 1$, and
$z = 1$ for all observations and $(u, v)$ are joint normal with means 0, variances 1, and
correlation 0.8. A sample of size 200 is drawn. Results are shown in Table 6.4.

The nonlinearity here is quite mild with the square of $x$ rather than $x$ appearing as
regressor. Interest lies in estimating its coefficient $\beta$. The OLS estimator is
inconsistent, whereas NL2SLS is consistent. The two-stage method, where first an OLS
regression of $x$ on $z$ is used to form $\hat{x}$ and then an OLS regression of $y$ on $(\hat{x})^2$ is performed,
yields an estimate that is more than two standard errors from the true value of
$\beta = 1$. The simulation also indicates a loss in goodness of fit and precision for NL2SLS
relative to OLS, with larger standard errors and lower $R^2$, similar to linear IV.
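
The comparison can be reproduced in outline with a short simulation. The sketch below follows the dgp stated above but, as an assumption made here, uses a much larger sample than N = 200 so that the estimates sit close to their probability limits. The function names and seed are illustrative, and no intercept is included in any regression.

```python
import numpy as np

def draw_sample(n, beta=1.0, pi=1.0, rho=0.8, seed=0):
    """Draw (y, x, z) from (6.63) with z = 1 and (u, v) joint normal, correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    z = np.ones(n)
    x = pi * z + v
    y = beta * x**2 + u
    return y, x, z

def three_estimators(y, x, z):
    """Return OLS, nonlinear IV, and the naive two-stage estimate of beta."""
    x2 = x**2
    b_ols = (x2 @ y) / (x2 @ x2)                 # OLS of y on x^2
    b_iv = (z @ y) / (z @ x2)                    # IV with instrument z for x^2
    pi_hat = (z @ x) / (z @ z)                   # first stage: OLS of x on z
    xhat_sq = (pi_hat * z) ** 2                  # square of the fitted value
    b_two_stage = (xhat_sq @ y) / (xhat_sq @ xhat_sq)
    return b_ols, b_iv, b_two_stage

y, x, z = draw_sample(n=200_000)
b_ols, b_iv, b_two_stage = three_estimators(y, x, z)
```

Under this dgp the normal moments give $\mathrm{plim}\,\hat{\beta}_{OLS} = 1 + \mathrm{E}[x^2 u]/\mathrm{E}[x^4] = 1.16$ and a two-stage limit of $\mathrm{E}[x^2]/(\mathrm{E}[x])^2 = 2$, whereas the IV estimate converges to the true value of one.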

6.6. Sequential Two-Step m-Estimation

Sequential two-step estimation procedures are estimation procedures where the
estimate of a parameter of ultimate interest is based on initial estimation of an
unknown parameter. An example is feasible GLS when the error has conditional
variance $\exp(\mathbf{z}'\boldsymbol{\gamma})$. Given an estimate $\hat{\boldsymbol{\gamma}}$ of $\boldsymbol{\gamma}$, the FGLS estimator $\hat{\boldsymbol{\beta}}$ minimizes
$\sum_{i=1}^N (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2/\exp(\mathbf{z}_i'\hat{\boldsymbol{\gamma}})$. A second example is the Heckman two-step estimator given in
Section 16.10.2.

These  estimators  are  attractive  as  they  can  provide  a  relatively  simple  way  to  obtain 
consistent  parameter  estimates.  However,  for  valid  statistical  inference  it  may  be  nec¬ 
essary  to  adjust  the  asymptotic  variance  of  the  second-step  estimator  to  allow  for  the 
first-step  estimation.  We  present  results  for  the  special  case  where  the  estimating  equa¬ 
tions  for  both  the  first-  and  second-step  estimators  set  a  sample  average  to  zero,  which 
is  the  case  for  m-estimators,  method  of  moments,  and  estimating  equations  estimators. 

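Partition the parameter vector $\boldsymbol{\theta}$ into $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, with ultimate interest in $\boldsymbol{\theta}_2$. The
model is estimated sequentially by first obtaining $\hat{\boldsymbol{\theta}}_1$ that solves
$N^{-1}\sum_{i=1}^N \mathbf{h}_{1i}(\hat{\boldsymbol{\theta}}_1) = \mathbf{0}$ and
then, given $\hat{\boldsymbol{\theta}}_1$, obtaining $\hat{\boldsymbol{\theta}}_2$ that solves
$N^{-1}\sum_{i=1}^N \mathbf{h}_{2i}(\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2) = \mathbf{0}$. In general the
distribution of $\hat{\boldsymbol{\theta}}_2$ given estimation of $\boldsymbol{\theta}_1$ differs from, and is more complicated than,
the distribution of $\hat{\boldsymbol{\theta}}_2$ if $\boldsymbol{\theta}_1$ is known. Statistical inference is invalid if it fails to take
into account this complication, except in some special cases given at the end of this
section.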

The  following  derivation  is  given  in  Newey  (1984),  with  similar  results  obtained  by 
Murphy  and  Topel  (1985)  and  Pagan  (1986).  The  two-step  estimator  can  be  rewritten 
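as a one-step estimator where $(\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2)$ jointly solve the equations

$$\begin{aligned} N^{-1}\sum_{i=1}^N \mathbf{h}_1(\mathbf{w}_i, \hat{\boldsymbol{\theta}}_1) &= \mathbf{0}, \\ N^{-1}\sum_{i=1}^N \mathbf{h}_2(\mathbf{w}_i, \hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2) &= \mathbf{0}. \end{aligned} \tag{6.64}$$

Defining $\boldsymbol{\theta} = (\boldsymbol{\theta}_1'\ \boldsymbol{\theta}_2')'$ and $\mathbf{h}_i = (\mathbf{h}_{1i}'\ \mathbf{h}_{2i}')'$, we can write the equations as

$$N^{-1}\sum_{i=1}^N \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol{\theta}}) = \mathbf{0}.$$

In this setup it is assumed that $\dim(\mathbf{h}_1) = \dim(\boldsymbol{\theta}_1)$ and $\dim(\mathbf{h}_2) = \dim(\boldsymbol{\theta}_2)$, so that the
number of estimating equations equals the number of parameters. Then (6.64) is an
estimating equations estimator or MM estimator.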

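Consistency requires that $\mathrm{plim}\,N^{-1}\sum_i \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}_0) = \mathbf{0}$, where
$\boldsymbol{\theta}_0 = [\boldsymbol{\theta}_{10}'\ \boldsymbol{\theta}_{20}']'$. This
condition should be satisfied if $\hat{\boldsymbol{\theta}}_1$ is consistent for $\boldsymbol{\theta}_{10}$ in the first step, and if
second-step estimation of $\boldsymbol{\theta}_2$ with $\boldsymbol{\theta}_{10}$ known (rather than estimated by $\hat{\boldsymbol{\theta}}_1$) would lead to
a consistent estimate of $\boldsymbol{\theta}_{20}$. Within a method of moments framework we require
$\mathrm{E}[\mathbf{h}_{1i}(\boldsymbol{\theta}_{10})] = \mathbf{0}$ and $\mathrm{E}[\mathbf{h}_{2i}(\boldsymbol{\theta}_{10}, \boldsymbol{\theta}_{20})] = \mathbf{0}$. We assume that consistency is established.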

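For the asymptotic distribution we apply the general result that

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\big[\mathbf{0},\ \mathbf{G}_0^{-1}\mathbf{S}_0(\mathbf{G}_0^{-1})'\big],$$

where $\mathbf{G}_0$ and $\mathbf{S}_0$ are defined in Proposition 6.1. Partition $\mathbf{G}_0$ and $\mathbf{S}_0$ in a similar way
to the partitioning of $\boldsymbol{\theta}$ and $\mathbf{h}_i$. Then

$$\mathbf{G}_0 = \lim \frac{1}{N}\sum_{i=1}^N \mathrm{E}\begin{bmatrix} \partial\mathbf{h}_{1i}/\partial\boldsymbol{\theta}_1' & \mathbf{0} \\ \partial\mathbf{h}_{2i}/\partial\boldsymbol{\theta}_1' & \partial\mathbf{h}_{2i}/\partial\boldsymbol{\theta}_2' \end{bmatrix} = \begin{bmatrix} \mathbf{G}_{11} & \mathbf{0} \\ \mathbf{G}_{21} & \mathbf{G}_{22} \end{bmatrix},$$

using $\partial\mathbf{h}_{1i}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}_2' = \mathbf{0}$ since $\mathbf{h}_{1i}(\boldsymbol{\theta})$ is not a function of $\boldsymbol{\theta}_2$ from (6.64). Since $\mathbf{G}_0$, $\mathbf{G}_{11}$,
and $\mathbf{G}_{22}$ are square matrices

$$\mathbf{G}_0^{-1} = \begin{bmatrix} \mathbf{G}_{11}^{-1} & \mathbf{0} \\ -\mathbf{G}_{22}^{-1}\mathbf{G}_{21}\mathbf{G}_{11}^{-1} & \mathbf{G}_{22}^{-1} \end{bmatrix}.$$

Clearly,

$$\mathbf{S}_0 = \lim \frac{1}{N}\sum_{i=1}^N \mathrm{E}\begin{bmatrix} \mathbf{h}_{1i}\mathbf{h}_{1i}' & \mathbf{h}_{1i}\mathbf{h}_{2i}' \\ \mathbf{h}_{2i}\mathbf{h}_{1i}' & \mathbf{h}_{2i}\mathbf{h}_{2i}' \end{bmatrix} = \begin{bmatrix} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \mathbf{S}_{21} & \mathbf{S}_{22} \end{bmatrix}.$$

The asymptotic variance of $\hat{\boldsymbol{\theta}}_2$ is the (2, 2) submatrix of the variance matrix of $\hat{\boldsymbol{\theta}}$. After
some algebra, we get

$$\mathrm{V}[\hat{\boldsymbol{\theta}}_2] = \mathbf{G}_{22}^{-1}\Big\{\mathbf{S}_{22} + \mathbf{G}_{21}\big[\mathbf{G}_{11}^{-1}\mathbf{S}_{11}(\mathbf{G}_{11}^{-1})'\big]\mathbf{G}_{21}' - \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{S}_{12} - \mathbf{S}_{21}(\mathbf{G}_{11}^{-1})'\mathbf{G}_{21}'\Big\}(\mathbf{G}_{22}^{-1})'. \tag{6.65}$$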


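The usual computer output yields standard errors that are incorrect and understate
the true standard errors, since $\mathrm{V}[\hat{\boldsymbol{\theta}}_2]$ is then assumed to be
$\mathbf{G}_{22}^{-1}\mathbf{S}_{22}(\mathbf{G}_{22}^{-1})'$, which can be
shown to be smaller than the true variance given in (6.65).

There is no need to account for additional variability in the second step caused by
estimation in the first step in the special case that $\mathrm{E}[\partial\mathbf{h}_{2i}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}_1'] = \mathbf{0}$, as then $\mathbf{G}_{21} = \mathbf{0}$
and $\mathrm{V}[\hat{\boldsymbol{\theta}}_2]$ in (6.65) reduces to $\mathbf{G}_{22}^{-1}\mathbf{S}_{22}(\mathbf{G}_{22}^{-1})'$.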
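
As a check on the algebra, the (2, 2) block of the full sandwich matrix can be compared numerically with the closed form in (6.65). The sketch below uses randomly generated blocks purely for illustration; the function names and dimensions are assumptions made here, not part of any estimation package.

```python
import numpy as np

def two_step_variance(G11, G21, G22, S11, S12, S22):
    """Variance of the second-step estimator, following the closed form (6.65)."""
    G11_inv = np.linalg.inv(G11)
    G22_inv = np.linalg.inv(G22)
    V1 = G11_inv @ S11 @ G11_inv.T               # robust variance of the first step
    middle = (S22 + G21 @ V1 @ G21.T
              - G21 @ G11_inv @ S12
              - S12.T @ G11_inv.T @ G21.T)       # S21 = S12'
    return G22_inv @ middle @ G22_inv.T

def naive_variance(G22, S22):
    """Variance that ignores first-step estimation error."""
    G22_inv = np.linalg.inv(G22)
    return G22_inv @ S22 @ G22_inv.T

# Hypothetical blocks: dim(theta_1) = 2, dim(theta_2) = 3.
rng = np.random.default_rng(1)
k1, k2 = 2, 3
G11 = rng.normal(size=(k1, k1)) + 3.0 * np.eye(k1)
G22 = rng.normal(size=(k2, k2)) + 3.0 * np.eye(k2)
G21 = rng.normal(size=(k2, k1))
A = rng.normal(size=(k1 + k2, k1 + k2))
S0 = A @ A.T                                     # symmetric positive semidefinite
S11, S12, S22 = S0[:k1, :k1], S0[:k1, k1:], S0[k1:, k1:]

V22 = two_step_variance(G11, G21, G22, S11, S12, S22)

# Full sandwich G0^{-1} S0 (G0^{-1})' with the block-triangular G0.
G0 = np.block([[G11, np.zeros((k1, k2))], [G21, G22]])
G0_inv = np.linalg.inv(G0)
V_full = G0_inv @ S0 @ G0_inv.T
```

Here `V22` agrees with `V_full[k1:, k1:]`, and setting `G21` to a zero matrix reproduces `naive_variance(G22, S22)`.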

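A well-known example of $\mathbf{G}_{21} = \mathbf{0}$ is FGLS. Then for heteroskedastic errors

$$\mathbf{h}_{2i}(\boldsymbol{\theta}) = \frac{\mathbf{x}_i(y_i - \mathbf{x}_i'\boldsymbol{\theta}_2)}{\sigma(\mathbf{x}_i, \boldsymbol{\theta}_1)},$$

where $\mathrm{V}[y_i|\mathbf{x}_i] = \sigma^2(\mathbf{x}_i, \boldsymbol{\theta}_1)$, and

$$\mathrm{E}[\partial\mathbf{h}_{2i}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}_1'] = \mathrm{E}\left[-\mathbf{x}_i\frac{(y_i - \mathbf{x}_i'\boldsymbol{\theta}_2)}{\sigma(\mathbf{x}_i, \boldsymbol{\theta}_1)^2}\frac{\partial\sigma(\mathbf{x}_i, \boldsymbol{\theta}_1)}{\partial\boldsymbol{\theta}_1'}\right],$$

which equals zero since $\mathrm{E}[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\theta}_2$. Furthermore, for FGLS consistency of $\hat{\boldsymbol{\theta}}_2$
does not require that $\hat{\boldsymbol{\theta}}_1$ be consistent since $\mathrm{E}[\mathbf{h}_{2i}(\boldsymbol{\theta})] = \mathbf{0}$ just requires that
$\mathrm{E}[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\theta}_2$, which does not depend on $\boldsymbol{\theta}_1$.

A second example of $\mathbf{G}_{21} = \mathbf{0}$ is ML estimation with a block-diagonal information matrix so that
$\mathrm{E}[\partial^2\mathcal{L}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}_1\partial\boldsymbol{\theta}_2'] = \mathbf{0}$. This is the case for example for regression under normality,
where $\boldsymbol{\theta}_1$ are the variance parameters and $\boldsymbol{\theta}_2$ are the regression parameters.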

In other examples, however, $\mathbf{G}_{21} \neq \mathbf{0}$ and the more cumbersome expression (6.65)
needs to be used. This is done automatically by computer packages for some standard
two-step estimators, most notably Heckman's two-step estimator of the sample
selection model given in Section 16.5.4. Otherwise, $\mathrm{V}[\hat{\boldsymbol{\theta}}_2]$ needs to be computed manually.
Many of the components come from earlier estimation. In particular,
$\mathbf{G}_{11}^{-1}\mathbf{S}_{11}(\mathbf{G}_{11}^{-1})'$ is
the robust variance matrix estimate of $\hat{\boldsymbol{\theta}}_1$ and $\mathbf{G}_{22}^{-1}\mathbf{S}_{22}(\mathbf{G}_{22}^{-1})'$ is the robust variance matrix
estimate of $\hat{\boldsymbol{\theta}}_2$ that incorrectly ignores the estimation error in $\hat{\boldsymbol{\theta}}_1$. For data independent
over $i$ the subcomponents of the $\mathbf{S}_0$ submatrix are consistently estimated by
$\hat{\mathbf{S}}_{jk} = N^{-1}\sum_i \hat{\mathbf{h}}_{ji}\hat{\mathbf{h}}_{ki}'$, $j, k = 1, 2$. This leaves computation of
$\hat{\mathbf{G}}_{21} = N^{-1}\sum_i \partial\mathbf{h}_{2i}/\partial\boldsymbol{\theta}_1'\big|_{\hat{\boldsymbol{\theta}}}$
as the main challenge.

A recommended simpler approach is to obtain bootstrap standard errors (see
Section 16.2.5), or directly jointly estimate $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ in the combined model (6.64),
assuming access to a GMM routine.

These  simpler  approaches  can  also  be  applied  to  sequential  estimators  that  are 
GMM  estimators  rather  than  m-estimators.  Then  combining  the  two  estimators  will 
lead  to  a  set  of  conditions  more  complicated  than  (6.64)  and  we  no  longer  get  (6.65). 
However,  one  can  still  bootstrap  or  estimate  jointly  rather  than  sequentially. 


6.7.  Minimum  Distance  Estimation 

Minimum distance estimation provides a way to estimate structural parameters $\boldsymbol{\theta}$ that
are a specified function of reduced form parameters $\boldsymbol{\pi}$, given a consistent estimate
$\hat{\boldsymbol{\pi}}$ of $\boldsymbol{\pi}$.

A standard reference is Ferguson (1958). Rothenberg (1973) applied this method
to linear simultaneous equations models, though the alternative methods given in
Section 6.9.6 are the standard methods used. Minimum distance estimation is most often
used in panel data analysis. In the initial work by Chamberlain (1982, 1984) (see
Section 22.2.7) he lets $\hat{\boldsymbol{\pi}}$ be OLS estimates from linear regression of the current-period
dependent variable on regressors in all periods. Subsequent applications to covariance
structures (see Section 22.5.4) let $\hat{\boldsymbol{\pi}}$ be estimated variances and autocovariances of the
panel data. See also the indirect inference method (Section 12.6).

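Suppose that the relationship between $q$ structural parameters and $r > q$ reduced
form parameters is that $\boldsymbol{\pi}_0 = \mathbf{g}(\boldsymbol{\theta}_0)$. Further suppose that we have a consistent estimate
$\hat{\boldsymbol{\pi}}$ of the reduced form parameters. An obvious estimator is $\hat{\boldsymbol{\theta}}$ such that $\hat{\boldsymbol{\pi}} = \mathbf{g}(\hat{\boldsymbol{\theta}})$, but
this is infeasible since $q < r$. Instead, the minimum distance (MD) estimator $\hat{\boldsymbol{\theta}}_{MD}$
minimizes with respect to $\boldsymbol{\theta}$ the objective function

$$Q_N(\boldsymbol{\theta}) = (\hat{\boldsymbol{\pi}} - \mathbf{g}(\boldsymbol{\theta}))'\mathbf{W}_N(\hat{\boldsymbol{\pi}} - \mathbf{g}(\boldsymbol{\theta})), \tag{6.66}$$

where $\mathbf{W}_N$ is an $r \times r$ weighting matrix.

If $\hat{\boldsymbol{\pi}} \xrightarrow{p} \boldsymbol{\pi}_0$ and $\mathbf{W}_N \xrightarrow{p} \mathbf{W}_0$, where $\mathbf{W}_0$ is finite positive semidefinite, then
$Q_N(\boldsymbol{\theta}) \xrightarrow{p} Q_0(\boldsymbol{\theta}) = (\boldsymbol{\pi}_0 - \mathbf{g}(\boldsymbol{\theta}))'\mathbf{W}_0(\boldsymbol{\pi}_0 - \mathbf{g}(\boldsymbol{\theta}))$. It follows that $\boldsymbol{\theta}_0$ is locally identified
if $\mathrm{Rank}[\mathbf{W}_0 \times \partial\mathbf{g}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'] = q$, while consistency essentially requires that $\boldsymbol{\pi}_0 = \mathbf{g}(\boldsymbol{\theta}_0)$.

For the MD estimator $\sqrt{N}(\hat{\boldsymbol{\theta}}_{MD} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathrm{V}[\hat{\boldsymbol{\theta}}_{MD}]]$, where

$$\mathrm{V}[\hat{\boldsymbol{\theta}}_{MD}] = (\mathbf{G}_0'\mathbf{W}_0\mathbf{G}_0)^{-1}(\mathbf{G}_0'\mathbf{W}_0\mathrm{V}[\hat{\boldsymbol{\pi}}]\mathbf{W}_0\mathbf{G}_0)(\mathbf{G}_0'\mathbf{W}_0\mathbf{G}_0)^{-1}, \tag{6.67}$$

$\mathbf{G}_0 = \partial\mathbf{g}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'\big|_{\boldsymbol{\theta}_0}$, and it is assumed that the reduced form parameters $\hat{\boldsymbol{\pi}}$ have limit
distribution $\sqrt{N}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathrm{V}[\hat{\boldsymbol{\pi}}]]$. More efficient reduced form estimators lead
to more efficient MD estimators, since smaller $\mathrm{V}[\hat{\boldsymbol{\pi}}]$ leads to smaller $\mathrm{V}[\hat{\boldsymbol{\theta}}_{MD}]$ in (6.67).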
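
To obtain the result (6.67), begin with the following rescaling of the first-order
conditions for the MD estimator:

$$\mathbf{G}_N(\hat{\boldsymbol{\theta}})'\mathbf{W}_N\sqrt{N}(\hat{\boldsymbol{\pi}} - \mathbf{g}(\hat{\boldsymbol{\theta}})) = \mathbf{0}, \tag{6.68}$$

where $\mathbf{G}_N(\boldsymbol{\theta}) = \partial\mathbf{g}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'$. An exact first-order Taylor series expansion about $\boldsymbol{\theta}_0$
yields

$$\sqrt{N}(\hat{\boldsymbol{\pi}} - \mathbf{g}(\hat{\boldsymbol{\theta}})) = \sqrt{N}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0) - \mathbf{G}_N(\boldsymbol{\theta}^+)\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0), \tag{6.69}$$

where $\boldsymbol{\theta}^+$ lies between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_0$ and we have used $\mathbf{g}(\boldsymbol{\theta}_0) = \boldsymbol{\pi}_0$. Substituting (6.69)
back into (6.68) and solving for $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ yields

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = \big[\mathbf{G}_N(\hat{\boldsymbol{\theta}})'\mathbf{W}_N\mathbf{G}_N(\boldsymbol{\theta}^+)\big]^{-1}\mathbf{G}_N(\hat{\boldsymbol{\theta}})'\mathbf{W}_N\sqrt{N}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0), \tag{6.70}$$

which leads directly to (6.67).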

For given reduced form estimator $\hat{\boldsymbol{\pi}}$, the most efficient MD estimator uses weighting
matrix $\mathbf{W}_N = \hat{\mathrm{V}}[\hat{\boldsymbol{\pi}}]^{-1}$ in (6.66). This estimator is called the optimal MD (OMD)
estimator, and sometimes the minimum chi-square estimator following Ferguson
(1958).

A common alternative special case is the equally weighted minimum distance
(EWMD) estimator, which sets $\mathbf{W}_N = \mathbf{I}$. This is less efficient than the OMD
estimator, but it does not have the finite-sample bias problems analogous to those discussed
in Section 6.3.5 that arise when the optimal weighting matrix is used. The EWMD
estimator can be simply obtained by NLS regression of $\hat{\pi}_j$ on $g_j(\boldsymbol{\theta})$, $j = 1, \ldots, r$, since
minimizing $(\hat{\boldsymbol{\pi}} - \mathbf{g}(\boldsymbol{\theta}))'(\hat{\boldsymbol{\pi}} - \mathbf{g}(\boldsymbol{\theta}))$ yields the same first-order conditions as those in
(6.68) with $\mathbf{W}_N = \mathbf{I}$.

The minimized value of the objective function for the OMD is chi-squared
distributed. Specifically,

$$(\hat{\boldsymbol{\pi}} - \mathbf{g}(\hat{\boldsymbol{\theta}}_{OMD}))'\hat{\mathrm{V}}[\hat{\boldsymbol{\pi}}]^{-1}(\hat{\boldsymbol{\pi}} - \mathbf{g}(\hat{\boldsymbol{\theta}}_{OMD})) \tag{6.71}$$

is asymptotically distributed as $\chi^2(r - q)$ under $H_0: \mathbf{g}(\boldsymbol{\theta}_0) = \boldsymbol{\pi}_0$. This provides a
model specification test analogous to the OIR test of Section 6.3.8.
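
The OMD and EWMD estimators and the statistic (6.71) can be illustrated in the simplest case. The sketch below assumes, for illustration only, a linear restriction $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{A}\boldsymbol{\theta}$ so that the minimizer has a closed form; the matrix `A`, the variance matrix `V_pi`, and the parameter values are hypothetical. A nonlinear $\mathbf{g}(\cdot)$ would require numerical minimization of (6.66) instead.

```python
import numpy as np

def md_linear(pi_hat, A, W):
    """MD estimator when g(theta) = A theta (an assumption of this sketch)."""
    AtW = A.T @ W
    return np.linalg.solve(AtW @ A, AtW @ pi_hat)

def md_variance(A, W, V_pi):
    """Sandwich variance (6.67) with G0 = A."""
    bread = np.linalg.inv(A.T @ W @ A)
    return bread @ (A.T @ W @ V_pi @ W @ A) @ bread

def omd_statistic(pi_hat, A, V_pi):
    """Minimized OMD objective as in (6.71), with V_pi the variance of pi_hat."""
    W = np.linalg.inv(V_pi)
    resid = pi_hat - A @ md_linear(pi_hat, A, W)
    return float(resid @ W @ resid)

# Hypothetical setup: r = 3 reduced form parameters, q = 2 structural parameters.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta0 = np.array([0.5, -1.0])
V_pi = np.array([[0.04, 0.01, 0.00],
                 [0.01, 0.09, 0.02],
                 [0.00, 0.02, 0.16]])
rng = np.random.default_rng(2)
pi_hat = rng.multivariate_normal(A @ theta0, V_pi)   # one draw of the reduced form estimate

W_opt = np.linalg.inv(V_pi)
theta_omd = md_linear(pi_hat, A, W_opt)
theta_ewmd = md_linear(pi_hat, A, np.eye(3))
V_omd = md_variance(A, W_opt, V_pi)
V_ewmd = md_variance(A, np.eye(3), V_pi)
stat = omd_statistic(pi_hat, A, V_pi)                # compare with chi-squared(r - q = 1)
```

With the optimal weighting matrix the sandwich collapses to $(\mathbf{A}'\mathrm{V}[\hat{\boldsymbol{\pi}}]^{-1}\mathbf{A})^{-1}$, and `V_ewmd - V_omd` is positive semidefinite, in line with the efficiency ranking stated above.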

The MD estimator is qualitatively similar to the GMM estimator. The GMM
framework is the standard one employed. MD estimation is most often used in panel studies
of covariance structures, since then $\hat{\boldsymbol{\pi}}$ comprises easily estimated sample moments
(variances and covariances) that can then be used to obtain $\hat{\boldsymbol{\theta}}$.


6.8.  Empirical  Likelihood 

The MM and GMM approaches do not require complete specification of the
conditional density. Instead, estimation is based on moment conditions of the form
$\mathrm{E}[\mathbf{h}(y, \mathbf{x}, \boldsymbol{\theta})] = \mathbf{0}$. The empirical likelihood approach, due to Owen (1988), is an
alternative estimation procedure based on the same moment condition.

An  attraction  of  the  empirical  likelihood  estimator  is  that,  although  it  is  asymptoti¬ 
cally  equivalent  to  the  GMM  estimator,  it  has  different  finite-sample  properties,  and  in 
some  examples  it  outperforms  the  GMM  estimator. 


6.8.1. Empirical Likelihood Estimation of Population Mean

We begin with empirical likelihood in the case of a scalar iid random variable $y$
with density $f(y)$ and sample likelihood function $\prod_i f(y_i)$. The complication
considered here is that the density $f(y)$ is not specified, so the usual ML approach is not
possible.

A completely nonparametric approach seeks to estimate the density $f(y)$ evaluated
at each of the sample values of $y$. Let $\pi_i = f(y_i)$ denote the probability that the $i$th
observation on $y$ takes the realized value $y_i$. Then the goal is to maximize the
so-called empirical likelihood function $\prod_i \pi_i$, or equivalently to maximize the empirical
log-likelihood function $N^{-1}\sum_i \ln\pi_i$, which is a multinomial model with no structure
placed on $\pi_i$. This log-likelihood is unbounded, unless a constraint is placed on the
range of values taken by $\pi_i$. The normalization used is that $\sum_i \pi_i = 1$. This yields the
standard estimate of the cumulative distribution function in the fully nonparametric
case, as we now demonstrate.

The empirical likelihood estimator maximizes with respect to $\boldsymbol{\pi}$ and $\eta$ the
Lagrangian

$$\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta) = \frac{1}{N}\sum_{i=1}^N \ln\pi_i - \eta\Big(\sum_{i=1}^N \pi_i - 1\Big), \tag{6.72}$$

where $\boldsymbol{\pi} = [\pi_1 \ldots \pi_N]'$ and $\eta$ is a Lagrange multiplier. Although the data $y_i$ do not
explicitly appear in (6.72) they appear implicitly as $\pi_i = f(y_i)$. Setting the derivatives
with respect to $\pi_i$ ($i = 1, \ldots, N$) and $\eta$ to zero and solving yields $\hat{\pi}_i = 1/N$ and
$\hat{\eta} = 1$. Thus the estimated density function $\hat{f}(y)$ has mass $1/N$ at each of the realized values
$y_i$, $i = 1, \ldots, N$. The resulting distribution function is $\hat{F}(y) = N^{-1}\sum_{i=1}^N \mathbf{1}(y_i \leq y)$,
where $\mathbf{1}(A) = 1$ if event $A$ occurs and 0 otherwise. $\hat{F}(y)$ is just the usual empirical
distribution function.

Now introduce parameters. As a simple example, suppose we introduce the moment
restriction that $\mathrm{E}[y - \mu] = 0$, where $\mu$ is the unknown population mean. In the
empirical likelihood context this population moment is replaced by a sample moment, where
the sample moment weights sample values by the probabilities $\pi_i$. Thus we introduce
the constraint that $\sum_i \pi_i(y_i - \mu) = 0$. The Lagrangian for the maximum empirical
likelihood estimator is

$$\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \lambda, \mu) = \frac{1}{N}\sum_{i=1}^N \ln\pi_i - \eta\Big(\sum_{i=1}^N \pi_i - 1\Big) - \lambda\sum_{i=1}^N \pi_i(y_i - \mu), \tag{6.73}$$

where $\eta$ and $\lambda$ are Lagrange multipliers.

Begin by differentiating the Lagrangian with respect to $\pi_i$ ($i = 1, \ldots, N$), $\eta$, and
$\lambda$ but not $\mu$. Setting these derivatives to zero yields equations that are functions
of $\mu$. Solving leads to the solution $\pi_i = \pi_i(\mu)$ and hence an empirical likelihood
$N^{-1}\sum_i \ln\pi_i(\mu)$ that is then maximized with respect to $\mu$. This solution method leads
to nonlinear equations that need to be solved numerically.

For this particular problem an easier way to solve for $\mu$ is to note that the
maximized value of $\mathcal{L}(\boldsymbol{\pi}, \eta, \lambda, \mu)$ must be less than or equal to $N^{-1}\sum_i \ln N^{-1}$, since
this is the maximized value without the last constraint. However, $\mathcal{L}(\boldsymbol{\pi}, \eta, \lambda, \mu)$ equals
$N^{-1}\sum_i \ln N^{-1}$ if $\pi_i = 1/N$ and $\hat{\mu} = N^{-1}\sum_i y_i = \bar{y}$. So the maximum empirical
likelihood estimator of the population mean is the sample mean.
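
The general solution method described above, in which the weights $\pi_i(\mu)$ are found numerically for a given $\mu$, can be sketched for this example. With $\eta = 1$ the first-order conditions give $\pi_i = 1/[N(1 + \lambda(y_i - \mu))]$, where $\lambda$ solves $\sum_i (y_i - \mu)/(1 + \lambda(y_i - \mu)) = 0$. The code below is a minimal illustration that finds $\lambda$ by bisection; the function names, the simulated data, and the choice of bisection are assumptions made here.

```python
import numpy as np

def el_weights(y, mu, iters=200):
    """Empirical likelihood weights pi_i(mu) and multiplier lambda for E[y - mu] = 0."""
    h = y - mu
    if not (h.min() < 0.0 < h.max()):
        raise ValueError("mu must lie strictly inside the range of the data")
    # lambda must keep 1 + lambda * h_i > 0 for all i; the constraint function
    # sum_i h_i / (1 + lambda * h_i) is strictly decreasing in lambda on this interval.
    lo = (-1.0 / h.max()) * (1.0 - 1e-10)
    hi = (-1.0 / h.min()) * (1.0 - 1e-10)
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum(h / (1.0 + lam * h)) > 0.0:
            lo = lam                      # root lies to the right
        else:
            hi = lam                      # root lies to the left
    lam = 0.5 * (lo + hi)
    return 1.0 / (len(y) * (1.0 + lam * h)), lam

def log_el(y, mu):
    """Profile empirical log-likelihood N^{-1} sum_i ln pi_i(mu)."""
    w, _ = el_weights(y, mu)
    return float(np.mean(np.log(w)))

# Hypothetical data.
rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.0, size=50)

w_bar, lam_bar = el_weights(y, y.mean())          # weights at the sample mean
w_off, lam_off = el_weights(y, y.mean() + 0.3)    # weights at another candidate mean
```

At $\mu = \bar{y}$ the multiplier is zero and every weight equals $1/N$, so the profile log-likelihood attains its upper bound $\ln N^{-1}$; at any other candidate value the weights are tilted away from $1/N$ and the profile log-likelihood is lower.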


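6.8.2. Empirical Likelihood Estimation of Regression Parameters

Now consider regression data that are iid over $i$. The only structure placed on the
model is the set of $r$ moment conditions

$$\mathrm{E}[\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta})] = \mathbf{0}, \tag{6.74}$$

where $\mathbf{h}(\cdot)$ and $\mathbf{w}_i$ are defined in Section 6.3.1. For example, $\mathbf{h}(\mathbf{w}, \boldsymbol{\theta}) = \mathbf{x}(y - \mathbf{x}'\boldsymbol{\theta})$ for
OLS estimation and $\mathbf{h}(y, \mathbf{x}, \boldsymbol{\theta}) = (\partial g/\partial\boldsymbol{\theta})(y - g(\mathbf{x}, \boldsymbol{\theta}))$ for NLS estimation.

The empirical likelihood approach maximizes the empirical likelihood function
$N^{-1}\sum_i \ln\pi_i$ subject to the constraint $\sum_i \pi_i = 1$ (see (6.72)) and the additional
sample constraint based on the population moment condition (6.74) that

$$\sum_{i=1}^N \pi_i\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{0}. \tag{6.75}$$

Thus we maximize with respect to $\boldsymbol{\pi}$, $\eta$, $\boldsymbol{\lambda}$, and $\boldsymbol{\theta}$

$$\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \boldsymbol{\lambda}, \boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \ln\pi_i - \eta\Big(\sum_{i=1}^N \pi_i - 1\Big) - \boldsymbol{\lambda}'\sum_{i=1}^N \pi_i\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}), \tag{6.76}$$

where the Lagrangian multipliers are a scalar $\eta$ and column vector $\boldsymbol{\lambda}$ of the same
dimension as $\mathbf{h}(\cdot)$.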

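First, concentrate out the $N$ parameters $\pi_1, \ldots, \pi_N$. Differentiating $\mathcal{L}(\boldsymbol{\pi}, \eta, \boldsymbol{\lambda}, \boldsymbol{\theta})$
with respect to $\pi_i$ yields $1/(N\pi_i) - \eta - \boldsymbol{\lambda}'\mathbf{h}_i = 0$. Then we obtain $\eta = 1$ by
multiplying by $\pi_i$ and summing over $i$ and using $\sum_i \pi_i\mathbf{h}_i = \mathbf{0}$. It follows that

$$\pi_i(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \frac{1}{N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}. \tag{6.77}$$

The problem is now reduced to a maximization problem with respect to $(r + q)$
variables $\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$, the Lagrangian multipliers associated with the $r$ moment conditions
(6.74), and the $q$ parameters $\boldsymbol{\theta}$.

Solution at this stage requires numerical methods, even for just-identified
models. One can maximize with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$ the function
$N^{-1}\sum_i \ln[1/N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))]$.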

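Alternatively, first concentrate out $\boldsymbol{\lambda}$. Differentiating $\mathcal{L}(\boldsymbol{\pi}(\boldsymbol{\theta}, \boldsymbol{\lambda}), \boldsymbol{\lambda})$ with respect
to $\boldsymbol{\lambda}$ yields $\sum_i \pi_i\mathbf{h}_i = \mathbf{0}$. Define $\boldsymbol{\lambda}(\boldsymbol{\theta})$ to be the implicit solution to the system of
$\dim(\boldsymbol{\lambda})$ equations

$$\sum_{i=1}^N \frac{1}{N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{0}.$$

In implementation numerical methods are needed to obtain $\boldsymbol{\lambda}(\boldsymbol{\theta})$. Then (6.77) becomes

$$\pi_i(\boldsymbol{\theta}) = \frac{1}{N(1 + \boldsymbol{\lambda}(\boldsymbol{\theta})'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}. \tag{6.78}$$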
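
By substituting (6.78) into the empirical likelihood function $N^{-1}\sum_i \ln\pi_i$, the
empirical log-likelihood function evaluated at $\boldsymbol{\theta}$ becomes

$$\mathcal{L}_{EL}(\boldsymbol{\theta}) = -N^{-1}\sum_{i=1}^N \ln[N(1 + \boldsymbol{\lambda}(\boldsymbol{\theta})'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))].$$

The maximum empirical likelihood (MEL) estimator $\hat{\boldsymbol{\theta}}_{MEL}$ maximizes this function
with respect to $\boldsymbol{\theta}$.

Qin and Lawless (1994) show that

$$\sqrt{N}(\hat{\boldsymbol{\theta}}_{MEL} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\big[\mathbf{0},\ \mathbf{A}(\boldsymbol{\theta}_0)^{-1}\mathbf{B}(\boldsymbol{\theta}_0)\mathbf{A}(\boldsymbol{\theta}_0)'^{-1}\big],$$

where $\mathbf{A}(\boldsymbol{\theta}_0) = \mathrm{plim}\,\mathrm{E}[\partial\mathbf{h}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'|_{\boldsymbol{\theta}_0}]$ and $\mathbf{B}(\boldsymbol{\theta}_0) = \mathrm{plim}\,\mathrm{E}[\mathbf{h}(\boldsymbol{\theta})\mathbf{h}(\boldsymbol{\theta})'|_{\boldsymbol{\theta}_0}]$. This is the
same limit distribution as the method of moments (see (6.13)). In finite samples $\hat{\boldsymbol{\theta}}_{MEL}$
differs from $\hat{\boldsymbol{\theta}}_{GMM}$, however, and inference is based on sample estimates

$$\hat{\mathbf{A}} = \sum_{i=1}^N \hat{\pi}_i\left.\frac{\partial\mathbf{h}_i}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}}, \qquad \hat{\mathbf{B}} = \sum_{i=1}^N \hat{\pi}_i\mathbf{h}_i(\hat{\boldsymbol{\theta}})\mathbf{h}_i(\hat{\boldsymbol{\theta}})'$$

that weight by the estimated probabilities $\hat{\pi}_i$ rather than the proportions $1/N$.

Imbens (2002) provides a recent survey of empirical likelihood that contrasts
empirical likelihood with GMM. Variations include replacing $N^{-1}\sum_i \ln\pi_i$ in (6.76)
by $N^{-1}\sum_i \pi_i\ln\pi_i$. Empirical likelihood is computationally more burdensome; see
Imbens (2002) for a discussion. The advantage is that the asymptotic theory provides
a better finite-sample approximation to the distribution of the empirical likelihood
estimator than it does to that for the GMM estimator. This is pursued further in
Section 11.6.4.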


6.9.  Linear  Systems  of  Equations 

The  preceding  estimation  theory  covers  single-equation  estimation  methods  used  in 
the  majority  of  applied  studies.  We  now  consider  joint  estimation  of  several  equations. 
Equations  linear  in  parameters  with  an  additive  error  are  presented  in  this  section,  with 
extensions  to  nonlinear  systems  given  in  the  subsequent  section. 

The  main  advantage  of  joint  estimation  is  the  gain  in  efficiency  that  results  from 
incorporation  of  correlation  in  unobservables  across  equations  for  a  given  individual. 
Additionally,  joint  estimation  may  be  necessary  if  there  are  restrictions  on  parameters 
across  equations.  With  exogenous  regressors  systems  estimation  is  a  minor  extension 
of  single-equation  OLS  and  GLS  estimation,  whereas  with  endogenous  regressors  it  is 
single-equation  IV  methods  that  are  adapted. 

One  leading  example  is  systems  of  equations  such  as  those  for  observed  demand  of 
several  commodities  at  a  point  in  time  for  many  individuals.  For  seemingly  unrelated 
regression  all  regressors  are  exogenous  whereas  for  simultaneous  equations  models 
some  regressors  are  endogenous. 

A second leading example is panel data, where a single equation is observed at
several  points  in  time  for  many  individuals,  and  each  time  period  is  treated  as  a  separate 
equation.  By  viewing  a  panel  data  model  as  an  example  of  a  system  it  is  possible  to 
improve  efficiency,  obtain  panel  robust  standard  errors,  and  derive  instruments  when 
some  regressors  are  endogenous. 

Many  econometrics  texts  provide  lengthy  presentations  of  linear  systems.  The  treat¬ 
ment  here  is  very  brief.  It  is  mainly  directed  toward  generalization  to  nonlinear  systems 
(see  Section  6.10)  and  application  to  panel  data  (see  Chapters  21-23). 

6.9.1.  Linear  Systems  of  Equations 

The single-equation linear model is given by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$, where $y_i$ and $u_i$ are scalars
and $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are column vectors. The multiple-equation linear model, or
multivariate linear model, with $G$ dependent variables is given by

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i, \quad i = 1, \ldots, N, \tag{6.79}$$

where $\mathbf{y}_i$ and $\mathbf{u}_i$ are $G \times 1$ vectors, $\mathbf{X}_i$ is a $G \times K$ matrix, and $\boldsymbol{\beta}$ is a $K \times 1$ column
vector.

Throughout this section we make the cross-section assumption that the error vector
$\mathbf{u}_i$ is independent over $i$, so $\mathrm{E}[\mathbf{u}_i\mathbf{u}_j'] = \mathbf{0}$ for $i \neq j$. However, components of $\mathbf{u}_i$ for
given $i$ may be correlated and have variances and covariances that vary over $i$, leading
to conditional error variance matrix for the $i$th individual

$$\boldsymbol{\Omega}_i = \mathrm{E}[\mathbf{u}_i\mathbf{u}_i'|\mathbf{X}_i]. \tag{6.80}$$

There  are  various  ways  that  a  multiple-equation  model  may  arise.  At  one  extreme 
the  seemingly  unrelated  equations  model  combines  G  equations,  such  as  demands  for 
different  consumer  goods,  where  parameters  vary  across  equations  and  regressors  may 
or may not vary across equations. At the other extreme the linear panel data model combines
G  periods  of  data  for  the  same  equation,  with  parameters  that  are  constant  across 
periods  and  regressors  that  may  or  may  not  vary  across  periods.  These  two  cases  are 
presented  in  detail  in  Sections  6.9.3  and  6.9.4. 

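Stacking (6.79) over $N$ individuals gives

$$\begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix}\boldsymbol{\beta} + \begin{bmatrix} \mathbf{u}_1 \\ \vdots \\ \mathbf{u}_N \end{bmatrix}, \tag{6.81}$$

or

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \tag{6.82}$$

where $\mathbf{y}$ and $\mathbf{u}$ are $NG \times 1$ vectors and $\mathbf{X}$ is an $NG \times K$ matrix.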

The results given in the following can be obtained by treating the stacked model
(6.82) in the same way as in the single-equation case. Thus the OLS estimator is
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and in the just-identified case with instrument matrix $\mathbf{Z}$ the IV estimator
is $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$. The only real change is that the usual cross-section assumption of
a diagonal error variance matrix is replaced by assumption of a block-diagonal error
variance matrix. This block-diagonality needs to be accommodated in computing the estimated
variance matrix of a systems estimator and in forming feasible GLS estimators and
efficient GMM estimators.


6.9.2.  Systems  OLS  and  FGLS  Estimation 

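An OLS estimation of the system (6.82) yields the systems OLS estimator
$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Using (6.81) it follows immediately that

$$\hat{\boldsymbol{\beta}}_{SOLS} = \Big[\sum_{i=1}^N \mathbf{X}_i'\mathbf{X}_i\Big]^{-1}\sum_{i=1}^N \mathbf{X}_i'\mathbf{y}_i. \tag{6.83}$$

The estimator is asymptotically normal and, assuming the data are independent over $i$,
the usual robust sandwich result applies and

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{SOLS}] = \Big[\sum_{i=1}^N \mathbf{X}_i'\mathbf{X}_i\Big]^{-1}\sum_{i=1}^N \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i\Big[\sum_{i=1}^N \mathbf{X}_i'\mathbf{X}_i\Big]^{-1}, \tag{6.84}$$

where $\hat{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}$. This variance matrix estimate permits conditional variances and
covariances of the errors to differ across individuals.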

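Given correlation of the components of the error vector for a given individual,
more efficient estimation is possible by GLS or FGLS. If observations are
independent over $i$, the systems GLS estimator is systems OLS applied to the transformed
system

$$\boldsymbol{\Omega}_i^{-1/2}\mathbf{y}_i = \boldsymbol{\Omega}_i^{-1/2}\mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i, \tag{6.85}$$

where $\boldsymbol{\Omega}_i$ is the error variance matrix defined in (6.80). The transformed error $\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i$
has mean zero and variance

$$\begin{aligned} \mathrm{E}\big[(\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i)(\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i)'\,\big|\,\mathbf{X}_i\big] &= \boldsymbol{\Omega}_i^{-1/2}\mathrm{E}[\mathbf{u}_i\mathbf{u}_i'|\mathbf{X}_i]\boldsymbol{\Omega}_i^{-1/2} \\ &= \boldsymbol{\Omega}_i^{-1/2}\boldsymbol{\Omega}_i\boldsymbol{\Omega}_i^{-1/2} \\ &= \mathbf{I}_G. \end{aligned}$$

So the transformed system has errors that are homoskedastic and uncorrelated over $G$
equations and OLS is efficient.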

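To implement this estimator, a model for $\boldsymbol{\Omega}_i$ needs to be specified, say $\boldsymbol{\Omega}_i = \boldsymbol{\Omega}_i(\boldsymbol{\gamma})$.
Then perform systems OLS estimation in the transformed system where $\boldsymbol{\Omega}_i$ is replaced
by $\boldsymbol{\Omega}_i(\hat{\boldsymbol{\gamma}})$, where $\hat{\boldsymbol{\gamma}}$ is a consistent estimate of $\boldsymbol{\gamma}$. This yields the systems feasible GLS
(SFGLS) estimator

$$\hat{\boldsymbol{\beta}}_{SFGLS} = \Big[\sum_{i=1}^N \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\Big]^{-1}\sum_{i=1}^N \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{y}_i. \tag{6.86}$$

This estimator is asymptotically normal and to guard against possible misspecification
of $\boldsymbol{\Omega}_i(\boldsymbol{\gamma})$ we can use the robust sandwich estimate of the variance matrix

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{SFGLS}] = \Big[\sum_{i=1}^N \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\Big]^{-1}\sum_{i=1}^N \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\Big[\sum_{i=1}^N \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\Big]^{-1}, \tag{6.87}$$

where $\hat{\boldsymbol{\Omega}}_i = \boldsymbol{\Omega}_i(\hat{\boldsymbol{\gamma}})$.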

The most common specification used for $\Omega_i$ is to assume that it does not vary over
$i$. Then $\Omega_i = \Omega$ is a $G \times G$ matrix that can be consistently estimated for finite $G$ and
$N \to \infty$ by

$$\hat{\Omega} = \frac{1}{N}\sum_{i=1}^{N} \hat{u}_i\hat{u}_i', \qquad (6.88)$$

where $\hat{u}_i = y_i - X_i\hat{\beta}_{SOLS}$. Then the SFGLS estimator is (6.86) with $\hat{\Omega}$ instead of $\hat{\Omega}_i$,
and after some algebra the SFGLS estimator can also be written as

$$\hat{\beta}_{SFGLS} = \left[X'(\hat{\Omega}^{-1} \otimes I_N)X\right]^{-1} X'(\hat{\Omega}^{-1} \otimes I_N)y, \qquad (6.89)$$

where $\otimes$ denotes the Kronecker product. The assumption that $\Omega_i = \Omega$ rules out, for
example, heteroskedasticity over $i$. This is a strong assumption, and in many applications
it is best to use robust standard errors calculated using (6.87), which gives correct
standard errors even if $\Omega_i$ does vary over $i$.
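A minimal sketch of the two-step procedure, assuming a common $\Omega$ estimated as in (6.88), is given below. It computes (6.86) and the robust variance (6.87) from sums over individuals, then recomputes the estimate from the Kronecker form (6.89). The function name, array layout, and simulated data are assumptions of this example. The Kronecker expression is evaluated with rows ordered equation by equation, which is the ordering under which the weight $\hat{\Omega}^{-1} \otimes I_N$ applies; with rows ordered individual by individual the equivalent weight would be $I_N \otimes \hat{\Omega}^{-1}$.

```python
import numpy as np

def systems_fgls(X, y):
    """X: (N, G, K) array of X_i; y: (N, G) array of y_i. Assumes Omega_i = Omega."""
    N = X.shape[0]
    b_ols = np.linalg.solve(np.einsum("igk,igl->kl", X, X),
                            np.einsum("igk,ig->k", X, y))
    u = y - X @ b_ols
    omega = u.T @ u / N                               # (6.88)
    omega_inv = np.linalg.inv(omega)
    XtW = np.einsum("igk,gh->ikh", X, omega_inv)      # X_i' Omega^{-1} for each i
    A = np.einsum("ikh,ihl->kl", XtW, X)              # sum_i X_i' Omega^{-1} X_i
    b = np.linalg.solve(A, np.einsum("ikh,ih->k", XtW, y))    # (6.86)
    s = np.einsum("ikh,ih->ik", XtW, y - X @ b)       # rows X_i' Omega^{-1} u_i
    A_inv = np.linalg.inv(A)
    return b, A_inv @ (s.T @ s) @ A_inv, omega        # estimate, (6.87), Omega-hat

rng = np.random.default_rng(0)
N, G, K = 200, 3, 2
X = rng.normal(size=(N, G, K))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(size=(N, 1)) + rng.normal(size=(N, G))
b_fgls, V_robust, omega_hat = systems_fgls(X, y)

# Kronecker form (6.89), rows ordered equation by equation (all i for g = 1, then g = 2, ...)
X_eq = X.transpose(1, 0, 2).reshape(G * N, K)
y_eq = y.T.reshape(-1)
W = np.kron(np.linalg.inv(omega_hat), np.eye(N))
b_kron = np.linalg.solve(X_eq.T @ W @ X_eq, X_eq.T @ W @ y_eq)
```

Forming the NG x NG Kronecker matrix is only practical for small samples; the per-individual sums give the same answer at far lower cost.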


6.9.3.  Seemingly  Unrelated  Regressions 

The seemingly unrelated regressions (SUR) model specifies the $g$th of $G$ equations
for the $i$th of $N$ individuals to be given by

$$y_{ig} = x_{ig}'\beta_g + u_{ig}, \quad g = 1, \ldots, G, \quad i = 1, \ldots, N, \qquad (6.90)$$

where $x_{ig}$ are regressors that are assumed to be exogenous and $\beta_g$ are $K_g \times 1$ parameter
vectors. For example, for demand data on $G$ goods for $N$ individuals, $y_{ig}$ may
be the $i$th individual's expenditure on good $g$ or budget share for good $g$. In all that
follows $G$ is assumed fixed and reasonably small while $N \to \infty$. Note that we use the
subscript order $y_{ig}$ as results then transfer easily to panel data with variable $y_{it}$ (see
Section 6.9.4). Other authors use the reverse order $y_{gi}$.

The SUR model was proposed by Zellner (1962). The term seemingly unrelated
regressions is deceptive, as clearly the equations are related if the errors $u_{ig}$ in different
equations are correlated. For the SUR model the relationship between $y_{ig}$ and $y_{ih}$ is
indirect; it comes through correlation in the errors across different equations.

Estimation combines observations over both equations and individuals. For microeconometrics
applications, where independence over $i$ is assumed, it is most convenient
to first stack all equations for a given individual. Stacking all $G$ equations for the $i$th
individual we get

$$\begin{bmatrix} y_{i1} \\ \vdots \\ y_{iG} \end{bmatrix} = \begin{bmatrix} x_{i1}' & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & x_{iG}' \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_G \end{bmatrix} + \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iG} \end{bmatrix}, \qquad (6.91)$$

which is of the form $y_i = X_i\beta + u_i$ in (6.79), where $y_i$ and $u_i$ are $G \times 1$ vectors
with $g$th entries $y_{ig}$ and $u_{ig}$, $X_i$ is a $G \times K$ matrix with $g$th row $[0 \cdots x_{ig}' \cdots 0]$, and
$\beta = [\beta_1' \cdots \beta_G']'$ is a $K \times 1$ vector where $K = K_1 + \cdots + K_G$. Some authors instead
first stack all individuals for a given equation, leading to different algebraic expressions
for the same estimators.

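Given the definitions of $X_i$ and $y_i$ it is easy to show that $\hat{\beta}_{SOLS}$ in (6.83) is

$$\begin{bmatrix} \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_G \end{bmatrix} = \begin{bmatrix} \left[\sum_{i=1}^{N} x_{i1}x_{i1}'\right]^{-1} \sum_{i=1}^{N} x_{i1}y_{i1} \\ \vdots \\ \left[\sum_{i=1}^{N} x_{iG}x_{iG}'\right]^{-1} \sum_{i=1}^{N} x_{iG}y_{iG} \end{bmatrix},$$

so that systems OLS is the same as separate equation-by-equation OLS. As might be
expected a priori, if the only link across equations is the error and the errors are treated
as being uncorrelated then joint estimation reduces to single-equation estimation.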

A better estimator is the feasible GLS estimator defined in (6.86) using $\hat{\Omega}$ in (6.88)
and statistical inference based on the asymptotic variance given in (6.87). This estimator
is generally more efficient than systems OLS, though it can be shown to collapse
to OLS if the errors are uncorrelated across equations or if exactly the same regressors
appear in each equation.
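Both equivalences can be verified numerically. The sketch below builds the block-diagonal $X_i$ of (6.91), checks that systems OLS reproduces equation-by-equation OLS, and checks that FGLS with $\hat{\Omega}$ from (6.88) reproduces OLS when every equation has the same regressors. The helper names (`sur_design`, `sols`, `sfgls`) and the two-equation simulated data are assumptions of this example.

```python
import numpy as np

def sur_design(x_list):
    """x_list[g]: (N, K_g) regressors of equation g -> (N, G, K) block-diagonal X_i."""
    N, G = x_list[0].shape[0], len(x_list)
    X = np.zeros((N, G, sum(x.shape[1] for x in x_list)))
    col = 0
    for g, x in enumerate(x_list):
        X[:, g, col:col + x.shape[1]] = x      # g-th row of X_i is [0 ... x_ig' ... 0]
        col += x.shape[1]
    return X

def sols(X, y):
    return np.linalg.solve(np.einsum("igk,igl->kl", X, X),
                           np.einsum("igk,ig->k", X, y))

def sfgls(X, y):
    u = y - X @ sols(X, y)
    W = np.linalg.inv(u.T @ u / len(y))        # inverse of Omega-hat in (6.88)
    XW = np.einsum("igk,gh->ikh", X, W)
    return np.linalg.solve(np.einsum("ikh,ihl->kl", XW, X),
                           np.einsum("ikh,ih->k", XW, y))

rng = np.random.default_rng(1)
N = 300
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])
x2 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
common = rng.normal(size=N)                    # induces cross-equation error correlation
y = np.column_stack([x1 @ np.array([1.0, 2.0]) + common + rng.normal(size=N),
                     x2 @ np.array([0.5, -1.0, 1.0]) + common + rng.normal(size=N)])

b_sys = sols(sur_design([x1, x2]), y)
b_sep = np.concatenate([np.linalg.lstsq(x1, y[:, 0], rcond=None)[0],
                        np.linalg.lstsq(x2, y[:, 1], rcond=None)[0]])

X_same = sur_design([x2, x2])                  # identical regressors in both equations
b_fgls_same, b_ols_same = sfgls(X_same, y), sols(X_same, y)
```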

Seemingly unrelated regression models may impose cross-equation parameter
restrictions. For example, a symmetry restriction may imply that the coefficient of
the second regressor in the first equation equals the coefficient of the first regressor
in the second equation. If such restrictions are equality restrictions one can easily
estimate the model by appropriate redefinition of $X_i$ and $\beta$ given in (6.79). For example,
if there are two equations and the restriction is that $\beta_2 = -\beta_1$ then define
$X_i = [x_{i1} \;\; -x_{i2}]'$ and $\beta = \beta_1$. Alternatively, one can estimate using systems extensions
of single-equation OLS and GLS with linear restrictions on the parameters.

Also, in systems of equations it is possible that the variance matrix of the error
vector $u_i$ is singular, as a result of adding-up constraints. For example, suppose $y_{ig}$
is the $g$th budget share, and the model is $y_{ig} = \alpha_g + z_i'\beta_g + u_{ig}$, where the same regressors
appear in each equation. Then $\sum_g y_{ig} = 1$ since budget shares sum to one,
which requires $\sum_g \alpha_g = 1$, $\sum_g \beta_g = 0$, and $\sum_g u_{ig} = 0$. The last restriction means
$\Omega_i$ is singular and hence noninvertible. One can eliminate one equation, say the last,
and estimate the model by systems estimation applied to the remaining $G - 1$ equations.
Then the parameter estimates for the $G$th equation can be obtained using the
adding-up constraint. For example, $\hat{\alpha}_G = 1 - (\hat{\alpha}_1 + \cdots + \hat{\alpha}_{G-1})$. It is also possible
to impose equality restrictions on the parameters in this setup. A literature exists on
methods that ensure that estimates obtained are invariant to the equation deleted; see,
for example, Berndt and Savin (1975).
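The adding-up constraint is easy to see in simulated data. In the sketch below the budget shares are generated from a softmax transformation purely so that they are positive and sum to one; this data-generating process and all variable names are assumptions of the example, not a demand system from the text. Because the same regressors appear in every equation, equation-by-equation OLS is used.

```python
import numpy as np

rng = np.random.default_rng(2)
N, G = 400, 3
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # constant plus two regressors
index = Z @ rng.normal(size=(3, G)) * 0.3 + rng.normal(size=(N, G)) * 0.2
shares = np.exp(index) / np.exp(index).sum(axis=1, keepdims=True)  # rows sum to one

coef = np.linalg.lstsq(Z, shares, rcond=None)[0]   # column g holds (alpha_g, beta_g)
resid = shares - Z @ coef
omega_hat = resid.T @ resid / N
eigs = np.linalg.eigvalsh(omega_hat)               # ascending; smallest is numerically zero

# Drop the last equation and recover its parameters from the adding-up constraint
last_eq = np.array([1.0, 0.0, 0.0]) - coef[:, :G - 1].sum(axis=1)
```

Here `coef.sum(axis=1)` equals (1, 0, 0), the residuals sum to zero across equations for every individual, and `omega_hat` therefore has one zero eigenvalue.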

6.9.4.  Panel  Data


Another leading application of systems GLS methods is to panel data, where a scalar
dependent variable is observed in each of $T$ time periods for $N$ individuals. Panel data
can be viewed as a system of equations, either $T$ equations for $N$ individuals or $N$
equations for $T$ time periods. In microeconometrics we assume a short panel, with $T$
small and $N \to \infty$, so it is natural to set it up as a scalar dependent variable $y_{it}$, where
the $g$th equation in the preceding discussion is now interpreted as the $t$th time period
and $G = T$.

A  simple  panel  data  model  is 


$$y_{it} = x_{it}'\beta + u_{it}, \quad t = 1, \ldots, T, \quad i = 1, \ldots, N, \qquad (6.92)$$

a specialization of (6.90) with $\beta$ now constant. Then in (6.79) the regressor matrix
becomes $X_i = [x_{i1} \cdots x_{iT}]'$. After some algebra the systems OLS estimator defined in
(6.83) can be reexpressed as

$$\hat{\beta}_{POLS} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}x_{it}'\right]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}y_{it}. \qquad (6.93)$$


This estimator is called the pooled OLS estimator as it pools or combines the cross-section
and time-series aspects of the data.

The pooled estimator is obtained simply by OLS estimation of $y_{it}$ on $x_{it}$. However,
if $u_{it}$ are correlated over $t$ for given $i$, the default OLS standard errors that assume
independence of the error over both $i$ and $t$ are invalid and can be greatly downward
biased. Instead, statistical inference should be based on the robust form of the covariance
matrix given in (6.84). This is detailed in Section 21.2.3. In practice models
more complicated than (6.92) that include individual specific effects are estimated (see
Section 21.2).
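The downward bias of the default standard errors can be illustrated with a small simulation. In this sketch both the regressor and the error contain an individual-specific component, so both are correlated over $t$ for given $i$; the design, the sample sizes, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 5
x = rng.normal(size=(N, 1)) + 0.5 * rng.normal(size=(N, T))   # persistent regressor
u = rng.normal(size=(N, 1)) + 0.5 * rng.normal(size=(N, T))   # errors correlated over t
y = 1.0 + 2.0 * x + u

X = np.stack([np.ones((N, T)), x], axis=2)        # (N, T, 2): X_i stacks [1, x_it]
Xf, yf = X.reshape(-1, 2), y.reshape(-1)
XtX_inv = np.linalg.inv(Xf.T @ Xf)
b_pols = XtX_inv @ Xf.T @ yf                      # pooled OLS

e = y - X @ b_pols
s2 = (e ** 2).sum() / (N * T - 2)
se_default = np.sqrt(np.diag(s2 * XtX_inv))       # assumes independence over i and t

s = np.einsum("itk,it->ik", X, e)                 # rows X_i'u_i
se_robust = np.sqrt(np.diag(XtX_inv @ (s.T @ s) @ XtX_inv))   # from (6.84)
```

In this design the robust standard error of the slope is well above the default one, because the default formula ignores the within-individual correlation.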


6.9.5.  Systems  IV  Estimation 

Estimation of a single linear equation with endogenous regressors was presented
in Section 6.4. Now we extend this to the multivariate linear model (6.79) when
$E[u_i \mid X_i] \neq 0$. Brundy and Jorgenson (1971) considered IV estimation applied to the
system of equations to produce estimates that are both consistent and efficient.

We assume the existence of a $G \times r$ matrix of instruments $Z_i$ that satisfy $E[u_i \mid Z_i] = 0$
and hence

$$E[Z_i'(y_i - X_i\beta)] = 0. \qquad (6.94)$$


These instruments can be used to obtain consistent parameter estimates using single-equation
IV methods, but joint equation estimation can improve efficiency. The systems
GMM estimator minimizes

$$Q_N(\beta) = \left[\sum_{i=1}^{N} Z_i'(y_i - X_i\beta)\right]' W_N \left[\sum_{i=1}^{N} Z_i'(y_i - X_i\beta)\right], \qquad (6.95)$$

where $W_N$ is an $r \times r$ weighting matrix. Performing some algebra yields

$$\hat{\beta}_{SGMM} = \left[X'ZW_NZ'X\right]^{-1}\left[X'ZW_NZ'y\right], \qquad (6.96)$$

where $X$ is an $NG \times K$ matrix obtained by stacking $X_1, \ldots, X_N$ (see (6.81)) and $Z$
is an $NG \times r$ matrix obtained by similarly stacking $Z_1, \ldots, Z_N$. The systems GMM
estimator has exactly the same form as (6.37), and the asymptotic variance matrix is
that given in (6.39). It follows that a robust estimate of the variance matrix is

$$\hat{V}[\hat{\beta}_{SGMM}] = N\left[X'ZW_NZ'X\right]^{-1}\left[X'ZW_N\hat{S}W_NZ'X\right]\left[X'ZW_NZ'X\right]^{-1}, \qquad (6.97)$$

where, in the systems case and assuming independence over $i$,

$$\hat{S} = \frac{1}{N}\sum_{i=1}^{N} Z_i'\hat{u}_i\hat{u}_i'Z_i. \qquad (6.98)$$

Several  choices  of  weighting  matrix  receive  particular  attention. 

First, the optimal systems GMM estimator is (6.96) with $W_N = \hat{S}^{-1}$, where $\hat{S}$ is
defined in (6.98). The variance matrix then simplifies to

$$\hat{V}[\hat{\beta}_{OSGMM}] = N\left[X'Z\hat{S}^{-1}Z'X\right]^{-1}.$$

This estimator is the most efficient GMM estimator based on moment conditions
(6.94). The efficiency gain arises from two factors: (1) systems estimation, which permits
errors in different equations to be correlated, so that $V[u_i \mid Z_i]$ is not restricted to
being block diagonal, and (2) an allowance for quite general heteroskedasticity and
correlation, so that $\Omega_i$ can vary over $i$.

Second, the systems 2SLS estimator arises when $W_N = (N^{-1}Z'Z)^{-1}$. Consider
the SUR model defined in (6.91), with some of the regressors $x_{ig}$ now endogenous.
Then systems 2SLS reduces to equation-by-equation 2SLS, with instruments $z_{ig}$ for
the $g$th equation, if we define the instrument matrix to be

$$Z_i = \begin{bmatrix} z_{i1}' & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & z_{iG}' \end{bmatrix}. \qquad (6.99)$$

In many applications $z_{i1} = z_{i2} = \cdots = z_{iG}$ so that a common set of instruments is used
in all equations, but we need not restrict analysis to this case. For the panel data model
(6.92) systems 2SLS reduces to pooled 2SLS if we define $Z_i = [z_{i1} \cdots z_{iT}]'$.

Third, suppose that $V[u_i \mid Z_i]$ does not vary over $i$, so that $V[u_i \mid Z_i] = \Omega$. This is a
systems analogue of the single-equation assumption of homoskedasticity. Then as with
(6.88) a consistent estimate of $\Omega$ is $\hat{\Omega} = N^{-1}\sum_i \hat{u}_i\hat{u}_i'$, where $\hat{u}_i$ are residuals based
on a consistent IV estimator such as systems 2SLS. Then the optimal GMM estimator
is (6.96) with $W_N = I_N \otimes \hat{\Omega}$. This estimator should be contrasted with the three-stage
least-squares estimator presented at the end of the next section.
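A minimal sketch of two-step systems GMM follows, assuming a panel setup with one endogenous regressor and two instruments per period. The first step is systems 2SLS with $W_N = (N^{-1}Z'Z)^{-1}$; the second step uses $W_N = \hat{S}^{-1}$ with $\hat{S}$ from (6.98). Function names, array shapes (`X` is N x G x K, `Z` is N x G x r), and the simulated design are assumptions of this example.

```python
import numpy as np

def systems_gmm(X, Z, y, W):
    ZX = np.einsum("igr,igk->rk", Z, X)          # Z'X = sum_i Z_i'X_i
    Zy = np.einsum("igr,ig->r", Z, y)            # Z'y = sum_i Z_i'y_i
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)      # (6.96)

def two_step_gmm(X, Z, y):
    N = X.shape[0]
    ZZ = np.einsum("igr,igs->rs", Z, Z)
    b_2sls = systems_gmm(X, Z, y, np.linalg.inv(ZZ / N))      # systems 2SLS
    m = np.einsum("igr,ig->ir", Z, y - X @ b_2sls)            # rows Z_i'u_i
    S_inv = np.linalg.inv(m.T @ m / N)                        # inverse of (6.98)
    b_opt = systems_gmm(X, Z, y, S_inv)
    ZX = np.einsum("igr,igk->rk", Z, X)
    V_opt = N * np.linalg.inv(ZX.T @ S_inv @ ZX)              # simplified variance
    return b_2sls, b_opt, V_opt

rng = np.random.default_rng(4)
N, T = 1000, 2
Z = rng.normal(size=(N, T, 2))                   # two instruments per period
v = rng.normal(size=(N, T))
x = Z.sum(axis=2) + v                            # regressor correlated with the error
u = 0.8 * v + rng.normal(size=(N, 1)) + 0.5 * rng.normal(size=(N, T))
y = 1.5 * x + u
b_2sls, b_opt, V_opt = two_step_gmm(x[:, :, None], Z, y)
b_ols = (x * y).sum() / (x * x).sum()            # pooled OLS, inconsistent here
```

With this design pooled OLS is biased upward, while both GMM estimates are close to the true coefficient of 1.5.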

6.9.6.  Linear  Simultaneous  Equations  Systems

The linear simultaneous equations model, introduced in Section 2.4, is a very important
model that is often presented at considerable length in introductory graduate-level
econometrics courses. In this section we provide a very brief self-contained summary.
The discussion of identification overlaps with that in Chapter 2. Due to the presence
of endogenous variables OLS and SUR estimators are inconsistent. Consistent estimation
methods are placed in the context of GMM estimation, even though the standard
methods were developed well before GMM.

The  linear  simultaneous  equations  model  specifies  the  gth  of  G  equations  for  the 
ith  of  N  individuals  to  be  given  by 

$$y_{ig} = z_{ig}'\gamma_g + Y_{ig}'\beta_g + u_{ig}, \quad g = 1, \ldots, G, \qquad (6.100)$$

where the order of subscripts is that of Section 6.9 rather than Section 2.4, $z_g$ is
a vector of exogenous regressors that are assumed to be uncorrelated with the error
term $u_g$, and $Y_g$ is a vector that contains a subset of the dependent variables
$y_1, \ldots, y_{g-1}, y_{g+1}, \ldots, y_G$ of the other $G - 1$ equations. $Y_g$ is endogenous as it is
correlated with model errors. The model for the $i$th individual can equivalently be
written as

$$y_i'B + z_i'\Gamma = u_i', \qquad (6.101)$$

where $y_i = [y_{i1} \cdots y_{iG}]'$ is a $G \times 1$ vector of endogenous variables, $z_i$ is an $r \times 1$
vector of exogenous variables that is the union of $z_{i1}, \ldots, z_{iG}$, $u_i = [u_{i1} \cdots u_{iG}]'$ is
a $G \times 1$ error vector, $B$ is a $G \times G$ parameter matrix with diagonal entries unity, $\Gamma$ is
an $r \times G$ parameter matrix, and some of the entries in $B$ and $\Gamma$ are constrained to be
zero. It is assumed that $u_i$ is iid over $i$ with mean $0$ and variance matrix $\Sigma$.

The model (6.101) is called the structural form with different restrictions on $B$
and $\Gamma$ corresponding to different structures. Solving for the endogenous variables as a
function of the exogenous variables yields the reduced form

$$y_i' = -z_i'\Gamma B^{-1} + u_i'B^{-1} = z_i'\Pi + v_i', \qquad (6.102)$$

where $\Pi = -\Gamma B^{-1}$ is the $r \times G$ matrix of reduced form parameters and $v_i' = u_i'B^{-1}$
is the reduced form error vector with variance $\Omega = (B^{-1})'\Sigma B^{-1}$.
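The mapping from structural form to reduced form can be made concrete with a small two-equation example. The particular values of $B$ and $\Gamma$ below are hypothetical; the zero entries of $\Gamma$ act as exclusion restrictions. The sketch generates data from (6.101) and checks that OLS of $y$ on $z$ recovers $\Pi = -\Gamma B^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5000
B = np.array([[1.0, -0.5],
              [0.4, 1.0]])          # G x G with unit diagonal (hypothetical values)
Gamma = np.array([[-1.0, 0.0],
                  [0.0, -2.0]])     # r x G; zeros exclude a z from an equation
B_inv = np.linalg.inv(B)
Pi = -Gamma @ B_inv                 # reduced form parameters
Sigma = np.eye(2)                   # structural error variance
Omega = B_inv.T @ Sigma @ B_inv     # reduced form error variance

z = rng.normal(size=(N, 2))
u = rng.normal(size=(N, 2))
y = (u - z @ Gamma) @ B_inv         # row i is y_i' = -z_i' Gamma B^{-1} + u_i' B^{-1}

Pi_hat = np.linalg.lstsq(z, y, rcond=None)[0]     # OLS estimate of the reduced form
v_hat = y - z @ Pi_hat
Omega_hat = v_hat.T @ v_hat / N
```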

The reduced form can be consistently estimated by OLS, yielding estimates of
$\Pi = -\Gamma B^{-1}$ and $\Omega = (B^{-1})'\Sigma B^{-1}$. The problem of identification, see Section 2.5,
is one of whether these lead to unique estimates of the structural form parameters $B$,
$\Gamma$, and $\Sigma$. This requires some parameter restrictions since without restrictions $B$, $\Gamma$,
and $\Sigma$ contain $G^2$ more parameters than $\Pi$ and $\Omega$. A necessary condition for identification
of parameters in the $g$th equation is the order condition that the number of
exogenous variables excluded from the $g$th equation must be at least equal to the number
of endogenous variables included. This is the same as the order condition given
in Section 6.4.1. For example, if $Y_{ig}$ in (6.100) has one component, so there is one
endogenous variable in the equation, then at least one of the components of $z_i$ must
not be included. This will ensure that there are as many instruments as regressors.
A sufficient condition for identification is the stronger rank condition. This is given
in many books such as Greene's (2003) and for brevity is not given here. Other restrictions,
such as covariance restrictions, may also lead to identification.

Given identification, the structural model parameters can be consistently estimated
by separate estimation of each equation by two-stage least squares defined in (6.44).
The same set of instruments $z_i$ is used for each equation. In the $g$th equation the subcomponent
$z_{ig}$ is used as instrument for itself and the remainder of $z_i$ is used as instrument
for $Y_{ig}$.

More efficient systems estimates are obtained using the three-stage least-squares
(3SLS) estimator of Zellner and Theil (1962), which assumes errors are homoskedastic
but are correlated across equations. First, estimate the reduced form coefficients $\Pi$
in (6.102) by OLS regression of $y$ on $z$. Second, obtain the 2SLS estimates by OLS regression
of (6.100), where $Y_g$ is replaced by the reduced form predictions $\hat{Y}_g = z'\hat{\Pi}_g$.
This is OLS regression of $y_g$ on $\hat{Y}_g$ and $z_g$, or equivalently of $y_g$ on $\hat{x}_g$, where $\hat{x}_g$ are the
predictions of $Y_g$ and $z_g$ from OLS regression on $z$. Third, obtain the 3SLS estimates
by systems FGLS regression of $y_g$ on $\hat{x}_g$, $g = 1, \ldots, G$. Then from (6.89)

$$\hat{\beta}_{3SLS} = \left[\hat{X}'(\hat{\Omega}^{-1} \otimes I_N)\hat{X}\right]^{-1} \hat{X}'(\hat{\Omega}^{-1} \otimes I_N)y,$$

where $\hat{X}$ is obtained by first forming a block-diagonal matrix $\hat{X}_i$ with diagonal blocks
$\hat{x}_{i1}', \ldots, \hat{x}_{iG}'$ and then stacking $\hat{X}_1, \ldots, \hat{X}_N$, and $\hat{\Omega} = N^{-1}\sum_i \hat{u}_i\hat{u}_i'$ with $\hat{u}_i$ the residual
vectors calculated using the 2SLS estimates.

This estimator coincides with the systems GMM estimator with $W_N = I_N \otimes \hat{\Omega}$ in
the case that the systems GMM estimator uses the same instruments in every equation.
Otherwise, 3SLS and systems GMM differ, though both yield consistent estimates if
$E[u_i \mid z_i] = 0$.
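The three stages can be sketched in a few lines for a two-equation system. The structural coefficients, the three exogenous variables, and the function name `three_sls` are assumptions of this example; the third stage is computed from per-individual sums rather than from the Kronecker expression.

```python
import numpy as np

def three_sls(y_list, x_list, z):
    N, G = z.shape[0], len(y_list)
    # Stages 1 and 2: predict regressors from z, then equation-by-equation OLS (2SLS)
    xhat = [z @ np.linalg.lstsq(z, x, rcond=None)[0] for x in x_list]
    b_2sls = [np.linalg.lstsq(xh, y, rcond=None)[0] for xh, y in zip(xhat, y_list)]
    # 2SLS residuals use the actual regressors, not the predictions
    U = np.column_stack([y - x @ b for y, x, b in zip(y_list, x_list, b_2sls)])
    omega_inv = np.linalg.inv(U.T @ U / N)
    # Stage 3: GLS on the block-diagonal system of predicted regressors
    Xhat = np.zeros((N, G, sum(x.shape[1] for x in x_list)))
    col = 0
    for g, xh in enumerate(xhat):
        Xhat[:, g, col:col + xh.shape[1]] = xh
        col += xh.shape[1]
    Y = np.column_stack(y_list)
    XW = np.einsum("igk,gh->ikh", Xhat, omega_inv)
    b_3sls = np.linalg.solve(np.einsum("ikh,ihl->kl", XW, Xhat),
                             np.einsum("ikh,ih->k", XW, Y))
    return np.concatenate(b_2sls), b_3sls

# Hypothetical system: y1 = 0.5*y2 + z1 + u1,  y2 = -0.4*y1 + 2*z2 + z3 + u2
rng = np.random.default_rng(7)
N = 5000
z = rng.normal(size=(N, 3))
e = rng.normal(size=(N, 2))
u = np.column_stack([e[:, 0], 0.5 * e[:, 0] + e[:, 1]])   # correlated across equations
B = np.array([[1.0, 0.4], [-0.5, 1.0]])
Gamma = np.array([[-1.0, 0.0], [0.0, -2.0], [0.0, -1.0]])
y = (u - z @ Gamma) @ np.linalg.inv(B)
x_list = [np.column_stack([z[:, 0], y[:, 1]]),
          np.column_stack([z[:, 1], z[:, 2], y[:, 0]])]
b_2sls, b_3sls = three_sls([y[:, 0], y[:, 1]], x_list, z)
truth = np.array([1.0, 0.5, 2.0, 1.0, -0.4])
```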


6.9.7.  Linear  Systems  ML  Estimation 

The systems estimators for the linear model are essentially LS or IV estimators with inference
based on robust standard errors. Now additionally assume normally distributed
iid errors, so that $u_i \sim \mathcal{N}[0, \Omega]$.

For systems with exogenous regressors the resulting MLE is asymptotically equivalent
to the GLS estimator. These estimators do use different estimators of $\Omega$ and hence
$\beta$, however, so that there are small-sample differences between the MLE and the GLS
estimator. For example, see Chapter 21 for the random effects panel data model.

For the linear SEM (6.101), the limited information maximum likelihood estimator,
a single-equation ML estimator, is asymptotically equivalent to 2SLS. The
full information maximum likelihood estimator, the systems MLE, is asymptotically
equivalent to 3SLS. See, for example, Schmidt (1976) and Greene (2003).

6.10.  Nonlinear  Sets  of  Equations 

We now consider systems of equations that are nonlinear in parameters. For example,
demand equation systems obtained from a specified direct or indirect utility may be
nonlinear in parameters. More generally, if a nonlinear model is appropriate for a dependent
variable studied in isolation, for example a logit or Poisson model, then any
joint model for two or more such variables will necessarily be nonlinear.

We  begin  with  a  discussion  of  fully  parametric  joint  modeling,  before  focusing  on 
partially  parametric  modeling.  As  in  the  linear  case  we  present  models  with  exogenous 
regressors  before  considering  the  complication  of  endogenous  regressors. 


6.10.1.  Nonlinear  Systems  ML  Estimation 

Maximum likelihood estimation for a single dependent variable was presented in Section
5.6. These results can be immediately applied to joint models of several dependent
variables, with the very minor change that the single dependent variable conditional
density $f(y_i \mid x_i, \theta)$ becomes $f(y_i \mid X_i, \theta)$, where $y_i$ denotes the vector of dependent
variables, $X_i$ denotes all the regressors, and $\theta$ denotes all the parameters.

For example, if $y_1 \sim \mathcal{N}[\exp(x_1'\beta_1), \sigma_1^2]$ and $y_2 \sim \mathcal{N}[\exp(x_2'\beta_2), \sigma_2^2]$ then a suitable
joint model may be to assume that $(y_1, y_2)$ are bivariate normal with means $\exp(x_1'\beta_1)$
and $\exp(x_2'\beta_2)$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$.

For data that are not normally distributed there can be challenges in specifying and
selecting a sufficiently flexible joint distribution. For example, for univariate counts
a standard starting model is the negative binomial (see Chapter 20). However, in extending
this to a bivariate or multivariate model for counts there are several alternative
bivariate negative binomial models to choose from. These might differ, for example,
as to whether the univariate conditional distribution or the univariate marginal distribution
is negative binomial. In contrast the multivariate normal distribution has conditional
and marginal distributions that are both normal. All of these multivariate negative
binomial distributions place some restrictions on the range of correlation such as
restricting to positive correlation, whereas for the multivariate normal there is no such
restriction.

Fortunately, modern computational advances permit richer models to be specified.
For example, a reasonably flexible model for correlated bivariate counts is to assume
that, conditional on unobservables $\varepsilon_1$ and $\varepsilon_2$, $y_1$ is Poisson with mean $\exp(x_1'\beta_1 + \varepsilon_1)$
and $y_2$ is Poisson with mean $\exp(x_2'\beta_2 + \varepsilon_2)$. An estimable bivariate distribution can
be obtained by assuming that the unobservables $\varepsilon_1$ and $\varepsilon_2$ are bivariate normal and integrating
them out. There is no closed-form solution for this bivariate distribution, but
the parameters can nonetheless be estimated using the method of maximum simulated
likelihood presented in Section 12.4.
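To make the integration step concrete, the sketch below approximates the joint probability of one pair of counts by averaging the conditional Poisson probabilities over draws of the unobservables. It shows only the simulator for a single likelihood contribution, not a full maximum simulated likelihood routine, and the function names, parameter values, and number of draws are assumptions of this example.

```python
import math
import numpy as np

def poisson_pmf(y, mu):
    # Poisson probability of the integer count y for each mean in mu
    return np.exp(-mu + y * np.log(mu) - math.lgamma(y + 1))

def simulated_prob(y1, y2, m1, m2, sig1, sig2, rho, draws):
    """Simulated Pr[y1, y2], where m1 = x1'b1 and m2 = x2'b2 are the linear indexes
    and draws is an (S, 2) array of independent standard normal draws."""
    e1 = sig1 * draws[:, 0]
    e2 = sig2 * (rho * draws[:, 0] + np.sqrt(1.0 - rho ** 2) * draws[:, 1])
    return np.mean(poisson_pmf(y1, np.exp(m1 + e1)) * poisson_pmf(y2, np.exp(m2 + e2)))

rng = np.random.default_rng(6)
draws = rng.normal(size=(2000, 2))      # held fixed across parameter values
p = simulated_prob(1, 2, 0.5, 0.2, 0.5, 0.5, 0.6, draws)
loglik_i = np.log(p)                    # one individual's simulated log-likelihood
```

Summing `loglik_i` over individuals and maximizing over the parameters would give a simulated likelihood estimator; keeping `draws` fixed across parameter values keeps the objective smooth.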

A number of examples of nonlinear joint models are given throughout Part 4 of the
book. The simplest joint models can be inflexible, so consistency can rely on distributional
assumptions that are too restrictive. However, there is generally no theoretical
impediment to specifying more flexible models that can be estimated using computationally
intensive methods.

In particular, two leading methods for generating rich multivariate parametric models
are presented in detail in Section 19.3. These methods are given in the context of
duration data models, but they have much wider applicability. First, one can introduce
correlated unobserved heterogeneity, as in the bivariate count example just given.
Second, one can use copulas, which provide a way to generate a joint distribution
given specified univariate marginals.

For ML estimation a simpler though less efficient quasi-ML approach is to specify
separate parametric models for $y_1$ and $y_2$ and obtain ML estimates assuming independence
of $y_1$ and $y_2$ but then do statistical inference permitting $y_1$ and $y_2$ to be
correlated. This has been presented in Section 5.7.5. In the remainder of this section
we consider such partially parametric approaches.

The challenges become greater if there is endogeneity, so that a dependent variable
in one equation appears as a regressor in another equation. Few models for nonlinear
simultaneous equations exist, aside from nonlinear regression models with additive
errors that are normally distributed.


6.10.2.  Nonlinear  Systems  of  Equations 

For linear regression the movement from single equation to multiple equations is clear
as the starting point is the linear model $y = x'\beta + u$ and estimation is by least squares.
Efficient systems estimation is then by systems GLS estimation. For nonlinear models
there can be much more variety in the starting point and estimation method.

We define the multivariate nonlinear model with $G$ dependent variables to be

$$r(y_i, X_i, \beta) = u_i, \qquad (6.103)$$

where $y_i$ and $u_i$ are $G \times 1$ vectors, $r(y_i, X_i, \beta)$ is a $G \times 1$ vector function, $X_i$ is a
$G \times L$ matrix, and $\beta$ is a $K \times 1$ column vector. Throughout this section we make the
cross-section assumption that the error vector $u_i$ is independent over $i$, but components
of $u_i$ for given $i$ may be correlated with variances and covariances that vary over $i$.

One example of (6.103) is a nonlinear seemingly unrelated regression model.
Then the $g$th of $G$ equations for the $i$th of $N$ individuals is given by

$$r_g(y_{ig}, x_{ig}, \beta_g) = u_{ig}, \quad g = 1, \ldots, G. \qquad (6.104)$$

For example, $u_{ig} = y_{ig} - \exp(x_{ig}'\beta_g)$. Then $u_i$ and $r(\cdot)$ in (6.103) are $G \times 1$ vectors
with $g$th entries $u_{ig}$ and $r_g(\cdot)$, $X_i$ is the same block-diagonal matrix as that defined in
(6.91), and $\beta$ is obtained by stacking $\beta_1$ to $\beta_G$.

A second example is a nonlinear panel data model. Then for individual $i$ in
period $t$

$$r(y_{it}, x_{it}, \beta) = u_{it}, \quad t = 1, \ldots, T. \qquad (6.105)$$

Then $u_i$ and $r(\cdot)$ in (6.103) are $T \times 1$ vectors, so $G = T$, with $t$th entries $u_{it}$ and
$r(y_{it}, x_{it}, \beta)$. The panel model differs from the SUR model by having the same function
$r(\cdot)$ and parameters $\beta$ in each period.


6.10.3.  Nonlinear  Systems  Estimation 
When the regressors $X_i$ in the model (6.103) are exogenous

$$E[u_i \mid X_i] = 0, \qquad (6.106)$$

where $u_i$ is the error term defined in (6.103). We assume that the error term is independent
over $i$, and the variance matrix is

$$\Omega_i = E[u_iu_i' \mid X_i]. \qquad (6.107)$$


Additive  Errors 

Systems estimation is a straightforward adaptation of systems OLS and FGLS estimation
of the linear models when the nonlinear model is additive in the error term, so that
(6.103) specializes to

$$u_i = y_i - g(X_i, \beta). \qquad (6.108)$$

Then the systems NLS estimator minimizes the sum of squared residuals $\sum_i u_i'u_i$,
whereas the systems FGNLS estimator minimizes

$$Q_N(\beta) = \sum_i u_i'\hat{\Omega}_i^{-1}u_i, \qquad (6.109)$$

where we specify a model $\Omega_i(\gamma)$ for $\Omega_i$ and $\hat{\Omega}_i = \Omega_i(\hat{\gamma})$. To guard against possible
misspecification of $\Omega_i$ one can use robust standard errors that essentially require only
that $u_i$ is independent and satisfies (6.106). Then the estimated variance of the systems
FGNLS estimator is the same as that for the linear systems FGLS estimator in (6.87),
with $X_i$ replaced by $\partial g(X_i, \beta)/\partial\beta'|_{\hat{\beta}}$ and now $\hat{u}_i = y_i - g(X_i, \hat{\beta})$. The estimated variance
of the simpler systems NLS estimator is obtained by additionally replacing $\hat{\Omega}_i$
by $I_G$.
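A minimal sketch of systems NLS and FGNLS for an additive-error model with exponential conditional mean, $g(X_i, \beta) = \exp(X_i\beta)$ applied elementwise, is given below. Gauss-Newton iterations are used as one simple way to minimize (6.109); the common-$\Omega$ assumption, the function name, and the simulated panel are assumptions of this example.

```python
import numpy as np

def gauss_newton(X, y, omega_inv, b, iters=100):
    """Minimize sum_i u_i' Omega^{-1} u_i with u_i = y_i - exp(X_i b)."""
    for _ in range(iters):
        mu = np.exp(X @ b)                       # g(X_i, b), shape (N, G)
        J = mu[:, :, None] * X                   # dg/db' for each i, shape (N, G, K)
        JW = np.einsum("igk,gh->ikh", J, omega_inv)
        step = np.linalg.solve(np.einsum("ikh,ihl->kl", JW, J),
                               np.einsum("ikh,ih->k", JW, y - mu))
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

rng = np.random.default_rng(8)
N, G = 2000, 3
X = np.stack([np.ones((N, G)), rng.uniform(-1, 1, size=(N, G))], axis=2)
beta_true = np.array([0.5, -0.3])
u = 0.3 * (rng.normal(size=(N, 1)) + rng.normal(size=(N, G)))   # correlated within i
y = np.exp(X @ beta_true) + u

b_nls = gauss_newton(X, y, np.eye(G), np.zeros(2))              # systems NLS
resid = y - np.exp(X @ b_nls)
omega_hat = resid.T @ resid / N                                 # common Omega estimate
b_fgnls = gauss_newton(X, y, np.linalg.inv(omega_hat), b_nls)   # systems FGNLS
```

Robust standard errors would follow from (6.87) with $X_i$ replaced by the Jacobian `J` evaluated at the estimate.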

The main challenge can be specifying a useful model for $\Omega_i$. As an example, suppose
we wish to jointly model two count data variables. In Chapter 20 we show
that a standard model for counts, a little more general than the Poisson model,
specifies the conditional mean to be $\exp(x'\beta)$ and the conditional variance to be a
multiple of $\exp(x'\beta)$. Then a joint model might specify $u = [u_1 \; u_2]'$, where $u_1 =
y_1 - \exp(x_1'\beta_1)$ and $u_2 = y_2 - \exp(x_2'\beta_2)$. The variance matrix $\Omega_i$ then has diagonal
entries $\alpha_1\exp(x_{i1}'\beta_1)$ and $\alpha_2\exp(x_{i2}'\beta_2)$, and one possible parameterization for the covariance
is $\alpha_3[\exp(x_{i1}'\beta_1)\exp(x_{i2}'\beta_2)]^{1/2}$. The estimate $\hat{\Omega}_i$ then requires estimates of
$\beta_1$, $\beta_2$, $\alpha_1$, $\alpha_2$, and $\alpha_3$ that may be obtained from first-step single-equation estimation.


Nonadditive  Errors 

With nonadditive errors least-squares regression is no longer appropriate, as shown
in the single-equation case in Section 6.2.2. Wooldridge (2002) presents consistent
method of moments estimation.

The conditional moment restriction (6.106) leads to many possible unconditional
moment conditions that can be used for estimation. The obvious starting point is to
base estimation on the moment conditions $E[X_i'u_i] = 0$. However, other moment conditions
may be used. We more generally consider estimation based on $K$ moment
conditions

$$E[R(X_i, \beta)u_i] = 0, \qquad (6.110)$$

where $R(X_i, \beta)$ is a $K \times G$ matrix of functions of $X_i$ and $\beta$. The specification of
$R(X_i, \beta)$ and possible dependence on $\beta$ are discussed in the following.

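By construction there are as many moment conditions as parameters. The systems
method of moments estimator $\hat{\beta}_{SMM}$ solves the corresponding sample moment
conditions

$$\frac{1}{N}\sum_{i=1}^{N} R(X_i, \hat{\beta})\,r(y_i, X_i, \hat{\beta}_{SMM}) = 0, \qquad (6.111)$$

where in practice $R(X_i, \beta)$ is evaluated at a first-step estimate $\hat{\beta}$. This estimator is
asymptotically normal with variance matrix

$$\hat{V}[\hat{\beta}_{SMM}] = \left[\sum_{i=1}^{N} \hat{R}_i\hat{D}_i\right]^{-1} \sum_{i=1}^{N} \hat{R}_i\hat{u}_i\hat{u}_i'\hat{R}_i' \left[\sum_{i=1}^{N} \hat{D}_i'\hat{R}_i'\right]^{-1}, \qquad (6.112)$$

where $\hat{D}_i = \partial r_i/\partial\beta'|_{\hat{\beta}}$, $\hat{R}_i = R(X_i, \hat{\beta})$, and $\hat{u}_i = r(y_i, X_i, \hat{\beta}_{SMM})$.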

The main issue is specification of $R(X_i, \beta)$ in (6.110). From Section 6.3.7, the most
efficient estimator based on (6.106) specifies

$$R^*(X_i, \beta) = E\left[\frac{\partial r(y_i, X_i, \beta)'}{\partial\beta} \,\Big|\, X_i\right]\Omega_i^{-1}. \qquad (6.113)$$

In general the first expectation on the right-hand side requires strong distributional
assumptions, making optimal estimation difficult.

Simplification does occur, however, if the nonlinear model is one with additive error
defined in (6.108). Then $R^*(X_i, \beta) = \partial g(X_i, \beta)'/\partial\beta \times \Omega_i^{-1}$, and the estimating
equations (6.110) become

$$N^{-1}\sum_{i=1}^{N} \frac{\partial g(X_i, \hat{\beta})'}{\partial\beta}\,\hat{\Omega}_i^{-1}\left(y_i - g(X_i, \hat{\beta}_{SMM})\right) = 0.$$

This estimator is asymptotically equivalent to the systems FGNLS estimator that minimizes
(6.109).


6.10.4.  Nonlinear  Systems  IV  Estimation 

When the regressors $X_i$ in the model (6.103) are endogenous, so that $E[u_i \mid X_i] \neq 0$, we
assume the existence of a $G \times r$ matrix of instruments $Z_i$ such that

$$E[u_i \mid Z_i] = 0, \qquad (6.114)$$

where $u_i$ is the error term defined in (6.103). We assume that the error term is independent
over $i$, and the variance matrix is $\Omega_i = E[u_iu_i' \mid Z_i]$. For the nonlinear SUR model
$Z_i$ is as defined in (6.99).

The approach is similar to that used in the preceding section for the systems MM
estimator, with the additional complication that now there may be a surplus of instruments
leading to a need for GMM estimation rather than just MM estimation. Conditional
moment restriction (6.114) leads to many possible unconditional moment conditions
that can be used for estimation. Here we follow many others in basing estimation
on the moment conditions $E[Z_i'u_i] = 0$. Then a systems GMM estimator minimizes

$$Q_N(\beta) = \left[\sum_{i=1}^{N} Z_i'r(y_i, X_i, \beta)\right]' W_N \left[\sum_{i=1}^{N} Z_i'r(y_i, X_i, \beta)\right]. \qquad (6.115)$$


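This estimator is asymptotically normal with estimated variance

$$\hat{V}[\hat{\beta}_{SGMM}] = N\left[\hat{D}'ZW_NZ'\hat{D}\right]^{-1}\left[\hat{D}'ZW_N\hat{S}W_NZ'\hat{D}\right]\left[\hat{D}'ZW_NZ'\hat{D}\right]^{-1}, \qquad (6.116)$$

where $\hat{D}'Z = \sum_i \partial r_i'/\partial\beta|_{\hat{\beta}}\,Z_i$ and $\hat{S} = N^{-1}\sum_i Z_i'\hat{u}_i\hat{u}_i'Z_i$, and we assume $u_i$ is independent
over $i$ with variance matrix $V[u_i \mid Z_i] = \Omega_i$.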

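The choice $\mathbf{W}_N = [N^{-1} \sum_i \mathbf{Z}_i'\mathbf{Z}_i]^{-1}$ corresponds to NL2SLS in the case
that $\mathbf{r}(\mathbf{y}_i, \mathbf{X}_i, \boldsymbol\beta)$ is obtained from a nonlinear SUR model. The choice $\mathbf{W}_N =
[N^{-1} \sum_i \mathbf{Z}_i'\hat{\boldsymbol\Omega}\mathbf{Z}_i]^{-1}$, where $\hat{\boldsymbol\Omega} = N^{-1} \sum_i \hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'$, is called nonlinear 3SLS (NL3SLS)
and is the most efficient estimator based on the moment condition $E[\mathbf{Z}_i'\mathbf{u}_i] = \mathbf{0}$ in the
special case that $\boldsymbol\Omega_i = \boldsymbol\Omega$. The choice $\mathbf{W}_N = \hat{\mathbf{S}}^{-1}$ gives the most efficient estimator
under the more general assumption that $\boldsymbol\Omega_i$ may vary with $i$. As usual, however, moment
conditions other than $E[\mathbf{Z}_i'\mathbf{u}_i] = \mathbf{0}$ may lead to more efficient estimators.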
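
The three weighting-matrix choices just described can be sketched numerically. The following is a minimal illustration in Python with NumPy, not an implementation from any econometrics package. The function name `systems_gmm_weights`, the array layout (instruments stacked as an `(N, G, r)` array, first-step residuals as an `(N, G)` array), and the simulated inputs are all assumptions made for the example.

```python
import numpy as np

def systems_gmm_weights(Z, u_hat):
    """Return the NL2SLS, NL3SLS, and optimal GMM weighting matrices.

    Z     : array of shape (N, G, r), the instrument matrix Z_i for each i
    u_hat : array of shape (N, G), first-step residuals u_i-hat
    """
    N = Z.shape[0]
    # NL2SLS: W_N = [N^{-1} sum_i Z_i' Z_i]^{-1}
    ZZ = np.einsum("igr,igs->rs", Z, Z) / N
    # NL3SLS: W_N = [N^{-1} sum_i Z_i' Omega-hat Z_i]^{-1},
    # with Omega-hat = N^{-1} sum_i u_i-hat u_i-hat'
    omega_hat = u_hat.T @ u_hat / N
    ZOZ = np.einsum("igr,gh,ihs->rs", Z, omega_hat, Z) / N
    # Optimal GMM: W_N = S-hat^{-1}, S-hat = N^{-1} sum_i Z_i' u_i u_i' Z_i
    Zu = np.einsum("igr,ig->ir", Z, u_hat)  # row i holds (Z_i' u_i)'
    S_hat = Zu.T @ Zu / N
    return {
        "nl2sls": np.linalg.inv(ZZ),
        "nl3sls": np.linalg.inv(ZOZ),
        "optimal": np.linalg.inv(S_hat),
    }

# Illustrative call on simulated inputs (not data from the text).
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(200, 2, 3))
u_demo = rng.normal(size=(200, 2))
W_demo = systems_gmm_weights(Z_demo, u_demo)
```

Each returned matrix is $r \times r$ and could be passed as $\mathbf{W}_N$ to an objective of the form (6.115). In practice the residuals would come from a consistent first-step estimate rather than from simulation.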


6.10.5.  Nonlinear  Simultaneous  Equations  Systems 

The nonlinear simultaneous equations model specifies that the $g$th of $G$ equations
for the $i$th of $N$ individuals is given by

$$u_{ig} = r_g(\mathbf{y}_i, \mathbf{x}_{ig}, \boldsymbol\beta_g), \quad g = 1, \ldots, G. \tag{6.117}$$

This  is  the  nonlinear  SUR  model  with  regressors  that  now  include  dependent  variables 
from  other  equations.  Unlike  the  linear  SEM,  there  are  few  practically  useful  results  to 
help  ensure  that  a  nonlinear  SEM  is  identified. 

Given identification, consistent estimates can be obtained using the GMM estimators
presented in the previous section. Alternatively, we can assume that $\mathbf{u}_i \sim \mathcal{N}[\mathbf{0}, \boldsymbol\Omega]$
and obtain the nonlinear full-information maximum likelihood estimator. In a departure
from the linear SEM, the nonlinear full-information MLE in general has an
asymptotic distribution that differs from NL3SLS, and consistency of the nonlinear
full-information MLE requires that the errors are actually normally distributed. For
details see Amemiya (1985).

Handling endogeneity in nonlinear models can be complicated. Section 16.8 considers
simultaneity in Tobit models, where analysis is simpler when the model is linear
in the latent variables. Section 20.6.2 considers a more highly nonlinear example,
endogenous regressors in count data models.


6.11.  Practical  Considerations 

Ideally GMM could be implemented using an econometrics package, requiring little
more difficulty and knowledge than that needed, say, for nonlinear least-squares estimation
with heteroskedastic errors. However, not all leading econometrics packages
provide a broad GMM module. Depending on the specific application, GMM estimation
may require a switch to a more suitable package or use of a matrix programming
language along with familiarity with the algebra of GMM.

A common application of GMM is IV estimation. Most econometrics packages include
linear IV but not all include nonlinear IV estimators. The default standard errors
may assume homoskedastic errors rather than being heteroskedastic-robust. As already
emphasized in Chapter 4, it can be difficult to obtain instruments that are uncorrelated
with the error yet reasonably correlated with the regressor or, in the nonlinear case, the
appropriate derivative of the error with respect to parameters.

Econometrics  packages  usually  include  linear  systems  but  not  nonlinear  systems. 
Again,  default  standard  errors  may  not  be  robust  to  heteroskedasticity. 

6.12.  Bibliographic  Notes 

Textbook  treatments  of  GMM  include  chapters  by  Davidson  and  MacKinnon  (1993,  2004), 
Hamilton  (1994),  and  Greene  (2003).  The  more  recent  books  by  Hayashi  (2000)  and 
Wooldridge  (2002)  place  considerable  emphasis  on  GMM  estimation.  Bera  and  Bilias  (2002) 
provide  a  synthesis  and  history  of  many  of  the  estimators  presented  in  Chapters  5  and  6. 

6.3 The original reference for GMM is Hansen (1982). A good explanation of optimal moments
for GMM is given in the appendix of Arellano (2003). The October 2002 issue of
Journal of Business and Economic Statistics is devoted to GMM estimation.

6.4  The  classic  treatment  of  linear  IV  estimation  by  Sargan  (1958)  is  a  key  precursor  to  GMM. 

6.5  The  nonlinear  2SLS  estimator  introduced  by  Amemiya  (1974)  generalizes  easily  to  the 
GMM  estimator. 

6.6  Standard  references  for  sequential  two-step  estimation  are  Newey  (1984),  Murphy  and 
Topel  (1985),  and  Pagan  (1986). 

6.7  A  standard  reference  for  minimum  distance  estimation  is  Chamberlain  (1982). 

6.8  A  good  overview  of  empirical  likelihood  is  provided  by  Mittelhammer,  Judge,  and  Miller 
(2000)  and  key  references  are  Owen  (1988,  2001)  and  Qin  and  Lawless  (1994).  Imbens 
(2002)  provides  a  review  and  application  of  this  relatively  new  method. 

6.9  Texts  such  as  Greene’s  (2003)  provide  a  more  detailed  coverage  of  systems  estimation 
than  that  provided  here,  especially  for  linear  seemingly  unrelated  regressions  and  linear 
simultaneous  equations  models. 

6.10  Amemiya  (1985)  presents  nonlinear  simultaneous  equations  in  detail. 


- Exercises - 

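6-1 For the gamma regression model of Exercise 5.2, $E[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol\beta)$ and $V[y|\mathbf{x}] =
(\exp(\mathbf{x}'\boldsymbol\beta))^2/2$.

(a) Show that these conditions imply that $E[\mathbf{x}\{(y - \exp(\mathbf{x}'\boldsymbol\beta))^2 - (\exp(\mathbf{x}'\boldsymbol\beta))^2/2\}] = \mathbf{0}$.

(b) Use the moment condition in part (a) to form a method of moments estimator
$\hat{\boldsymbol\beta}_{MM}$.

(c) Give the asymptotic distribution of $\hat{\boldsymbol\beta}_{MM}$ using result (6.13).

(d) Suppose we use the moment condition $E[\mathbf{x}(y - \exp(\mathbf{x}'\boldsymbol\beta))] = \mathbf{0}$ in addition to that
in part (a). Give the objective function for a GMM estimator of $\boldsymbol\beta$.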

6-2 Consider the linear regression model for data independent over $i$ with $y_i =
\mathbf{x}_i'\boldsymbol\beta + u_i$. Suppose $E[u_i|\mathbf{x}_i] \neq 0$ but there are available instruments $\mathbf{z}_i$ with
$E[u_i|\mathbf{z}_i] = 0$ and $V[u_i|\mathbf{z}_i] = \sigma_i^2$, where $\dim(\mathbf{z}) > \dim(\mathbf{x})$. We consider the GMM
estimator $\hat{\boldsymbol\beta}$ that minimizes

$$Q_N(\boldsymbol\beta) = \left[ N^{-1} \sum_i \mathbf{z}_i (y_i - \mathbf{x}_i'\boldsymbol\beta) \right]' \mathbf{W}_N \left[ N^{-1} \sum_i \mathbf{z}_i (y_i - \mathbf{x}_i'\boldsymbol\beta) \right].$$

(a) Derive the limit distribution of $\sqrt{N}(\hat{\boldsymbol\beta} - \boldsymbol\beta_0)$ using the general GMM result
(6.11).

(b) State how to obtain a consistent estimate of the asymptotic variance of $\hat{\boldsymbol\beta}$.

(c) If errors are homoskedastic what choice of $\mathbf{W}_N$ would you use? Explain your
answer.

(d) If errors are heteroskedastic what choice of $\mathbf{W}_N$ would you use? Explain your
answer.

6-3 Consider the Laplace intercept-only example at the end of Section 6.3.6, so
$y = \mu + u$. Then GMM estimation is based on $E[\mathbf{h}(\mu)] = \mathbf{0}$, where $\mathbf{h}(\mu) = [(y - \mu), (y - \mu)^3]'$.

(a) Using knowledge of the central moments of $y$ given in Section 6.3.6, show
that $\mathbf{G}_0 = E[\partial\mathbf{h}/\partial\mu] = [-1, -6]'$ and that $\mathbf{S}_0 = E[\mathbf{h}\mathbf{h}']$ has diagonal entries 2
and 720 and off-diagonal entries 24.

(b) Hence show that $\mathbf{G}_0'\mathbf{S}_0^{-1}\mathbf{G}_0 = 252/432$.

(c) Hence show that $\hat\mu_{OGMM}$ has asymptotic variance $1.7143/N$.

(d) Show that the GMM estimator of $\mu$ with $\mathbf{W} = \mathbf{I}_2$ has asymptotic variance
$19.14/N$.

6-4  This  question  uses  the  probit  model  but  requires  little  knowledge  of  the  model. 
Let  y  denote  a  binary  variable  that  takes  value  0  or  1  according  to  whether  or 
not  an  event  occurs,  let  x  denote  a  regressor  vector,  and  assume  independent 
observations. 

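(a) Suppose $E[y|\mathbf{x}] = \Phi(\mathbf{x}'\boldsymbol\beta)$, where $\Phi(\cdot)$ is the standard normal cdf. Show that
$E[(y - \Phi(\mathbf{x}'\boldsymbol\beta))\mathbf{x}] = \mathbf{0}$. Hence give the estimating equations for a method of
moments estimator for $\boldsymbol\beta$.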

(b)  Will  this  estimator  yield  the  same  estimates  as  the  probit  MLE?  [For  just  this 
part  you  need  to  read  Section  14.3.] 

(c)  Give  a  GMM  objective  function  corresponding  to  the  estimator  in  part  (a). 
That  is,  give  an  objective  function  that  yields  the  same  first-order  conditions, 
up  to  a  full-rank  matrix  transformation,  as  those  obtained  in  part  (a). 

(d) Now suppose that because of endogeneity in some of the components
$E[y|\mathbf{x}] \neq \Phi(\mathbf{x}'\boldsymbol\beta)$. Assume there exists a vector $\mathbf{z}$, $\dim[\mathbf{z}] > \dim[\mathbf{x}]$, such that
$E[y - \Phi(\mathbf{x}'\boldsymbol\beta)|\mathbf{z}] = 0$. Give the objective function for a consistent estimator of
$\boldsymbol\beta$. The estimator need not be fully efficient.

(e)  For  your  estimator  in  part  (d)  give  the  asymptotic  distribution  of  the  estimator. 
State  clearly  any  assumptions  made  on  the  dgp  to  obtain  this  result. 

(f)  Give  the  weighting  matrix,  and  a  way  to  calculate  it,  for  the  optimal  GMM 
estimator  in  part  (d). 

(g)  Give  a  real-world  example  of  part  (d).  That  is,  give  a  meaningful  example  of 
a  probit  model  with  endogenous  regressor(s)  and  valid  instrument(s).  State 
the  dependent  variable,  the  endogenous  regressor(s),  and  the  instrument(s) 
used  to  permit  consistent  estimation.  [This  part  is  surprisingly  difficult.] 

6-5 Suppose we impose the constraint that $E[\mathbf{w}_i] = \mathbf{g}(\boldsymbol\theta)$, where $\dim[\mathbf{w}] > \dim[\boldsymbol\theta]$.

(a) Obtain the objective function for the GMM estimator.

(b) Obtain the objective function for the minimum distance estimator (see Section
6.7) with $\boldsymbol\pi = E[\mathbf{w}_i]$ and $\hat{\boldsymbol\pi} = \bar{\mathbf{w}}$.

(c)  Show  that  MD  and  GMM  are  equivalent  in  this  example. 

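6-6 The MD estimator (see Section 6.7) uses the restriction $\boldsymbol\pi - \mathbf{g}(\boldsymbol\theta) = \mathbf{0}$. Suppose
more generally that the restriction is $\mathbf{h}(\boldsymbol\theta, \boldsymbol\pi) = \mathbf{0}$ and we estimate using the generalized
MD estimator that minimizes $Q_N(\boldsymbol\theta) = \mathbf{h}(\boldsymbol\theta, \hat{\boldsymbol\pi})'\mathbf{W}_N\mathbf{h}(\boldsymbol\theta, \hat{\boldsymbol\pi})$. Adapt (6.68)-(6.70)
to show that (6.67) holds with $\mathbf{G}_0 = \partial\mathbf{h}(\boldsymbol\theta, \boldsymbol\pi)/\partial\boldsymbol\theta'|_{\boldsymbol\theta_0, \boldsymbol\pi_0}$ and $V[\hat{\boldsymbol\pi}]$ replaced by
$\mathbf{H}_0'V[\hat{\boldsymbol\pi}]\mathbf{H}_0$, where $\mathbf{H}_0 = \partial\mathbf{h}(\boldsymbol\theta, \boldsymbol\pi)/\partial\boldsymbol\pi|_{\boldsymbol\theta_0, \boldsymbol\pi_0}$.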

6-7 For data generated from the dgp given in Section 6.6.4 with $N = 1{,}000$, obtain
NL2SLS estimates and compare these to the two-stage estimates.

CHAPTER 7


Hypothesis  Tests 


7.1.  Introduction 

In this chapter we consider tests of hypotheses, possibly nonlinear in the parameters,
using estimators appropriate for nonlinear models.

The distribution of test statistics can be obtained using the same statistical theory as
that used for estimators, since test statistics, like estimators, are statistics, that is,
functions of the sample. Given appropriate linearization of estimators and hypotheses, the
results closely resemble those for testing linear restrictions in the linear regression
model. The results rely on asymptotic theory, however, and exact $t$- and $F$-distributed
test statistics for the linear model under normality are replaced by test statistics that
are asymptotically standard normal distributed ($z$-tests) or chi-square distributed.

There are two main practical concerns in hypothesis testing. First, tests may have
the wrong size, so that in testing at a nominal significance level of, say, 5%, the actual
probability of rejection of the null hypothesis may be much more or less than
5%. Such a wrong size is almost certain to arise in moderate size samples as the underlying
asymptotic distribution theory is only an approximation. One remedy is the
bootstrap method, introduced in this chapter but sufficiently important and broad to be
treated separately in Chapter 11. Second, tests may have low power, so that there is low
probability of rejecting the null hypothesis when it should be rejected. This potential
weakness of tests is often neglected. Size and power are given more prominence here
than in most textbook treatments of testing.

The Wald test, the most widely used testing procedure, is defined in Section 7.2.
Section 7.3 additionally presents the likelihood ratio test and score or Lagrange multiplier
tests, applicable when estimation is by ML. The various tests are illustrated in
Section 7.4. Section 7.5 extends these tests to estimators other than ML, including robust
forms of tests. Sections 7.6, 7.7, and 7.8 present, respectively, test power, Monte
Carlo simulation methods, and the bootstrap.

Methods  for  determining  model  specification  and  selection,  rather  than  hypothesis 
tests  per  se,  are  given  separate  treatment  in  Chapter  8. 

7.2. Wald Test

The Wald test, due to Wald (1943), is the preeminent hypothesis test in microeconometrics.
It requires estimation of the unrestricted model, that is, the model without
imposition of the restrictions of the null hypothesis. The Wald test is widely used because
modern software usually permits estimation of the unrestricted model even if
it is more complicated than the restricted model, and modern software increasingly
provides robust variance matrix estimates that permit Wald tests under relatively weak
distributional assumptions. The usual statistics for tests of statistical significance of
regressors reported by computer packages are examples of Wald test statistics.

This section presents the Wald test of nonlinear hypotheses in considerable detail,
presenting both theory and examples. The closely related delta method, used to form
confidence intervals or regions for nonlinear functions of parameters, is also presented.
A weakness of the Wald test, its lack of invariance to algebraically equivalent
parameterizations of the null hypothesis, is detailed at the end of the section.


7.2.1.  Linear  Hypotheses  in  Linear  Models 

We  first  review  standard  linear  model  results,  as  the  Wald  test  is  a  generalization  of  the 
usual  test  for  linear  restrictions  in  the  linear  regression  model. 

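The null and alternative hypotheses for a two-sided test of linear restrictions on the
regression parameters in the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$ are

$$\begin{aligned} H_0 &: \mathbf{R}\boldsymbol\beta_0 - \mathbf{r} = \mathbf{0}, \\ H_a &: \mathbf{R}\boldsymbol\beta_0 - \mathbf{r} \neq \mathbf{0}, \end{aligned} \tag{7.1}$$

where in the notation used here there are $h$ restrictions, $\mathbf{R}$ is an $h \times K$ matrix of constants
of full rank $h$, $\boldsymbol\beta$ is the $K \times 1$ parameter vector, $\mathbf{r}$ is an $h \times 1$ vector of constants,
and $h \leq K$.

For example, a joint test that $\beta_1 = 1$ and $\beta_2 - \beta_3 = 2$ when $K = 4$ can be expressed
as (7.1) with

$$\mathbf{R} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \end{bmatrix}, \qquad \mathbf{r} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$

The Wald test of $\mathbf{R}\boldsymbol\beta_0 - \mathbf{r} = \mathbf{0}$ is a test of closeness to zero of the sample analogue
$\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r}$, where $\hat{\boldsymbol\beta}$ is the unrestricted OLS estimator. Under the strong assumption that
$\mathbf{u} \sim \mathcal{N}[\mathbf{0}, \sigma_0^2\mathbf{I}]$, the estimator $\hat{\boldsymbol\beta} \sim \mathcal{N}[\boldsymbol\beta_0, \sigma_0^2(\mathbf{X}'\mathbf{X})^{-1}]$ and so

$$\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r} \sim \mathcal{N}\left[ \mathbf{0}, \sigma_0^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' \right],$$

under $H_0$, where $\mathbf{R}\boldsymbol\beta_0 - \mathbf{r} = \mathbf{0}$ has led to simplification to a mean of $\mathbf{0}$. Taking the
quadratic form leads to the test statistic

$$W_1 = (\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r})' \left[ \sigma_0^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' \right]^{-1} (\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r}),$$

which is exactly $\chi^2(h)$ distributed under $H_0$. In practice the test statistic $W_1$ cannot be
calculated, however, as $\sigma_0^2$ is not known.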




In large samples replacing $\sigma_0^2$ by its estimate $s^2$ does not affect the limit distribution
of $W_1$, since this is equivalent to premultiplication of $W_1$ by $\sigma_0^2/s^2$ and $\text{plim}(\sigma_0^2/s^2) =
1$ (see the Transformation Theorem A.12). Thus

$$W_2 = (\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r})' \left[ s^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' \right]^{-1} (\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r}) \tag{7.2}$$

converges to the $\chi^2(h)$ distribution under $H_0$.

The test statistic $W_2$ is chi-square distributed only asymptotically. In this linear
example with normal errors an alternative exact small-sample result can be obtained.
A standard result derived in many introductory texts is that

$$W_3 = W_2/h$$

is exactly $F(h, N - K)$ distributed under $H_0$, if $s^2 = (N - K)^{-1} \sum_i \hat{u}_i^2$, where $\hat{u}_i$ is
the OLS residual. This is the familiar $F$-test statistic, which is often reexpressed in
terms of sums of squared residuals.
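
The statistics $W_2$ and $W_3$ can be computed directly from the formulas above. The sketch below, in Python with NumPy, is illustrative only: the function names `linear_wald` and `ssr` and the simulated data are assumptions for the example. It also checks numerically the remark that the $F$-statistic can be reexpressed in terms of sums of squared residuals, here for a test that the last two of four coefficients are zero.

```python
import numpy as np

def linear_wald(y, X, R, r):
    """Return (W2, W3) for H0: R beta - r = 0 in the linear model y = X beta + u."""
    N, K = X.shape
    h = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y            # unrestricted OLS estimator
    resid = y - X @ beta_hat
    s2 = resid @ resid / (N - K)            # s^2 with degrees-of-freedom correction
    diff = R @ beta_hat - r
    V_diff = s2 * (R @ XtX_inv @ R.T)       # estimated variance of R beta-hat - r
    W2 = float(diff @ np.linalg.solve(V_diff, diff))
    return W2, W2 / h

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return float(e @ e)

# Simulated data for illustration (not from the text).
rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(2)
W2, W3 = linear_wald(y, X, R, r)

# Same F-statistic from restricted and unrestricted sums of squared residuals.
F_ssr = ((ssr(y, X[:, :2]) - ssr(y, X)) / 2) / (ssr(y, X) / (N - 4))
```

Under these assumptions `W3` and `F_ssr` agree up to floating-point error, since the two expressions are algebraically identical for exclusion restrictions.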

Exact results such as that for $W_3$ are not possible in nonlinear models, and even in
linear models they require very strong assumptions. Instead, the nonlinear analogue of
$W_2$ is employed, with distributional results that are asymptotic only.


7.2.2.  Nonlinear  Hypotheses 

We consider hypothesis tests of $h$ restrictions, possibly nonlinear in parameters, on
the $q \times 1$ parameter vector $\boldsymbol\theta$, where $h \leq q$. For linear regression $\boldsymbol\theta = \boldsymbol\beta$ and $q = K$.
The null and alternative hypotheses for a two-sided test are

$$\begin{aligned} H_0 &: \mathbf{h}(\boldsymbol\theta_0) = \mathbf{0}, \\ H_a &: \mathbf{h}(\boldsymbol\theta_0) \neq \mathbf{0}, \end{aligned}$$

where $\mathbf{h}(\cdot)$ is an $h \times 1$ vector function of $\boldsymbol\theta$. Note that $\mathbf{h}(\boldsymbol\theta)$ in this chapter is used to
denote the restrictions of the null hypothesis. This should not be confused with the use
of $\mathbf{h}(\mathbf{w}, \boldsymbol\theta)$ in the previous chapter to denote the moment conditions used to form an
MM or GMM estimator.

Familiar linear examples include tests of statistical significance of a single coefficient,
$h(\boldsymbol\theta) = \theta_j = 0$, and tests of subsets of coefficients, $\mathbf{h}(\boldsymbol\theta) = \boldsymbol\theta_2 = \mathbf{0}$. A nonlinear
example of a single restriction is $h(\boldsymbol\theta) = \theta_1/\theta_2 - 1 = 0$. These examples are studied
in later sections.

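It is assumed that $\mathbf{h}(\boldsymbol\theta)$ is such that the $h \times q$ matrix

$$\mathbf{R}(\boldsymbol\theta) = \frac{\partial\mathbf{h}(\boldsymbol\theta)}{\partial\boldsymbol\theta'} \tag{7.4}$$

is of full rank $h$ when evaluated at $\boldsymbol\theta = \boldsymbol\theta_0$. This assumption is equivalent to linear independence
of restrictions in the linear model, in which case $\mathbf{R}(\boldsymbol\theta) = \mathbf{R}$ does not depend
on $\boldsymbol\theta$ and has rank $h$. It is also assumed that the parameters are not at the boundary
of the parameter space under the null hypothesis. This rules out, for example, testing
$H_0 : \theta_1 = 0$ if the model requires $\theta_1 \geq 0$.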
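
7.2.3. Wald Test Statistic

The intuition behind the Wald test is very simple. The obvious test of whether $\mathbf{h}(\boldsymbol\theta_0) =
\mathbf{0}$ is to obtain estimate $\hat{\boldsymbol\theta}$ without imposing the restrictions and see whether $\mathbf{h}(\hat{\boldsymbol\theta}) \simeq \mathbf{0}$.
If $\mathbf{h}(\hat{\boldsymbol\theta}) \stackrel{a}{\sim} \mathcal{N}[\mathbf{0}, V[\mathbf{h}(\hat{\boldsymbol\theta})]]$ under $H_0$ then the test statistic

$$W = \mathbf{h}(\hat{\boldsymbol\theta})' \left[ V[\mathbf{h}(\hat{\boldsymbol\theta})] \right]^{-1} \mathbf{h}(\hat{\boldsymbol\theta}) \stackrel{a}{\sim} \chi^2(h).$$

The only complication is finding $V[\mathbf{h}(\hat{\boldsymbol\theta})]$, which will depend on the restrictions $\mathbf{h}(\cdot)$
and the estimator $\hat{\boldsymbol\theta}$.

By a first-order Taylor series expansion (see Section 7.2.4) under the null hypothesis,
$\mathbf{h}(\hat{\boldsymbol\theta})$ has the same limit distribution as $\mathbf{R}(\boldsymbol\theta_0)(\hat{\boldsymbol\theta} - \boldsymbol\theta_0)$, where $\mathbf{R}(\boldsymbol\theta)$ is defined in
(7.4). Then $\mathbf{h}(\hat{\boldsymbol\theta})$ is asymptotically normal under $H_0$ with mean zero and variance matrix
$\mathbf{R}(\boldsymbol\theta_0)V[\hat{\boldsymbol\theta}]\mathbf{R}(\boldsymbol\theta_0)'$. A consistent estimate is $\hat{\mathbf{R}}N^{-1}\hat{\mathbf{C}}\hat{\mathbf{R}}'$, where $\hat{\mathbf{R}} = \mathbf{R}(\hat{\boldsymbol\theta})$ and it is
assumed that the estimator $\hat{\boldsymbol\theta}$ is root-$N$ consistent with

$$\sqrt{N}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \stackrel{d}{\to} \mathcal{N}[\mathbf{0}, \mathbf{C}_0], \tag{7.5}$$

and $\hat{\mathbf{C}}$ is any consistent estimate of $\mathbf{C}_0$.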


Common  Versions  of  the  Wald  Test 

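The preceding discussion leads to the Wald test statistic

$$W = N\hat{\mathbf{h}}' \left[ \hat{\mathbf{R}}\hat{\mathbf{C}}\hat{\mathbf{R}}' \right]^{-1} \hat{\mathbf{h}}, \tag{7.6}$$

where $\hat{\mathbf{h}} = \mathbf{h}(\hat{\boldsymbol\theta})$ and $\hat{\mathbf{R}} = \partial\mathbf{h}(\boldsymbol\theta)/\partial\boldsymbol\theta'|_{\hat{\boldsymbol\theta}}$. An equivalent expression is $W = \hat{\mathbf{h}}'[\hat{\mathbf{R}}\hat{V}[\hat{\boldsymbol\theta}]\hat{\mathbf{R}}']^{-1}\hat{\mathbf{h}}$,
where $\hat{V}[\hat{\boldsymbol\theta}] = N^{-1}\hat{\mathbf{C}}$ is the estimated asymptotic variance of $\hat{\boldsymbol\theta}$.

The test statistic $W$ is asymptotically $\chi^2(h)$ distributed under $H_0$. So $H_0$ is rejected
against $H_a$ at significance level $\alpha$ if $W > \chi^2_\alpha(h)$ and is not rejected otherwise. Equivalently,
$H_0$ is rejected at level $\alpha$ if the $p$-value, which equals $\Pr[\chi^2(h) > W]$, is less
than $\alpha$.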
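
As a concrete illustration of the second form of (7.6) and the $p$-value rule, the following Python sketch computes $W$ and $\Pr[\chi^2(h) > W]$. The function names `chi2_sf` and `wald_test` and the numerical inputs are assumptions made for the example. To keep the block dependent only on NumPy and the standard library, the chi-square tail probability is computed from finite-sum closed forms that are valid for integer degrees of freedom; a statistics library would normally be used instead.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Pr[chi2(df) > x] for x >= 0 and integer df >= 1, via finite-sum closed forms."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-y) * sum_{j=0}^{m-1} y^j / j!
        return math.exp(-y) * sum(y**j / math.factorial(j) for j in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(y)) + exp(-y) * sum_{j=1}^{m} y^(j - 1/2) / Gamma(j + 1/2)
    m = (df - 1) // 2
    tail = sum(y ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def wald_test(h_hat, R_hat, V_hat, alpha=0.05):
    """Wald statistic W = h'[R V R']^{-1} h, its p-value, and the reject decision.

    h_hat : restrictions evaluated at theta-hat, length h
    R_hat : h x q derivative matrix evaluated at theta-hat
    V_hat : q x q estimated variance of theta-hat (that is, N^{-1} C-hat)
    """
    h_hat = np.atleast_1d(np.asarray(h_hat, dtype=float))
    R_hat = np.atleast_2d(np.asarray(R_hat, dtype=float))
    W = float(h_hat @ np.linalg.solve(R_hat @ V_hat @ R_hat.T, h_hat))
    p_value = chi2_sf(W, h_hat.size)
    return W, p_value, p_value < alpha

# Made-up estimates: joint test that both components of theta are zero.
theta_hat = np.array([0.5, 0.2])
V_hat = np.diag([0.04, 0.01])
W, p_value, reject = wald_test(theta_hat, np.eye(2), V_hat)
```

Passing a robust estimate as `V_hat` gives the robustified version of the test without any other change.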

One can also implement the Wald test statistic as an $F$-test. The Wald asymptotic
$F$-statistic

$$F = W/h \tag{7.7}$$

is asymptotically $F(h, N - q)$ distributed. This yields the same $p$-value as $W$ in (7.6)
as $N \to \infty$, though in finite samples the $p$-values will differ. For nonlinear models it
is most common to report $W$, though $F$ is also used in the hope that it might provide a
better approximation in small samples.

For a test of just one restriction, the square root of the Wald chi-square test is a
standard normal test statistic. This result is useful as it permits testing a one-sided
hypothesis. Specifically, for scalar $h(\boldsymbol\theta)$ the Wald $z$-test statistic is

$$W_z = \frac{\hat{h}}{\sqrt{\hat{\mathbf{r}}N^{-1}\hat{\mathbf{C}}\hat{\mathbf{r}}'}}, \tag{7.8}$$

where $\hat{h} = h(\hat{\boldsymbol\theta})$ and $\hat{\mathbf{r}} = \partial h(\boldsymbol\theta)/\partial\boldsymbol\theta'|_{\hat{\boldsymbol\theta}}$ is a $1 \times q$ vector. Result (7.6) implies that
$W_z$ is asymptotically standard normal distributed under $H_0$. Equivalently, $W_z$ is
asymptotically $t$ distributed with $(N - q)$ degrees of freedom, since the $t$ goes to the
normal as $N \to \infty$. So $W_z$ can also be a Wald $t$-test statistic.


Discussion 

The Wald test statistic (7.6) for the nonlinear case has the same form as the linear
model statistic $W_2$ given in (7.2). The estimated deviation from the null hypothesis is
$\mathbf{h}(\hat{\boldsymbol\theta})$ rather than $(\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{r})$. The matrix $\mathbf{R}$ is replaced by the estimated derivative matrix
$\hat{\mathbf{R}}$, and the assumption that $\mathbf{R}$ is of full rank is replaced by the assumption that $\mathbf{R}_0$ is of
full rank. Finally, the estimated asymptotic variance of the estimator is $N^{-1}\hat{\mathbf{C}}$ rather
than $s^2(\mathbf{X}'\mathbf{X})^{-1}$.

There is a range of possible consistent estimates of $\mathbf{C}_0$ (see Section 5.5.2), leading
in practice to different computed values of $W$ or $F$ or $W_z$ that are asymptotically
equivalent. In particular, $\mathbf{C}_0$ is often of the sandwich form $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$, consistently
estimated by a robust estimate $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$. An advantage of the Wald test is that it is easy
to robustify to ensure valid statistical inference under relatively weak distributional
assumptions, such as potentially heteroskedastic errors.

Rejection of $H_0$ is more likely the larger is $W$ or $F$ or, for two-sided tests, $W_z$.
This happens the further $\mathbf{h}(\hat{\boldsymbol\theta})$ is from the null hypothesis value $\mathbf{0}$; the more efficient
the estimator $\hat{\boldsymbol\theta}$, since then $\hat{\mathbf{C}}$ is small; and the larger the sample size, since then $N^{-1}$
is small. The last result is a consequence of testing at unchanged significance level
$\alpha$ as sample size increases. In principle one could decrease $\alpha$ as the sample size is
increased. Such penalties for fully parametric models are presented in Section 8.5.1.


7.2.4.  Derivation  of  the  Wald  Statistic 
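
By an exact first-order Taylor series expansion around $\boldsymbol\theta_0$

$$\mathbf{h}(\hat{\boldsymbol\theta}) = \mathbf{h}(\boldsymbol\theta_0) + \left. \frac{\partial\mathbf{h}}{\partial\boldsymbol\theta'} \right|_{\boldsymbol\theta^+} (\hat{\boldsymbol\theta} - \boldsymbol\theta_0),$$

for some $\boldsymbol\theta^+$ between $\hat{\boldsymbol\theta}$ and $\boldsymbol\theta_0$. It follows that

$$\sqrt{N}(\mathbf{h}(\hat{\boldsymbol\theta}) - \mathbf{h}(\boldsymbol\theta_0)) = \mathbf{R}(\boldsymbol\theta^+)\sqrt{N}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0),$$

where $\mathbf{R}(\boldsymbol\theta)$ is defined in (7.4), which implies that

$$\sqrt{N}(\mathbf{h}(\hat{\boldsymbol\theta}) - \mathbf{h}(\boldsymbol\theta_0)) \stackrel{d}{\to} \mathcal{N}\left[ \mathbf{0}, \mathbf{R}_0\mathbf{C}_0\mathbf{R}_0' \right] \tag{7.9}$$

by direct application of the limit normal product rule (Theorem A.7) as $\mathbf{R}(\boldsymbol\theta^+) \stackrel{p}{\to}
\mathbf{R}_0 = \mathbf{R}(\boldsymbol\theta_0)$ and using the limit distribution for $\sqrt{N}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0)$ given in (7.5).

Under the null hypothesis (7.9) simplifies since $\mathbf{h}(\boldsymbol\theta_0) = \mathbf{0}$, and hence

$$\sqrt{N}\mathbf{h}(\hat{\boldsymbol\theta}) \stackrel{d}{\to} \mathcal{N}\left[ \mathbf{0}, \mathbf{R}_0\mathbf{C}_0\mathbf{R}_0' \right] \tag{7.10}$$

under $H_0$. One could in theory use this multivariate normal distribution to define a
rejection region, but it is much simpler to transform to a chi-square distribution. Recall
that $\mathbf{z} \sim \mathcal{N}[\mathbf{0}, \boldsymbol\Omega]$ with $\boldsymbol\Omega$ of full rank implies $\mathbf{z}'\boldsymbol\Omega^{-1}\mathbf{z} \sim \chi^2(\dim(\boldsymbol\Omega))$. Then (7.10)
implies that

$$N\mathbf{h}(\hat{\boldsymbol\theta})' \left[ \mathbf{R}_0\mathbf{C}_0\mathbf{R}_0' \right]^{-1} \mathbf{h}(\hat{\boldsymbol\theta}) \stackrel{d}{\to} \chi^2(h),$$

under $H_0$, where the matrix inverse in this expression exists by the assumptions that $\mathbf{R}_0$
and $\mathbf{C}_0$ are of full rank. The Wald statistic defined in (7.6) is obtained upon replacing
$\mathbf{R}_0$ and $\mathbf{C}_0$ by consistent estimates.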


7.2.5.  Wald  Test  Examples 

The most common tests are tests of one or more exclusion restrictions. We also provide
an example of a test of a nonlinear hypothesis.


Tests  of  Exclusion  Restrictions 


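Consider the exclusion restrictions that the last $h$ components of $\boldsymbol\theta$ are equal to zero.
Then $\mathbf{h}(\boldsymbol\theta) = \boldsymbol\theta_2 = \mathbf{0}$ where we partition $\boldsymbol\theta = (\boldsymbol\theta_1', \boldsymbol\theta_2')'$. It follows that

$$\mathbf{R}(\boldsymbol\theta) = \frac{\partial\mathbf{h}(\boldsymbol\theta)}{\partial\boldsymbol\theta'} = \begin{bmatrix} \dfrac{\partial\boldsymbol\theta_2}{\partial\boldsymbol\theta_1'} & \dfrac{\partial\boldsymbol\theta_2}{\partial\boldsymbol\theta_2'} \end{bmatrix} = [\mathbf{0} \;\; \mathbf{I}_h],$$

where $\mathbf{0}$ is an $h \times (q - h)$ matrix of zeros and $\mathbf{I}_h$ is an $h \times h$ identity matrix, so

$$\mathbf{R}(\hat{\boldsymbol\theta})\mathbf{C}(\hat{\boldsymbol\theta})\mathbf{R}(\hat{\boldsymbol\theta})' = [\mathbf{0} \;\; \mathbf{I}_h] \begin{bmatrix} \hat{\mathbf{C}}_{11} & \hat{\mathbf{C}}_{12} \\ \hat{\mathbf{C}}_{21} & \hat{\mathbf{C}}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{0} \\ \mathbf{I}_h \end{bmatrix} = \hat{\mathbf{C}}_{22}.$$

The Wald test statistic for exclusion restrictions is therefore

$$W = \hat{\boldsymbol\theta}_2' \left[ N^{-1}\hat{\mathbf{C}}_{22} \right]^{-1} \hat{\boldsymbol\theta}_2, \tag{7.11}$$

where $N^{-1}\hat{\mathbf{C}}_{22} = \hat{V}[\hat{\boldsymbol\theta}_2]$, and is asymptotically distributed as $\chi^2(h)$ under $H_0$.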

This  test  statistic  is  a  generalization  of  the  test  of  subsets  of  regressors  in  the  linear 
regression  model.  In  that  case  small-sample  results  are  available  if  errors  are  normally 
distributed  and  the  related  F-test  is  instead  used. 
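
The block structure behind (7.11) can be checked numerically. The following Python sketch is illustrative only: the function name `wald_exclusion` and all numbers are assumptions. It computes the statistic from the lower-right block of the variance estimate and compares it with the general form (7.6) using the selection matrix $[\mathbf{0} \;\; \mathbf{I}_h]$.

```python
import numpy as np

def wald_exclusion(theta_hat, C_hat, N, h):
    """Wald statistic (7.11) for H0: the last h components of theta equal zero."""
    theta2 = theta_hat[-h:]
    V22 = C_hat[-h:, -h:] / N          # N^{-1} C22-hat, estimated variance of theta2-hat
    return float(theta2 @ np.linalg.solve(V22, theta2))

# Made-up numbers for illustration only.
q, h, N = 4, 2, 250
rng = np.random.default_rng(2)
A = rng.normal(size=(q, q))
C_hat = A @ A.T + q * np.eye(q)        # arbitrary symmetric positive definite matrix
theta_hat = np.array([0.8, -0.4, 0.15, 0.05])

# The selection matrix R = [0  I_h] picks out the lower-right block C22.
R = np.hstack([np.zeros((h, q - h)), np.eye(h)])
W_block = wald_exclusion(theta_hat, C_hat, N, h)
W_general = float(N * (R @ theta_hat) @ np.linalg.solve(R @ C_hat @ R.T, R @ theta_hat))
```

The two values coincide, which is why software can test a subset of coefficients using only the corresponding block of the estimated variance matrix.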


Tests  of  Statistical  Significance 


Tests of significance of a single coefficient are tests of whether or not $\theta_j$, the $j$th
component of $\boldsymbol\theta$, differs from zero. Then $h(\boldsymbol\theta) = \theta_j$ and $\mathbf{r}(\boldsymbol\theta) = \partial h/\partial\boldsymbol\theta'$ is a vector of
zeros except for a $j$th entry of 1, so (7.8) simplifies to

$$W_z = \frac{\hat\theta_j}{\text{se}[\hat\theta_j]}, \tag{7.12}$$

where $\text{se}[\hat\theta_j] = \sqrt{N^{-1}\hat{c}_{jj}}$ is the standard error of $\hat\theta_j$ and $\hat{c}_{jj}$ is the $j$th diagonal entry
in $\hat{\mathbf{C}}$.

The test statistic $W_z$ in (7.12) is often called a "$t$-statistic", owing to results for
the linear regression model under normality, but strictly speaking it is an asymptotic
"$z$-statistic."

For a two-sided test of $H_0 : \theta_{j0} = 0$ against $H_a : \theta_{j0} \neq 0$, $H_0$ is rejected at significance
level $\alpha$ if $|W_z| > z_{\alpha/2}$ and is not rejected otherwise. This yields exactly the same
results as the Wald chi-square test, since $W_z^2 = W$, where $W$ is defined in (7.6), and
$z_{\alpha/2}^2 = \chi^2_\alpha(1)$.

Often there is prior information about the sign of $\theta_j$. Then one should use a one-sided
hypothesis test. For example, suppose it is felt based on economic reasoning or
past studies that $\theta_j > 0$. It makes a difference whether $\theta_j > 0$ is specified to be the null
or the alternative hypothesis. For one-sided tests it is customary to specify the claim
made as the alternative hypothesis, as it can be shown that then stronger evidence is
required to support the claim. Here $H_0 : \theta_{j0} \leq 0$ is rejected against $H_a : \theta_{j0} > 0$ at
significance level $\alpha$ if $W_z > z_\alpha$. Similarly, for a claim that $\theta_j < 0$, test $H_0 : \theta_{j0} \geq 0$
against $H_a : \theta_{j0} < 0$ and reject $H_0$ at significance level $\alpha$ if $W_z < -z_\alpha$.

Computer output usually gives the $p$-value for a two-sided test, but in many cases
it is more appropriate to use a one-sided test. If $\hat\theta_j$ has the "correct" sign then the
$p$-value for the one-sided test is half that reported for a two-sided test.
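
The relation between two-sided and one-sided $p$-values can be made concrete with a short Python sketch that uses only the standard library. The function names `normal_sf` and `wald_z_pvalues` and the input values are assumptions made for illustration; the $p$-values rely on the asymptotic standard normal approximation for $W_z$.

```python
import math

def normal_sf(z):
    """Pr[Z > z] for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def wald_z_pvalues(theta_hat_j, se_j):
    """Wald z-statistic (7.12) with two-sided and one-sided p-values."""
    w_z = theta_hat_j / se_j
    return {
        "w_z": w_z,
        "two_sided": 2.0 * normal_sf(abs(w_z)),  # H0: theta_j = 0 vs Ha: theta_j != 0
        "upper": normal_sf(w_z),                 # H0: theta_j <= 0 vs Ha: theta_j > 0
        "lower": normal_sf(-w_z),                # H0: theta_j >= 0 vs Ha: theta_j < 0
    }

# Made-up estimate and standard error, giving W_z = 2.
res = wald_z_pvalues(0.30, 0.15)
```

Here the estimate is positive, so the `upper` $p$-value (for the claim $\theta_j > 0$) is exactly half the two-sided value, while the `lower` $p$-value is close to one.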


Tests  of  Nonlinear  Restriction 


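Consider a test of the single nonlinear restriction

$$H_0 : h(\boldsymbol\theta) = \theta_1/\theta_2 - 1 = 0.$$

Then $\mathbf{R}(\boldsymbol\theta)$ is a $1 \times q$ vector with first element $\partial h/\partial\theta_1 = 1/\theta_2$, second element
$\partial h/\partial\theta_2 = -\theta_1/\theta_2^2$, and remaining elements zero. By letting $\hat{c}_{jk}$ denote the $jk$th element
of $\hat{\mathbf{C}}$, (7.6) becomes

$$W = N \left( \frac{\hat\theta_1}{\hat\theta_2} - 1 \right)^2 \left( \begin{bmatrix} \dfrac{1}{\hat\theta_2} & -\dfrac{\hat\theta_1}{\hat\theta_2^2} & \mathbf{0}' \end{bmatrix} \begin{bmatrix} \hat{c}_{11} & \hat{c}_{12} & \cdots \\ \hat{c}_{21} & \hat{c}_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} 1/\hat\theta_2 \\ -\hat\theta_1/\hat\theta_2^2 \\ \mathbf{0} \end{bmatrix} \right)^{-1},$$

where $\mathbf{0}$ is a $(q - 2) \times 1$ vector of zeros, yielding

$$W = N[\hat\theta_2(\hat\theta_1 - \hat\theta_2)]^2 (\hat\theta_2^2\hat{c}_{11} - 2\hat\theta_1\hat\theta_2\hat{c}_{12} + \hat\theta_1^2\hat{c}_{22})^{-1}, \tag{7.13}$$

which is asymptotically $\chi^2(1)$ distributed under $H_0$. Equivalently, $\sqrt{W}$ is asymptotically
standard normal distributed.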
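
The algebra leading from (7.6) to (7.13) can be verified numerically. The sketch below, in Python with NumPy, evaluates the general quadratic form with the derivative vector $(1/\hat\theta_2, -\hat\theta_1/\hat\theta_2^2, 0, \ldots)$ and compares it with the closed form (7.13). The function names and the numerical values of $\hat{\boldsymbol\theta}$, $\hat{\mathbf{C}}$, and $N$ are assumptions chosen for illustration.

```python
import numpy as np

def wald_ratio_general(theta_hat, C_hat, N):
    """W from (7.6) for the restriction h(theta) = theta_1/theta_2 - 1 = 0."""
    t1, t2 = theta_hat[0], theta_hat[1]
    h_hat = t1 / t2 - 1.0
    R_hat = np.zeros(len(theta_hat))
    R_hat[0] = 1.0 / t2                # dh/dtheta_1
    R_hat[1] = -t1 / t2**2             # dh/dtheta_2
    return N * h_hat**2 / (R_hat @ C_hat @ R_hat)

def wald_ratio_closed_form(theta_hat, C_hat, N):
    """W from the closed-form expression (7.13)."""
    t1, t2 = theta_hat[0], theta_hat[1]
    denom = t2**2 * C_hat[0, 0] - 2 * t1 * t2 * C_hat[0, 1] + t1**2 * C_hat[1, 1]
    return N * (t2 * (t1 - t2))**2 / denom

# Made-up estimates with q = 3 parameters.
theta_demo = np.array([1.2, 1.0, 0.3])
C_demo = np.array([[2.0, 0.5, 0.1],
                   [0.5, 1.0, 0.2],
                   [0.1, 0.2, 1.5]])
W_general = wald_ratio_general(theta_demo, C_demo, N=500)
W_closed = wald_ratio_closed_form(theta_demo, C_demo, N=500)
```

Only the upper-left $2 \times 2$ block of $\hat{\mathbf{C}}$ enters the statistic, because the restriction involves only $\theta_1$ and $\theta_2$.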


7.2.6.  Tests  in  Misspecified  Models 

Most treatments of hypothesis testing, including that given in Chapters 7 and 8 of
this book, assume that the null hypothesis model is correctly specified, aside from
relatively minor misspecification that does not affect estimator consistency but requires
robustification of standard errors.

In practice this is a considerable oversimplification. For example, in testing for
heteroskedastic errors it is assumed that this is the only respect in which the regression
is deficient. However, if the conditional mean is misspecified then the true size of
the test will differ from the nominal size, even asymptotically. Moreover, asymptotic
equivalence of tests, such as that for the Wald, likelihood ratio, and Lagrange multiplier
tests, will no longer hold. The better specified the model, however, the more
useful are the tests.

Also, note that tests often have some power against hypotheses other than the explicitly
stated alternative hypothesis. For example, suppose the null hypothesis model
is $y = \beta_1 + \beta_2 x + u$, where $u$ is homoskedastic. A test of whether to also include $z$ as
a regressor will also have some power against the alternative that the model is nonlinear
in $x$, for example $y = \beta_1 + \beta_2 x + \beta_3 x^2 + u$, if $x$ and $z$ are correlated. Similarly, a
test against heteroskedastic errors will also have some power against nonlinearity in $x$.
Rejection of the null hypothesis does not mean that the alternative hypothesis model
is the only possible model.


7.2.7.  Joint  Versus  Separate  Tests 

In  applied  work  one  often  wants  to  know  which  coefficients  out  of  a  set  of  coefficients 
are  “significant.”  When  there  are  several  hypotheses  under  test,  one  can  either  do  a 
joint  test  or  simultaneous  test  of  all  hypotheses  of  interest  or  perform  separate  tests 
of  the  hypotheses. 

A leading example in linear regression concerns the use of separate $t$-tests for testing
the null hypotheses $H_{10} : \beta_1 = 0$ and $H_{20} : \beta_2 = 0$ versus using an $F$-test of the
joint hypothesis $H_0 : \beta_1 = \beta_2 = 0$, where throughout the alternative is that at least
one of the parameters does not equal zero. The $F$-test is an explicit joint test, with
rejection of $H_0$ if the estimated point $(\hat\beta_1, \hat\beta_2)$ falls outside an elliptical probability
contour. Alternatively, the two separate $t$-tests can be conducted. This procedure is an
implicit joint test, called an induced test (Savin, 1984). The separate tests reject $H_0$ if
either $H_{10}$ or $H_{20}$ is rejected, which occurs if $(\hat\beta_1, \hat\beta_2)$ falls outside a rectangle whose
boundaries are the critical values of the two test statistics. Even if the same significance
level is used to test $H_0$, so that the ellipse and rectangle have the same area,
the rejection regions for the joint and separate tests differ and there is a potential for a
conflict between them. For example, $(\hat\beta_1, \hat\beta_2)$ may lie within the ellipse but outside the
rectangle.

Let $e_1$ and $e_2$ denote the event of type I error (see Section 7.5.1) in the two separate
tests, and let $e_I = e_1 \cup e_2$ denote the event of a type I error in the induced joint test.
Then $\Pr[e_I] = \Pr[e_1] + \Pr[e_2] - \Pr[e_1 \cap e_2]$, which implies that

$$\alpha_I \le \alpha_1 + \alpha_2, \tag{7.14}$$

where $\alpha_I$, $\alpha_1$, and $\alpha_2$ denote the sizes of, respectively, the induced joint test, the first
separate test, and the second separate test. In the special case where the separate tests
are statistically independent, $\Pr[e_1 \cap e_2] = \Pr[e_1]\Pr[e_2] = \alpha_1\alpha_2$ and hence
$\alpha_I = \alpha_1 + \alpha_2 - \alpha_1\alpha_2$. For typically low values of $\alpha_1$ and $\alpha_2$, such as .05 or .01, $\alpha_1\alpha_2$ is very
small and the upper bound (7.14) is a good indicator of the size of the test.
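
The arithmetic behind (7.14) is easy to check numerically. The following is a minimal sketch in Python; the function names are illustrative rather than taken from the text, and the exact formula applies only in the independent special case described above.

```python
def bonferroni_bound(alpha1, alpha2):
    """Upper bound (7.14) on the size of the induced joint test."""
    return alpha1 + alpha2


def induced_size_independent(alpha1, alpha2):
    """Size of the induced test when the two separate tests are independent."""
    return alpha1 + alpha2 - alpha1 * alpha2


for a in (0.05, 0.01):
    bound = bonferroni_bound(a, a)
    exact = induced_size_independent(a, a)
    print(f"alpha1 = alpha2 = {a}: bound = {bound:.4f}, independent case = {exact:.6f}")
```

With $\alpha_1 = \alpha_2 = .05$ the independent-case size is .0975 against a bound of .10, so the bound is only slightly conservative.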

A substantial literature on induced tests examines the problem of choosing critical
values for the separate tests such that the induced test has a known size. We do not pursue
this issue at length but mention the Bonferroni $t$-test as an example. The critical
values of this test have been tabulated; see Savin (1984).

Statistically independent tests arise in linear regression with orthogonal regressors
and in likelihood-based testing (see Section 7.3) if relevant parts of the information
matrix are diagonal. Then the induced joint test statistic is based on the two statistically
independent separate test statistics, whereas the explicit joint null test statistic is the
sum of the two separate test statistics. The joint null may be rejected because either
one component or both components of the null are rejected. The use of separate tests
will reveal which situation applies.

In the more general case of correlated regressors or a nondiagonal information matrix,
the explicit joint test suffers from the disadvantage that the rejection of the null
does not indicate the source of the rejection. If the induced joint test is used then setting
the size of the test requires some variant of the Bonferroni test or approximation
using the upper bound in (7.14). Similar issues also arise when separate tests are applied
sequentially, with each stage conditioned on the outcome of the previous stage.
Section 18.7.1 presents an example with discussion of a joint test of two hypotheses
where the two components of the test are correlated.

7.2.8.  Delta  Method  for  Confidence  Intervals 

The method used to derive the Wald test statistic is called the delta method, as Taylor
series approximation of $h(\hat{\theta})$ entails taking the derivative of $h(\theta)$. This method can
also be used to obtain the distribution of a nonlinear combination of parameters and
hence form confidence intervals or regions.

One example is estimating the ratio $\theta_1/\theta_2$ by $\hat{\theta}_1/\hat{\theta}_2$. A second example is prediction
of the conditional mean $g(x'\beta)$, say, using $g(x'\hat{\beta})$. A third example is the estimated
elasticity with respect to change in one component of $x$.

Confidence  Intervals 

Consider inference on the parameter vector $\gamma = h(\theta)$ that is estimated by

$$\hat{\gamma} = h(\hat{\theta}), \tag{7.15}$$

where the limit distribution of $\sqrt{N}(\hat{\theta} - \theta_0)$ is that given in (7.5). Then direct application
of (7.9) yields $\sqrt{N}(\hat{\gamma} - \gamma_0) \xrightarrow{d} \mathcal{N}[0, R_0 C_0 R_0']$, where $R(\theta)$ is defined in
(7.4). Equivalently, we say that $\hat{\gamma}$ is asymptotically normally distributed with estimated
asymptotic variance matrix

$$\hat{V}[\hat{\gamma}] = \hat{R} N^{-1} \hat{C} \hat{R}', \tag{7.16}$$

a result that can be used to form confidence intervals or regions.

In particular, a $100(1 - \alpha)\%$ confidence interval for the scalar parameter $\gamma$ is

$$\gamma \in \hat{\gamma} \pm z_{\alpha/2}\, \mathrm{se}[\hat{\gamma}], \tag{7.17}$$

where

$$\mathrm{se}[\hat{\gamma}] = \sqrt{\hat{r} N^{-1} \hat{C} \hat{r}'}, \tag{7.18}$$

where $\hat{r} = r(\hat{\theta})$ and $r(\theta) = \partial\gamma/\partial\theta' = \partial h(\theta)/\partial\theta'$.

Confidence Interval Examples

As an example, suppose that $E[y|x] = \exp(x'\beta)$ and we wish to obtain a confidence
interval for the predicted conditional mean when $x = x_p$. Then $h(\beta) = \exp(x_p'\beta)$, so
$\partial h/\partial\beta' = \exp(x_p'\beta)x_p'$ and (7.18) yields

$$\mathrm{se}[\exp(x_p'\hat{\beta})] = \exp(x_p'\hat{\beta})\sqrt{x_p' N^{-1} \hat{C} x_p},$$

where $\hat{C}$ is a consistent estimate of the variance matrix in the limit distribution of
$\sqrt{N}(\hat{\beta} - \beta_0)$.

As a second example, suppose we wish to obtain a confidence interval for $e^{\beta}$ rather
than for $\beta$, a scalar coefficient. Then $h(\beta) = e^{\beta}$, so $\partial h/\partial\beta = e^{\beta}$ and (7.18) yields
$\mathrm{se}[e^{\hat{\beta}}] = e^{\hat{\beta}}\mathrm{se}[\hat{\beta}]$. This yields a 95% confidence interval for $e^{\beta}$ of $e^{\hat{\beta}} \pm 1.96\, e^{\hat{\beta}}\mathrm{se}[\hat{\beta}]$.

The delta method is not always the best method to obtain a confidence interval,
because it restricts the confidence interval to being symmetric about $\hat{\gamma}$. Moreover, in
the preceding example the confidence interval can include negative values even though
$e^{\beta} > 0$. An alternative confidence interval is obtained by exponentiation of the terms
in the confidence interval for $\beta$. Then

$$\Pr[\hat{\beta} - 1.96\,\mathrm{se}[\hat{\beta}] < \beta < \hat{\beta} + 1.96\,\mathrm{se}[\hat{\beta}]] = 0.95$$
$$\Rightarrow \Pr[\exp(\hat{\beta} - 1.96\,\mathrm{se}[\hat{\beta}]) < e^{\beta} < \exp(\hat{\beta} + 1.96\,\mathrm{se}[\hat{\beta}])] = 0.95.$$

This confidence interval has the advantage of being asymmetric and including only
positive values. This transformation is often used for confidence intervals for slope
parameters in binary outcome models and in duration models. The approach can be
generalized to other transformations $\gamma = h(\theta)$, provided $h(\cdot)$ is monotonic.
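
The standard-error formula (7.18) and the two intervals for $e^{\beta}$ can be sketched in a few lines of Python. This is only an illustration: the numerical values are made up, the helper `delta_se` is a hypothetical name, and the variance matrix passed in plays the role of $N^{-1}\hat{C}$.

```python
import math


def delta_se(grad, var):
    """Delta-method standard error sqrt(r V r') for a scalar function.

    grad: list r of partial derivatives of h evaluated at the estimate.
    var:  estimated variance matrix V of the estimator (list of lists),
          playing the role of N^{-1} C-hat in (7.18).
    """
    q = len(grad)
    quad = sum(grad[j] * var[j][k] * grad[k] for j in range(q) for k in range(q))
    return math.sqrt(quad)


# Example 1: ratio theta1/theta2 (illustrative numbers, not from the text).
theta = [2.0, 4.0]
V = [[0.04, 0.01], [0.01, 0.09]]
grad_ratio = [1.0 / theta[1], -theta[0] / theta[1] ** 2]
se_ratio = delta_se(grad_ratio, V)

# Example 2: exp(beta) for a scalar coefficient (illustrative numbers).
beta_hat, se_beta = -0.2, 1.0
z = 1.96
est = math.exp(beta_hat)
se_exp = delta_se([est], [[se_beta ** 2]])        # equals exp(beta_hat) * se_beta
ci_delta = (est - z * se_exp, est + z * se_exp)    # symmetric, can go negative
ci_transformed = (math.exp(beta_hat - z * se_beta),
                  math.exp(beta_hat + z * se_beta))  # asymmetric, always positive

print("ratio:", theta[0] / theta[1], "se:", se_ratio)
print("delta-method CI for exp(beta):", ci_delta)
print("transformed CI for exp(beta): ", ci_transformed)
```

With a large standard error, as assumed here, the delta-method interval extends below zero while the transformed interval stays positive, which is the contrast drawn in the text.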


7.2.9.  Lack  of  Invariance  of  the  Wald  Test 

The Wald test statistic is easily obtained, provided estimates of the unrestricted model
can be obtained, and is no less powerful than other possible test procedures, as discussed
in later sections. For these reasons it is the most commonly used test procedure.

However, the Wald test has a fundamental problem: It is not invariant to algebraically
equivalent parameterizations of the null hypothesis. For example, consider
the example of Section 7.2.5. Then $H_0: \theta_1/\theta_2 - 1 = 0$ can equivalently be expressed
as $H_0: \theta_1 - \theta_2 = 0$, leading to Wald chi-square test statistic

$$W^* = N(\hat{\theta}_1 - \hat{\theta}_2)^2 (\hat{c}_{11} - 2\hat{c}_{12} + \hat{c}_{22})^{-1}, \tag{7.19}$$

which differs from W in (7.13). The statistics W and W* can differ substantially in
finite samples, even though asymptotically they are equivalent. The small-sample difference
can be quite substantial, as demonstrated in a Monte Carlo exercise by Gregory
and Veall (1985), who considered a very similar example. For tests with nominal size
0.05, one variant of the Wald test had actual size between 0.04 and 0.06 across all simulations,
so asymptotic theory provided a good small-sample approximation, whereas
an alternative asymptotically equivalent variant of the Wald test had actual size that in
some simulations exceeded 0.20.

Phillips and Park (1988) explained the differences by showing that, although different
representations of the null hypothesis restrictions have the same chi-square distribution
using conventional asymptotic methods, they have different asymptotic distributions
using a more refined asymptotic theory based on Edgeworth expansions (see
Section 11.4.3). Furthermore, in particular settings such as the previous example, the
Edgeworth expansions can be used to indicate parameterizations of $H_0$ and regions
of the parameter space where the usual asymptotic theory is likely to provide a poor
small-sample approximation.

The lesson is that care is needed when nonlinear restrictions are being tested. As
a robustness check one can perform several Wald tests using different algebraically
equivalent representations of the null hypothesis restrictions. If these lead to substantially
different conclusions there may be a problem. One solution is to perform a bootstrap
version of the Wald test. This can provide better small-sample performance and
eliminate much of the difference between Wald tests that use different representations
of $H_0$, because from Section 11.4.4 the bootstrap essentially implements an Edgeworth
expansion. A second solution is to use other testing methods, given in the next section,
that are invariant to different representations of $H_0$.
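
The robustness check suggested above can be sketched as follows. This is a hedged illustration: the estimates, the matrix `C`, and the sample size are invented, and the ratio form is computed from the general scalar Wald formula $N h^2 / (r C r')$, which is assumed to correspond to W in (7.13) (not reproduced in this section).

```python
def wald_scalar(h, grad, C, N):
    """Wald statistic N h^2 / (r C r') for one restriction h(theta) = 0.

    grad is r = dh/dtheta' at the estimate; C is the estimated variance
    matrix of sqrt(N)(theta_hat - theta_0).
    """
    q = len(grad)
    rCr = sum(grad[j] * C[j][k] * grad[k] for j in range(q) for k in range(q))
    return N * h ** 2 / rCr


# Illustrative values (assumptions, not from the text).
N = 50
t1, t2 = 1.0, 0.5
C = [[1.0, 0.0], [0.0, 1.0]]

# H0 written as theta1/theta2 - 1 = 0.
W_ratio = wald_scalar(t1 / t2 - 1.0, [1.0 / t2, -t1 / t2 ** 2], C, N)
# H0 written as theta1 - theta2 = 0, as in (7.19).
W_diff = wald_scalar(t1 - t2, [1.0, -1.0], C, N)

print("W  (ratio form):     ", W_ratio)
print("W* (difference form):", W_diff)
```

For these made-up numbers the ratio form gives 2.5 and the difference form gives 6.25, which fall on opposite sides of the 5% $\chi^2(1)$ critical value of 3.84, so the two algebraically equivalent formulations lead to different test decisions.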


7.3.  Likelihood-Based  Tests 

In this section we consider hypothesis testing when the likelihood function is known,
that is, the distribution is fully specified. There are then three classical statistical techniques
for testing hypotheses: the Wald test, the likelihood ratio (LR) test, and the
Lagrange multiplier (LM) test. A fourth test, the $C(\alpha)$ test, due to Neyman (1959), is
less commonly used and is not presented here; see Davidson and MacKinnon (1993).
All four tests are asymptotically equivalent, so one chooses among them based on ease
of computation and on finite-sample performance. We also do not cover the smooth
test of Neyman (1937), which Bera and Ghosh (2002) argue is optimal and is as fundamental
as the other tests.

These  results  assume  correct  specification  of  the  likelihood  function.  Extension  to 
tests  based  on  quasi-ML  estimators,  as  well  as  on  m-estimators  and  efficient  GMM 
estimators,  is  given  in  Section  7.5. 

7.3.1.  Wald,  Likelihood  Ratio,  and  Lagrange  Multiplier  (Score)  Tests 

Let $L(\theta)$ denote the likelihood function, the joint conditional density of $y$ given $X$ and
parameters $\theta$. We wish to test the null hypothesis given in (7.3) that $h(\theta_0) = 0$.

Tests other than the Wald test require estimation that imposes the restrictions of the
null hypothesis. Define the estimators

$\hat{\theta}_u$ (unrestricted MLE),

$\tilde{\theta}_r$ (restricted MLE).

The unrestricted MLE $\hat{\theta}_u$ maximizes $\ln L(\theta)$; it was more simply denoted $\hat{\theta}$ in earlier
discussion of the Wald test. The restricted MLE $\tilde{\theta}_r$ maximizes the Lagrangian
$\ln L(\theta) - \lambda' h(\theta)$, where $\lambda$ is an $h \times 1$ vector of Lagrangian multipliers. In the simple
case of exclusion restrictions $h(\theta) = \theta_2 = 0$, where $\theta = (\theta_1', \theta_2')'$, the restricted MLE
is $\tilde{\theta}_r = (\tilde{\theta}_{1r}', 0')'$, where $\tilde{\theta}_{1r}$ is obtained simply as the maximum with respect to $\theta_1$ of
the restricted likelihood $\ln L(\theta_1, 0)$ and $0$ is an $h \times 1$ vector of zeros, one for each excluded parameter.

We motivate and define the three test statistics here, with derivation deferred to
Section 7.3.3. All three test statistics converge in distribution to $\chi^2(h)$ under $H_0$. So
$H_0$ is rejected at significance level $\alpha$ if the computed test statistic exceeds the critical value $\chi^2_{\alpha}(h)$.
Equivalently, reject $H_0$ at level $\alpha$ if $p < \alpha$, where $p = \Pr[\chi^2(h) > t]$ is the p-value
and $t$ is the computed value of the test statistic.
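
The decision rule just stated can be sketched in Python using only the standard library. The helper names `chi2_sf` and `reject` are hypothetical; the tail probability uses the finite-sum closed forms available for integer degrees of freedom, which is an implementation choice made here rather than anything prescribed by the text.

```python
import math


def chi2_sf(t, df):
    """Upper-tail probability Pr[chi2(df) > t] for integer df >= 1."""
    if t <= 0:
        return 1.0
    half = t / 2.0
    if df % 2 == 0:
        # Even df: exp(-t/2) * sum_{k=0}^{df/2-1} (t/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite sum in odd powers of sqrt(t).
    root = math.sqrt(t)
    total, term = 0.0, 1.0 / root
    for k in range(1, (df - 1) // 2 + 1):
        term *= t / (2 * k - 1)
        total += term
    return (math.erfc(root / math.sqrt(2.0))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)


def reject(stat, df, alpha=0.05):
    """Return (p_value, decision) for a statistic that is chi2(df) under H0."""
    p = chi2_sf(stat, df)
    return p, p < alpha


p, rej = reject(6.25, df=1)
print(f"p-value = {p:.4f}, reject H0 at 5%: {rej}")
```

The same function applies to any of the Wald, LR, or LM statistics, since all three share the $\chi^2(h)$ limit distribution under $H_0$.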

Likelihood  Ratio  Test 

The motivation for the LR test statistic is that if $H_0$ is true, the unconstrained and
constrained maxima of the log-likelihood function should be the same. This suggests
using a function of the difference between $\ln L(\hat{\theta}_u)$ and $\ln L(\tilde{\theta}_r)$.

Implementation requires obtaining the limit distribution of this difference. It can be
shown that twice the difference is asymptotically chi-square distributed under $H_0$. This
leads immediately to the likelihood ratio test statistic

$$\mathrm{LR} = -2\left[\ln L(\tilde{\theta}_r) - \ln L(\hat{\theta}_u)\right]. \tag{7.21}$$


Wald  Test 


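The motivation for the Wald test is that if $H_0$ is true, the unrestricted MLE $\hat{\theta}_u$ should
satisfy the restrictions of $H_0$, so $h(\hat{\theta}_u)$ should be close to zero.

Implementation requires obtaining the asymptotic distribution of $h(\hat{\theta}_u)$. The general
form of the Wald test is given in (7.6). Specialization occurs for the MLE because by
the IM equality $V[\hat{\theta}_u] = -N^{-1}A_0^{-1}$, where

$$A_0 = \operatorname{plim} N^{-1} \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\theta_0}. \tag{7.22}$$

This leads to the Wald test statistic

$$W = -N \hat{h}' \left[\hat{R}\hat{A}^{-1}\hat{R}'\right]^{-1} \hat{h}, \tag{7.23}$$

where $\hat{h} = h(\hat{\theta}_u)$, $\hat{R} = R(\hat{\theta}_u)$, $R(\theta) = \partial h(\theta)/\partial\theta'$, and $\hat{A}$ is a consistent estimate of $A_0$.
The minus sign appears since $A_0$ is negative definite.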


Lagrange  Multiplier  Test  or  Score  Test 

One motivation for the LM test statistic is that the gradient $\partial \ln L/\partial\theta|_{\hat{\theta}_u} = 0$ at the
maximum of the likelihood function. If $H_0$ is true, then this maximum should also
occur at the restricted MLE (i.e., $\partial \ln L/\partial\theta|_{\tilde{\theta}_r} \simeq 0$) because imposing the constraint
will have little impact on the estimated value of $\theta$. Using this motivation LM is called
the score test because $\partial \ln L/\partial\theta$ is the score vector.

An alternative motivation is to measure the closeness to zero of the Lagrange multipliers
of the constrained optimization problem for the restricted MLE. Maximizing
$\ln L(\theta) - \lambda' h(\theta)$ with respect to $\theta$ implies that

$$\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde{\theta}_r} = \left.\frac{\partial h(\theta)'}{\partial\theta}\right|_{\tilde{\theta}_r} \tilde{\lambda}_r. \tag{7.24}$$

It follows that tests based on the estimated Lagrange multipliers $\tilde{\lambda}_r$ are equivalent to
tests based on the score $\partial \ln L/\partial\theta|_{\tilde{\theta}_r}$, since $\partial h/\partial\theta'$ is assumed to be of full rank.

Implementation requires obtaining the asymptotic distribution of $\partial \ln L/\partial\theta|_{\tilde{\theta}_r}$. This
leads to the Lagrange multiplier test or score test statistic

$$\mathrm{LM} = -N^{-1} \left.\frac{\partial \ln L}{\partial\theta'}\right|_{\tilde{\theta}_r} \tilde{A}^{-1} \left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde{\theta}_r}, \tag{7.25}$$

where $\tilde{A}$ is a consistent estimate of $A_0$ in (7.22) evaluated at $\tilde{\theta}_r$ rather than $\hat{\theta}_u$.

The LM test, due to Aitchison and Silvey (1958) and Silvey (1959), is equivalent to
the score test, due to Rao (1947). The test statistic LM is usually derived by obtaining
an analytical expression for the score rather than the Lagrange multipliers. Econometricians
usually call the test an LM test, even though a clearer terminology is to call it
a score test.


Discussion 

Good  intuition  is  provided  by  the  expository  graphical  treatment  of  the  three  tests  by 
Buse  (1982)  that  views  all  three  tests  as  measuring  the  change  in  the  log-likelihood. 
Here  we  provide  a  verbal  summary. 

Consider scalar parameter $\theta$ and a Wald test of whether $\theta_0 - \theta^* = 0$. Then a given
departure of $\hat{\theta}_u$ from $\theta^*$ will translate into a larger change in $\ln L$, the more curved
is the log-likelihood function. A natural measure of curvature is the second derivative
$H(\theta) = \partial^2 \ln L/\partial\theta^2$. This suggests $W = -(\hat{\theta}_u - \theta^*)^2 H(\hat{\theta}_u)$. The statistic W in (7.23)
can be viewed as a generalization to vector $\theta$ and more general restrictions $h(\theta_0)$ with
$N\hat{A}$ measuring the curvature.

For the score test Buse shows that a given value of $\partial \ln L/\partial\theta|_{\tilde{\theta}_r}$ translates into a
larger change in $\ln L$, the less curved is the log-likelihood function. This leads to use
of $(N\tilde{A})^{-1}$ in (7.25). And the statistic LR directly compares the log-likelihoods.


An  Illustration 

To illustrate the three tests consider an iid example with $y_i \sim \mathcal{N}[\mu_0, 1]$ and test of
$H_0: \mu_0 = \mu^*$. Then $\hat{\mu}_u = \bar{y}$ and $\tilde{\mu}_r = \mu^*$.

For the LR test, $\ln L(\mu) = -\frac{N}{2}\ln 2\pi - \frac{1}{2}\sum_i (y_i - \mu)^2$ and some algebra yields

$$\mathrm{LR} = 2[\ln L(\bar{y}) - \ln L(\mu^*)] = N(\bar{y} - \mu^*)^2.$$

The Wald test is based on whether $\bar{y} - \mu^* \simeq 0$. Here it is easy to show that
$\bar{y} - \mu^* \sim \mathcal{N}[0, 1/N]$ under $H_0$, leading to the quadratic form

$$W = (\bar{y} - \mu^*)\left[1/N\right]^{-1}(\bar{y} - \mu^*).$$

This simplifies to $N(\bar{y} - \mu^*)^2$ and so here W = LR.

The LM test is based on closeness to zero of $\partial \ln L(\mu)/\partial\mu|_{\mu^*} = \sum_i (y_i - \mu)|_{\mu^*} =
N(\bar{y} - \mu^*)$. This is just a rescaling of $(\bar{y} - \mu^*)$ so LM = W. More formally, $\tilde{A}(\mu^*) = -1$
since $\partial^2 \ln L(\mu)/\partial\mu^2 = -N$ and (7.25) yields

$$\mathrm{LM} = N^{-1}(N(\bar{y} - \mu^*))[1]^{-1}(N(\bar{y} - \mu^*)).$$

This also simplifies to $N(\bar{y} - \mu^*)^2$ and verifies that LM = W = LR.

Despite their quite different motivations, the three test statistics are equivalent here.
This exact equivalence is special to this example with constant curvature owing to a
log-likelihood quadratic in $\mu$. More generally the three test statistics differ in finite
samples but are equivalent asymptotically (see Section 7.3.4).
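
The equivalence in this illustration can be verified numerically. The sketch below uses made-up data and computes each statistic from its own definition, (7.21), the quadratic form for W, and (7.25), rather than from the common simplified expression.

```python
import math


def loglik(mu, y):
    """Log-likelihood for y_i iid N[mu, 1]."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((yi - mu) ** 2 for yi in y)


def three_tests(y, mu_star):
    n = len(y)
    ybar = sum(y) / n
    # LR: twice the difference in maximized log-likelihoods.
    lr = 2.0 * (loglik(ybar, y) - loglik(mu_star, y))
    # Wald: quadratic form using Var[ybar] = 1/N.
    wald = (ybar - mu_star) * (1.0 / (1.0 / n)) * (ybar - mu_star)
    # LM: score at mu*, with A-tilde = N^{-1} d2 lnL / d mu2 = -1.
    score = sum(yi - mu_star for yi in y)
    a_tilde = -1.0
    lm = -(1.0 / n) * score * (1.0 / a_tilde) * score
    return lr, wald, lm


y = [0.3, 1.2, -0.4, 0.9, 1.6, 0.1, 0.8, 1.1]   # illustrative data
lr, wald, lm = three_tests(y, mu_star=0.0)
print(lr, wald, lm)
```

All three values agree with $N(\bar{y} - \mu^*)^2$ up to floating-point rounding.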


7.3.2.  Poisson  Regression  Example 

Consider  testing  exclusion  restrictions  in  the  Poisson  regression  model  introduced  in 
Section  5.2.  This  example  is  mainly  pedagogical  as  in  practice  one  should  perform 
statistical  inference  for  count  data  under  weaker  distributional  assumptions  than  those 
of  the  Poisson  model  (see  Chapter  20). 

If $y$ given $x$ is Poisson distributed with conditional mean $\exp(x'\beta)$ then the
log-likelihood function is

$$\ln L(\beta) = \sum_{i=1}^{N} \left\{-\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!\right\}. \tag{7.26}$$

For $h$ exclusion restrictions the null hypothesis is $H_0: h(\beta) = \beta_2 = 0$, where
$\beta = (\beta_1', \beta_2')'$.

The unrestricted MLE $\hat{\beta}$ maximizes (7.26) with respect to $\beta$ and has first-order
conditions $\sum_i (y_i - \exp(x_i'\hat{\beta}))x_i = 0$. The limit variance matrix is $-A^{-1}$, where

$$A = -\operatorname{plim} N^{-1} \sum_i \exp(x_i'\beta)x_i x_i'.$$

The restricted MLE is $\tilde{\beta} = (\tilde{\beta}_1', 0')'$, where $\tilde{\beta}_1$ maximizes (7.26) with respect to $\beta_1$,
with $x_i'\beta$ replaced by $x_{1i}'\beta_1$ since $\beta_2 = 0$. Thus $\tilde{\beta}_1$ solves the first-order conditions
$\sum_i (y_i - \exp(x_{1i}'\tilde{\beta}_1))x_{1i} = 0$.

The LR test statistic (7.21) is easily calculated from the fitted log-likelihoods of the
restricted and unrestricted models.

The Wald test statistic for exclusion restrictions from Section 7.2.5 is
$W = -N\hat{\beta}_2'(\hat{A}^{22})^{-1}\hat{\beta}_2$, where $\hat{A}^{22}$ is the (2,2) block of $\hat{A}^{-1}$ and $\hat{A} = -N^{-1}\sum_i \exp(x_i'\hat{\beta})x_i x_i'$.

The LM test is based on $\partial \ln L(\beta)/\partial\beta = \sum_i x_i(y_i - \exp(x_i'\beta))$. At the restricted
MLE this equals $\sum_i x_i\tilde{u}_i$, where $\tilde{u}_i = y_i - \exp(x_{1i}'\tilde{\beta}_1)$ is the residual from estimation
of the restricted model. The LM test statistic (7.25) is

$$\mathrm{LM} = \left[\sum_{i=1}^{N} x_i\tilde{u}_i\right]' \left[\sum_{i=1}^{N} \exp(x_{1i}'\tilde{\beta}_1)x_i x_i'\right]^{-1} \left[\sum_{i=1}^{N} x_i\tilde{u}_i\right]. \tag{7.27}$$

Some further simplification is possible since $\sum_i x_{1i}\tilde{u}_i = 0$ from the first-order conditions
for the restricted MLE given earlier. The LM test here is based on the correlation
between the omitted regressors and the residual, a result that is extended to other examples
in Section 7.3.5.

In general it can be difficult to obtain an algebraic expression for the LM test. For
standard applications of the LM test this has been done and is incorporated into computer
packages. Computation by auxiliary regression may also be possible (see Section 3.5).
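
The three statistics for this example can be sketched for the simplest case of an intercept plus one regressor whose exclusion is tested. The data below are invented for illustration, the function names are hypothetical, and the unrestricted MLE is obtained by a plain Newton-Raphson iteration, which is one possible choice rather than a method prescribed by the text.

```python
import math

# Illustrative data (not from the text): intercept plus one regressor x.
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [1, 0, 2, 1, 3, 2, 4, 6]
n = len(y)


def loglik(b1, b2):
    """Poisson log-likelihood (7.26) with conditional mean exp(b1 + b2*x)."""
    return sum(-math.exp(b1 + b2 * xi) + yi * (b1 + b2 * xi) - math.lgamma(yi + 1)
               for xi, yi in zip(x, y))


def score_and_info(b1, b2):
    """Score sum_i x_i (y_i - mu_i) and information sum_i mu_i x_i x_i'."""
    mu = [math.exp(b1 + b2 * xi) for xi in x]
    s = (sum(yi - mi for yi, mi in zip(y, mu)),
         sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu)))
    info = (sum(mu),
            sum(xi * mi for xi, mi in zip(x, mu)),
            sum(xi * xi * mi for xi, mi in zip(x, mu)))
    return s, info


def inv22(info):
    """(2,2) element of the inverse of the symmetric 2x2 information matrix."""
    i11, i12, i22 = info
    return i11 / (i11 * i22 - i12 ** 2)


# Restricted MLE (b2 = 0): the first-order condition gives exp(b1) = ybar.
b1_r = math.log(sum(y) / n)

# Unrestricted MLE by Newton-Raphson, started at the restricted estimate.
b1, b2 = b1_r, 0.0
for _ in range(100):
    (s1, s2), (i11, i12, i22) = score_and_info(b1, b2)
    det = i11 * i22 - i12 ** 2
    d1 = (i22 * s1 - i12 * s2) / det
    d2 = (-i12 * s1 + i11 * s2) / det
    b1, b2 = b1 + d1, b2 + d2
    if abs(d1) + abs(d2) < 1e-12:
        break

LR = 2.0 * (loglik(b1, b2) - loglik(b1_r, 0.0))       # (7.21)
W = b2 ** 2 / inv22(score_and_info(b1, b2)[1])        # Wald for beta_2 = 0
s_r, info_r = score_and_info(b1_r, 0.0)
LM = s_r[1] ** 2 * inv22(info_r)                      # (7.27); s_r[0] is zero

print("LR =", LR, " W =", W, " LM =", LM)
```

Because the first element of the restricted score is zero, (7.27) collapses here to the squared sum of $x_i\tilde{u}_i$ scaled by one element of the inverse information matrix, which makes the "correlation between the omitted regressor and the residual" interpretation explicit.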

7.3.3.  Derivation  of  Tests 

The  distribution  of  the  Wald  test  was  formally  derived  in  Section  7.2.4.  Proofs  for  the 
likelihood  ratio  and  Lagrange  multiplier  tests  are  more  complicated  and  we  merely 
sketch  them  here. 

Likelihood  Ratio  Test 

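For simplicity consider the special case where the null hypothesis is $\theta = \bar{\theta}$, so that
there is no estimation error in $\tilde{\theta}_r = \bar{\theta}$. Taking a second-order Taylor series expansion
of $\ln L(\bar{\theta})$ about $\hat{\theta}_u$ yields

$$\ln L(\bar{\theta}) = \ln L(\hat{\theta}_u) + \left.\frac{\partial \ln L}{\partial\theta'}\right|_{\hat{\theta}_u}(\bar{\theta} - \hat{\theta}_u) + \frac{1}{2}(\bar{\theta} - \hat{\theta}_u)' \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\hat{\theta}_u}(\bar{\theta} - \hat{\theta}_u) + R,$$

where $R$ is a remainder term. Since $\partial \ln L/\partial\theta|_{\hat{\theta}_u} = 0$ by the first-order conditions, this
implies upon rearrangement that

$$-2\left[\ln L(\bar{\theta}) - \ln L(\hat{\theta}_u)\right] = -(\bar{\theta} - \hat{\theta}_u)' \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\hat{\theta}_u}(\bar{\theta} - \hat{\theta}_u) + R. \tag{7.28}$$

The right-hand side of (7.28) is $\chi^2(h)$ under $H_0: \theta = \bar{\theta}$ since by standard results
$\sqrt{N}(\hat{\theta}_u - \bar{\theta}) \xrightarrow{d} \mathcal{N}\left[0, -[\operatorname{plim} N^{-1}\partial^2 \ln L/\partial\theta\,\partial\theta']^{-1}\right]$. For derivation of the limit distribution
of LR in the general case see, for example, Amemiya (1985, p. 143).

A reason for preferring LR is that by the Neyman-Pearson (1933) lemma the uniformly
most powerful test for testing a simple null hypothesis versus simple alternative
hypothesis is a function of the likelihood ratio $L(\tilde{\theta}_r)/L(\hat{\theta}_u)$, though not necessarily the
specific function $-2\ln(L(\tilde{\theta}_r)/L(\hat{\theta}_u))$ that equals LR given in (7.21) and gives the test
statistic its name.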

LM  or  Score  Test 

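By a first-order Taylor series expansion

$$\frac{1}{\sqrt{N}}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde{\theta}_r} = \frac{1}{\sqrt{N}}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\theta_0} + \frac{1}{N}\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\sqrt{N}(\tilde{\theta}_r - \theta_0),$$

and both terms in the right-hand side contribute to the limit distribution. Then the
$\chi^2(h)$ distribution of LM defined in (7.25) follows since it can be shown that

$$R_0 A_0^{-1} \frac{1}{\sqrt{N}}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde{\theta}_r} \xrightarrow{d} \mathcal{N}\left[0, R_0 A_0^{-1} B_0 A_0^{-1} R_0'\right], \tag{7.29}$$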




where details are provided in Wooldridge (2002, p. 365), for example, and $R_0$ and $A_0$
are defined in (7.4) and (7.22) and

$$B_0 = \operatorname{plim} N^{-1} \left.\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right|_{\theta_0}. \tag{7.30}$$

Result (7.29) leads to a chi-square statistic that is much more complicated
than (7.25), but simplification to (7.25) then occurs by the information matrix
equality.


7.3.4.  Which  Test? 

Choice of test procedure is usually made based on existence of robust versions,
finite-sample performance, and ease of computation.


Asymptotic  Equivalence 

All three test statistics are asymptotically distributed as $\chi^2(h)$ under $H_0$. Furthermore,
all three can be shown to be noncentral $\chi^2(h; \lambda)$ distributed with the same
noncentrality parameter under local alternatives. Details are provided for the Wald
test in Section 7.6.3. So the tests all have the same asymptotic power against local
alternatives.

The finite-sample distributions of the three statistics differ. In the linear regression
model with normality, a variant of the Wald test statistic for $h$ linear restrictions on
$\theta$ exactly equals the $F(h, N - K)$ statistic (see Section 7.2.1) whereas no analytical
results exist for the LR and LM statistics. More generally, in nonlinear models exact
small-sample results do not exist.

In some cases an ordering of the values taken by the three test statistics can be
obtained. In particular for tests of linear restrictions in the linear regression model
under normality, Berndt and Savin (1977) showed that Wald $\ge$ LR $\ge$ LM. This result
is of little theoretical consequence, as the test least likely to reject under the null will
have the smallest actual size but also the smallest power. However, it is of practical
consequence for the linear model, as it means when testing at fixed nominal size $\alpha$
that the Wald test will always reject $H_0$ more often than the LR, which in turn will
reject more often than the LM test. The Wald test would be preferred by a researcher
determined to reject $H_0$. This result is restricted to linear models.
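
The ordering can be illustrated with the familiar closed forms of the three statistics for linear restrictions in the normal linear model, written in terms of the restricted and unrestricted residual sums of squares. These expressions are standard results assumed here (with the error variance estimated by maximum likelihood); they are not stated in the text, and the numbers are invented.

```python
import math


def linear_model_tests(rss_r, rss_u, n):
    """Wald, LR, and LM statistics from residual sums of squares.

    rss_r: residual sum of squares of the restricted regression.
    rss_u: residual sum of squares of the unrestricted regression.
    """
    wald = n * (rss_r - rss_u) / rss_u
    lr = n * math.log(rss_r / rss_u)
    lm = n * (rss_r - rss_u) / rss_r
    return wald, lr, lm


wald, lr, lm = linear_model_tests(rss_r=120.0, rss_u=100.0, n=50)
print(f"Wald = {wald:.3f}, LR = {lr:.3f}, LM = {lm:.3f}")
```

Writing $a = (\mathrm{RSS}_r - \mathrm{RSS}_u)/\mathrm{RSS}_u \ge 0$, the three statistics are $Na$, $N\ln(1 + a)$, and $Na/(1 + a)$, so the ordering follows from the elementary inequality $a \ge \ln(1 + a) \ge a/(1 + a)$.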


Invariance  to  Reparameterization 

The Wald test is not invariant to algebraically equivalent parameterizations of the null
hypothesis (see Section 7.2.9) whereas the LR test is invariant. Some but not all versions
of the LM test are invariant. The LM test is generally invariant if the expected
Hessian (see Section 5.5.2) is used to estimate $A_0$ and not invariant if the Hessian is
used. The test LM* defined later in (7.34) is invariant. The lack of invariance for the
Wald test is a major weakness.


Robust Versions

In some cases with misspecified density the quasi-MLE (see Section 5.7) remains consistent.
The Wald test is then easily robustified (see Section 7.2). The LM test can be
robustified with more difficulty; see (7.38) in Section 7.5.1 for a general result for
m-estimators and Section 8.4 for some robust LM test examples. The LR test is no longer
chi-square distributed, except in a special case given later in (7.39). Instead, the LR
test is a mixture of chi-squares (see Section 8.5.3).


Convenience 

Convenience in computation is also a consideration. LR requires estimation of the
model twice, once with and once without the restrictions of the null hypothesis. If
done by a package, it is easily implemented as one need only read off the routinely
printed log-likelihoods, subtract, and multiply by 2. Wald requires estimation
only under $H_a$ and is best to use when the unrestricted model is easy to estimate. For
example, this is the case for restrictions on the parameters of the conditional mean
in nonlinear models such as NLS, probit, Tobit, and logit. The LM statistic requires
estimation only under $H_0$ and is best to use when the restricted model is easy to estimate.
Examples are tests for autocorrelation and heteroskedasticity, where it is easiest
to estimate the null hypothesis model that does not have these complications.

The Wald test is often used for tests of statistical significance whereas the LM test
is often used for tests of correct model specification.


7.3.5.  Interpretation  and  Computation  of  the  LM  test 


Lagrange  multiplier  tests  have  the  additional  advantages  of  simple  interpretation  in 
some  leading  examples  and  computation  by  auxiliary  regression. 

In this section attention is restricted to the usual cross-section data case of a scalar
dependent variable independent over $i$, so that $\partial \ln L(\theta)/\partial\theta = \sum_i s_i(\theta)$, where

$$s_i(\theta) = \frac{\partial \ln f(y_i \mid x_i, \theta)}{\partial\theta} \tag{7.31}$$

is the contribution of the $i$th observation to the score vector of the unrestricted model.
From (7.25) the LM test is a test of the closeness to zero of $\sum_i s_i(\tilde{\theta}_r)$.


Simple  Interpretation  of  the  LM  Test 

Suppose that the density is such that $s(\theta)$ factorizes as

$$s(\theta) = g(x, \theta)\, r(y, x, \theta) \tag{7.32}$$

for some $q \times 1$ vector function $g(\cdot)$ and scalar function $r(y, x, \theta)$, the latter of which
may be interpreted as a generalized residual because $y$ appears in $r(\cdot)$ but not $g(\cdot)$. For
example, for Poisson regression $\partial \ln f/\partial\theta = x(y - \exp(x'\beta))$.

Given (7.32) and independence over $i$, $\partial \ln L/\partial\theta|_{\tilde{\theta}_r} = \sum_i \tilde{g}_i\tilde{r}_i$, where
$\tilde{g}_i = g(x_i, \tilde{\theta}_r)$ and $\tilde{r}_i = r(y_i, x_i, \tilde{\theta}_r)$. The LM test can therefore be simply interpreted as
a score test of the correlation between $\tilde{g}_i$ and the residual $\tilde{r}_i$. This interpretation was
given in Section 7.3.2 for the LM test with Poisson regression, where $\tilde{g}_i = x_i$ and
$\tilde{r}_i = y_i - \exp(x_{1i}'\tilde{\beta}_1)$.

The partition (7.32) will arise whenever $f(y)$ is based on a one-parameter density.
In particular, many common likelihood models are based on one-parameter LEF
densities, with parameter $\mu$ then modeled as a function of $x$ and $\beta$. In the LEF case
$r(y, x, \theta) = (y - E[y|x])$ (see Section 5.7.3), so the generalized residual $r(\cdot)$ in (7.32)
is then the usual residual.

More generally a partition similar to (7.32) will also arise when $f(y)$ is based on a
two-parameter density, the information matrix is block diagonal in the two parameters,
and the two parameters in turn depend on regressors and parameter vectors $\beta$ and $\alpha$
that are distinct. Then LM tests on $\beta$ are tests of correlation of $\tilde{g}_{\beta i}$ and $\tilde{r}_{\beta i}$, where
$s(\beta) = g_{\beta}(x, \theta)\, r_{\beta}(y, x, \theta)$, with similar interpretation for LM tests on $\alpha$.

A leading example is linear regression under normality with two parameters $\mu$ and
$\sigma^2$ modeled as $\mu = x'\beta$ and $\sigma^2 = \alpha$ or $\sigma^2 = \sigma^2(z, \alpha)$. For exclusion restrictions in linear
regression under normality, $s_i(\beta) = x_i(y_i - x_i'\beta)$ and the LM test is one of correlation
between regressors $x_i$ and the restricted model residual $\tilde{u}_i = y_i - x_{1i}'\tilde{\beta}_1$. For tests
of heteroskedasticity with $\sigma_i^2 = \exp(\alpha_1 + z_i'\alpha_2)$, $s_i(\alpha) = \frac{1}{2}z_i\left(((y_i - x_i'\beta)^2/\sigma_i^2) - 1\right)$,
and the LM test is one of correlation between $z_i$ and the squared residual
$\tilde{u}_i^2 = (y_i - x_i'\tilde{\beta})^2$, since $\sigma_i^2$ is constant under the null hypothesis that $\alpha_2 = 0$.


Outer  Product  of  the  Gradient  Versions  of  the  LM  Test 

Now return to the general $s_i(\theta)$ defined in (7.31). We show in the following that an
asymptotically equivalent version of the LM test statistic (7.25) can be obtained by
running the auxiliary regression or artificial regression

$$1 = \tilde{s}_i'\gamma + v_i, \tag{7.33}$$

where $\tilde{s}_i = s_i(\tilde{\theta}_r)$, and computing

$$\mathrm{LM}^* = N R_u^2, \tag{7.34}$$

where $R_u^2$ is the uncentered $R^2$ defined after (7.36). LM* is asymptotically $\chi^2(h)$ under
$H_0$. Equivalently, LM* equals $\mathrm{ESS}_u$, the uncentered explained sum of squares (the sum
of squares of the fitted values), or equals $N - \mathrm{RSS}$, where RSS is the residual sum of
squares, from regression (7.33).

This result can be easy to implement as in many applications it can be quite simple
to analytically obtain $s_i(\theta)$, generate data for the $q$ components $\tilde{s}_{1i}, \ldots, \tilde{s}_{qi}$, and regress
1 on $\tilde{s}_{1i}, \ldots, \tilde{s}_{qi}$. Note that here $f(y_i|x_i, \theta)$ in (7.31) is the density of the unrestricted
model.

For the exclusion restrictions in the Poisson model example in Section 7.3.2,
$s_i(\beta) = (y_i - \exp(x_i'\beta))x_i$ and $x_i'\tilde{\beta}_r = x_{1i}'\tilde{\beta}_{1r}$. It follows that LM* can be computed
as $N R_u^2$ from regressing 1 on $(y_i - \exp(x_{1i}'\tilde{\beta}_{1r}))x_i$, where $x_i$ contains both $x_{1i}$ and $x_{2i}$
and $\tilde{\beta}_{1r}$ is obtained from Poisson regression of $y_i$ on $x_{1i}$ alone.
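
The auxiliary regression for this Poisson case can be sketched directly. The example below uses invented data with an intercept as $x_{1i}$ and a single tested regressor as $x_{2i}$, so that the restricted fitted mean is simply the sample mean of $y$; the variable names are illustrative. It computes LM* as the uncentered explained sum of squares from regressing 1 on the two score components and compares it with $N - \mathrm{RSS}$.

```python
# Illustrative data (not from the text); x1 is the intercept, x2 the tested regressor.
x2 = [0, 0, 1, 1, 2, 2, 3, 3]
y = [1, 0, 2, 1, 3, 2, 4, 6]
n = len(y)

# Restricted Poisson MLE with intercept only: fitted mean is ybar.
mu_r = sum(y) / n

# Rows of S: restricted-model residual times the full regressor vector (1, x2).
S = [((yi - mu_r) * 1.0, (yi - mu_r) * xi) for xi, yi in zip(x2, y)]

# Elements of S'S and S'1 for the regression of 1 on the two score components.
a = sum(s1 * s1 for s1, _ in S)
b = sum(s1 * s2 for s1, s2 in S)
c = sum(s2 * s2 for _, s2 in S)
g1 = sum(s1 for s1, _ in S)
g2 = sum(s2 for _, s2 in S)

# OLS coefficients gamma = (S'S)^{-1} S'1 via the 2x2 inverse.
det = a * c - b * b
gam1 = (c * g1 - b * g2) / det
gam2 = (-b * g1 + a * g2) / det

fitted = [gam1 * s1 + gam2 * s2 for s1, s2 in S]
ess_u = sum(f * f for f in fitted)           # uncentered explained sum of squares
rss = sum((1.0 - f) ** 2 for f in fitted)    # residual sum of squares
lm_star = ess_u                               # equals N * uncentered R^2, since 1'1 = N

print("LM* =", lm_star, " N - RSS =", n - rss)
```

In a sample this small the OPG value need not be close to the statistic based on the expected Hessian, which is consistent with the small-sample caution about OPG variants.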

Equations (7.33) and (7.34) require only independence over $i$. Other auxiliary regressions are possible if further structure is assumed. In particular, specialize to cases where $\mathbf{s}(\theta)$ factorizes as in (7.32), and define $r(y, \mathbf{x}, \theta)$ so that $\mathrm{V}[r(y, \mathbf{x}, \theta)] = 1$. Then an alternative asymptotically equivalent version of the LM test is $N R_u^2$ from regression of $\tilde r_i$ on $\tilde{\mathbf{g}}_i$. This includes LM tests for linear regression under normality, such as the Breusch-Pagan LM test for heteroskedasticity.

These alternative versions of the LM test are called outer-product-of-the-gradient versions of the LM test, as they replace $-\mathbf{A}_0$ in (7.22) by an outer-product-of-the-gradient (OPG) estimate or BHHH estimate of $\mathbf{B}_0$. Although they are easily computed, OPG variants of LM tests can have poor small-sample properties with large size distortions. This has discouraged use of the OPG form of the LM test. These small-sample problems can be greatly reduced by bootstrapping (see Section 11.6.3). Davidson and MacKinnon (1984) propose double-length auxiliary regressions that also perform better in finite samples.


Derivation  of  the  OPG  Version 


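To derive LM*, first note that in (7.25), $\partial \ln L(\theta)/\partial\theta|_{\tilde\theta_r} = \sum_i \tilde{\mathbf{s}}_i$. Second, by the information matrix equality $\mathbf{A}_0 = -\mathbf{B}_0$ and, from Section 5.5.2, $\mathbf{B}_0$ can be consistently estimated under $H_0$ by the OPG estimate or BHHH estimate $N^{-1}\sum_i \tilde{\mathbf{s}}_i\tilde{\mathbf{s}}_i'$. Combining these results gives an asymptotically equivalent version of the LM test statistic (7.25):

$$\mathrm{LM}^* = \Big[\sum_i \tilde{\mathbf{s}}_i\Big]'\Big[\sum_i \tilde{\mathbf{s}}_i\tilde{\mathbf{s}}_i'\Big]^{-1}\Big[\sum_i \tilde{\mathbf{s}}_i\Big]. \tag{7.35}$$

This statistic can be computed from an auxiliary regression of 1 on $\tilde{\mathbf{s}}_i$ as follows. Define $\mathbf{S}$ to be the $N \times q$ matrix with $i$th row $\tilde{\mathbf{s}}_i'$, and define $\mathbf{1}$ to be the $N \times 1$ vector of ones. Then

$$\mathrm{LM}^* = \mathbf{1}'\mathbf{S}[\mathbf{S}'\mathbf{S}]^{-1}\mathbf{S}'\mathbf{1} = \mathrm{ESS}_u = N R_u^2. \tag{7.36}$$

In general, for regression of $\mathbf{y}$ on $\mathbf{X}$ the uncentered explained sum of squares ($\mathrm{ESS}_u$) is $\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, which is exactly of the form (7.36), whereas the uncentered $R^2$ is $R_u^2 = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}/\mathbf{y}'\mathbf{y}$, which here is (7.36) divided by $\mathbf{1}'\mathbf{1} = N$. The term uncentered is used because in $R_u^2$ division is by the sum of squared deviations of $y$ around zero rather than around the sample mean.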


7.4.  Example:  Likelihood-Based  Hypothesis  Tests 

The various test procedures - Wald, LR, and LM - are illustrated using generated data from the dgp $y|\mathbf{x}$ Poisson distributed with mean $\exp(\beta_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)$, where $\beta_1 = 0$ and $\beta_2 = \beta_3 = \beta_4 = 0.1$ and the three regressors are iid draws from $\mathcal{N}[0, 1]$.

Table 7.1. Test Statistics for Poisson Regression Example (a)

                                                  Test Statistic
Null Hypothesis                         Wald      LR        LM        LM*       ln L        Result at level 0.05

$H_{10}: \beta_3 = 0$                   5.904     5.754     5.916     6.218     -241.648    Reject
                                        (0.015)   (0.016)   (0.015)   (0.013)
$H_{20}: \beta_3 = 0, \beta_4 = 0$      8.570     8.302     8.575     9.186     -242.922    Reject
                                        (0.014)   (0.016)   (0.014)   (0.010)
$H_{30}: \beta_3 - \beta_4 = 0$         0.293     0.293     0.293     0.315     -238.918    Do not reject
                                        (0.588)   (0.589)   (0.588)   (0.575)
$H_{40}: \beta_3/\beta_4 - 1 = 0$       0.158     0.293     0.293     0.315     -238.918    Do not reject
                                        (0.691)   (0.589)   (0.588)   (0.575)

(a) The dgp for $y$ is the Poisson distribution with parameter $\exp(0.0 + 0.1x_2 + 0.1x_3 + 0.1x_4)$ and sample size $N = 200$. Test statistics are given with associated p-values in parentheses. Tests of the second hypothesis are $\chi^2(2)$ and the other tests are $\chi^2(1)$ distributed. Log-likelihoods for restricted ML estimation are also given; the log-likelihood in the unrestricted model is -238.772.


Poisson regression of $y$ on an intercept, $x_2$, $x_3$, and $x_4$ for a generated sample of size 200 yielded unrestricted MLE

$$\widehat{\mathrm{E}}[y|\mathbf{x}] = \exp\big(\underset{(-2.14)}{-0.165} - \underset{(-0.36)}{0.028\,x_2} + \underset{(2.43)}{0.163\,x_3} + \underset{(0.08)}{0.103\,x_4}\big),$$

where associated t-statistics are given in parentheses and the unrestricted log-likelihood is -238.772.

The analysis tests four different hypotheses, detailed in the first column of Table 7.1. The estimator is nonlinear, whereas the hypotheses are examples of, respectively, single exclusion restriction, multiple exclusion restriction, linear restrictions, and nonlinear restrictions. The remainder of the table gives four asymptotically equivalent test statistics of these hypotheses and their associated p-values. For this sample all tests reject the first two hypotheses and do not reject the remaining two, at significance level 0.05.

The Wald test statistic is computed using (7.23). This requires estimation of the unrestricted model, given previously, to obtain the variance matrix estimate of the unrestricted MLE. Wald tests of different hypotheses then require computation of different $\hat{\mathbf{h}}$ and $\hat{\mathbf{R}}$ and simplify in some cases. The Wald chi-square test of the single exclusion restriction is just the square of the usual t-test, with $2.43^2 \simeq 5.90$. The Wald test statistic of the joint exclusion restrictions is detailed in Section 7.2.5. Here $x_3$ is statistically significant and $x_4$ is statistically insignificant, whereas jointly $x_3$ and $x_4$ are statistically significant at level 0.05. The Wald test for the third hypothesis is given in (7.19) and leads to nonrejection. The third and fourth hypotheses are equivalent, since $\beta_3/\beta_4 - 1 = 0$ implies $\beta_3 = \beta_4$, but the Wald test statistic for the fourth hypothesis, given in (7.13), differs from (7.19). The statistic (7.13) was calculated using matrix operations, as most packages will at best calculate Wald tests of linear hypotheses.
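
The lack of invariance of the Wald statistic to the way a restriction is written can be seen in a short calculation. The sketch below uses the reported point estimates of $\beta_3$ and $\beta_4$, but the variance matrix entries are hypothetical placeholders, since the estimated variance matrix is not reported here; the resulting numbers are therefore illustrative and will not reproduce Table 7.1.

```python
# Unrestricted estimates of beta3 and beta4 reported in the text
b3, b4 = 0.163, 0.103
# HYPOTHETICAL variances and covariance of (b3, b4); not taken from the source
v33, v34, v44 = 0.0045, 0.0002, 0.0048

def wald_scalar(h, r3, r4):
    """W = h^2 / (R V R') for one restriction with gradient R = (r3, r4)."""
    var_h = r3 * r3 * v33 + 2.0 * r3 * r4 * v34 + r4 * r4 * v44
    return h * h / var_h

# Linear form of the restriction: h = beta3 - beta4, so R = (1, -1)
w_linear = wald_scalar(b3 - b4, 1.0, -1.0)
# Nonlinear form: h = beta3/beta4 - 1, so R = (1/beta4, -beta3/beta4^2)
w_nonlinear = wald_scalar(b3 / b4 - 1.0, 1.0 / b4, -b3 / b4 ** 2)

print(f"W (linear form) = {w_linear:.3f}, W (nonlinear form) = {w_nonlinear:.3f}")
```

Both statistics test the same hypothesis, yet they differ because the delta-method variance depends on the algebraic form of $\mathbf{h}(\theta)$.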

The LR test statistic is especially easy to compute, using (7.21), given estimation of the restricted model. For the first three hypotheses the restricted model is estimated by Poisson regression of $y$ on, respectively, regressors $(1, x_2, x_4)$, $(1, x_2)$, and $(1, x_2, x_3 + x_4)$, where the third regression uses $\beta_3 x_3 + \beta_4 x_4 = \beta_3(x_3 + x_4)$ if $\beta_3 = \beta_4$. As an example of the LR test, for the second hypothesis $\mathrm{LR} = -2[-242.922 - (-238.772)] = 8.30$. The fourth restricted model in theory requires ML estimation subject to nonlinear constraints on the parameters, which few packages do. However, constrained ML estimation is invariant to the way the restrictions are expressed, so here the same estimates are obtained as for the third restricted model, leading to the same LR test statistic.
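
As a check on the arithmetic, the LR statistics and p-values can be recomputed from the log-likelihoods reported in Table 7.1. This is a minimal sketch: the helper `chi2_sf` is ours and covers only one or two degrees of freedom, where the chi-square tail probability has a closed form. Because the log-likelihoods are rounded to three decimals, the results match the table only up to rounding.

```python
import math

def chi2_sf(w, df):
    """Upper-tail probability of a chi-square variate, for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(w / 2.0))
    if df == 2:
        return math.exp(-w / 2.0)
    raise ValueError("only df = 1 or 2 are implemented in this sketch")

lnl_unrestricted = -238.772
# Restricted log-likelihoods and degrees of freedom from Table 7.1
restricted = {"H10": (-241.648, 1), "H20": (-242.922, 2), "H30": (-238.918, 1)}

results = {}
for name, (lnl_r, df) in restricted.items():
    lr = -2.0 * (lnl_r - lnl_unrestricted)  # LR = -2[ln L(restricted) - ln L(unrestricted)]
    results[name] = (lr, chi2_sf(lr, df))
    print(f"{name}: LR = {lr:.3f}, p-value = {results[name][1]:.3f}")
```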

The LM test statistic is computed using (7.25), which for the Poisson model specializes to (7.27). This statistic is computed using matrix commands, with different restrictions leading to the different restricted ML estimates $\tilde\theta_r$. As for the LR test, the LM test is invariant to transformations, so the LM tests of the third and fourth hypotheses are equivalent.

An asymptotically equivalent version of the LM test statistic is the statistic LM* given in (7.35). This can be computed as the explained sum of squares from the auxiliary regression (7.33). For the Poisson model $\tilde s_{ji} = \partial \ln f(y_i)/\partial\beta_j = (y_i - \exp(\mathbf{x}_i'\beta))x_{ji}$, with evaluation at the appropriate restricted MLE for the hypothesis under consideration. The statistic LM* is simpler to compute than LM, though like LM it requires restricted ML estimates.

In  this  example  with  generated  data  the  various  test  statistics  are  very  similar.  This 
is  not  always  the  case.  In  particular,  the  test  statistic  LM*  can  have  poorer  finite-sample 
size  properties  than  LM,  even  if  the  dgp  is  known.  Also,  in  applications  with  real  data 
the  dgp  is  unlikely  to  be  perfectly  specified,  leading  to  divergence  of  the  various  test 
statistics  even  in  infinitely  large  samples. 


7.5.  Tests  in  Non-ML  Settings 

The  Wald  test  is  the  standard  test  to  use  in  non-ML  settings.  From  Section  7.2  it  is  a 
general testing procedure that can always be implemented, using an appropriate sandwich
estimator of the variance matrix of the parameter estimates. The only limitation
is  that  in  some  applications  unrestricted  estimation  may  be  much  more  difficult  to 
perform  than  restricted  estimation. 

The  LM  or  score  test,  based  on  departures  from  zero  of  the  gradient  vector  of  the 
unrestricted  model  evaluated  at  the  restricted  estimates,  can  also  be  generalized  to 
non-ML  estimators.  The  form  of  the  LM  test,  however,  is  usually  considerably  more 
complicated  than  in  the  ML  case.  Moreover,  the  simplest  forms  of  the  LM  test  statistic 
based  on  auxiliary  regressions  are  usually  not  robust  to  distributional  misspecification. 

The LR test is based on the difference between the maximized values of the objective function with and without restrictions imposed. This usually does not generalize to objective functions other than the likelihood function, as this difference is usually not chi-square distributed.

For completeness we provide a condensed presentation of extension of the ML tests to m-estimators and to efficient GMM estimators. As already noted, in most applications use of the simpler Wald test is sufficient.

7.5.1. Tests Based on m-Estimators

Tests for m-estimators are straightforward extensions of those for ML estimators, except that it is no longer possible to use the information matrix equality to simplify the test statistics and the LR test generalizes in only very special cases. The resulting test statistics are asymptotically $\chi^2(h)$ distributed under $H_0: \mathbf{h}(\theta) = \mathbf{0}$ and have the same noncentral chi-square distribution under local alternatives.

Consider m-estimators that maximize $Q_N(\theta) = N^{-1}\sum_i q_i(\theta)$ with first-order conditions $N^{-1}\sum_i \mathbf{s}_i(\theta) = \mathbf{0}$. Define the $q \times q$ matrices $\mathbf{A}(\theta) = N^{-1}\sum_i \partial\mathbf{s}_i(\theta)/\partial\theta'$ and $\mathbf{B}(\theta) = N^{-1}\sum_i \mathbf{s}_i(\theta)\mathbf{s}_i(\theta)'$ and the $h \times q$ matrix $\mathbf{R}(\theta) = \partial\mathbf{h}(\theta)/\partial\theta'$. Let $\hat\theta_u$ and $\tilde\theta_r$ denote unrestricted and restricted estimators, respectively, and let $\hat{\mathbf{A}} = \mathbf{A}(\hat\theta_u)$ and $\tilde{\mathbf{A}} = \mathbf{A}(\tilde\theta_r)$ with similar notation for $\mathbf{B}$ and $\mathbf{R}$. Finally, let $\hat{\mathbf{h}} = \mathbf{h}(\hat\theta_u)$ and $\tilde{\mathbf{s}}_i = \mathbf{s}_i(\tilde\theta_r)$.

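The Wald test statistic is based on closeness of $\hat{\mathbf{h}}$ to zero. Here

$$\mathrm{W} = \hat{\mathbf{h}}'\left[\hat{\mathbf{R}}N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}\hat{\mathbf{R}}'\right]^{-1}\hat{\mathbf{h}}, \tag{7.37}$$

since from Section 5.5.1 the robust variance matrix estimate for $\hat\theta_u$ is $N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$. Packages with the option of robust standard errors use this more general form to compute Wald tests of statistical significance.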
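
To make the sandwich form concrete, the following sketch works through the scalar case with hypothetical overdispersed count data. It assumes an intercept-only exponential-mean model estimated by maximizing $N^{-1}\sum_i [y_i\theta - \exp(\theta)]$, so that $s_i = y_i - \exp(\theta)$ and $\hat\theta = \ln \bar y$, and tests $H_0: \theta = 0$. The comparison statistic `w_ml` imposes the information matrix equality $B = -A$, which need not hold for such data.

```python
import math

# Hypothetical overdispersed counts (sample variance well above the sample mean)
y = [0, 0, 0, 1, 0, 7, 0, 2, 0, 0, 9, 1, 0, 0, 4, 0, 0, 12, 1, 0]
n = len(y)

# m-estimator with score s_i = y_i - exp(theta); the first-order condition gives
# exp(theta_hat) = ybar
theta_hat = math.log(sum(y) / n)
mu_hat = math.exp(theta_hat)

a_hat = -mu_hat                                   # A = N^{-1} sum_i ds_i/dtheta
b_hat = sum((yi - mu_hat) ** 2 for yi in y) / n   # B = N^{-1} sum_i s_i^2

# H0: theta = 0, so h(theta) = theta and R = 1
h_hat = theta_hat
var_robust = b_hat / (a_hat * a_hat) / n          # N^{-1} A^{-1} B A^{-1}
var_ml = 1.0 / (-a_hat) / n                       # valid only if B = -A
w_robust = h_hat ** 2 / var_robust
w_ml = h_hat ** 2 / var_ml

print(f"robust W = {w_robust:.3f}, nonrobust W = {w_ml:.3f}")
```

With these data the nonrobust statistic exceeds the 5% $\chi^2(1)$ critical value while the robust statistic does not, which shows how much the choice of variance estimate can matter.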

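Let $\mathbf{g}(\theta) = \partial Q_N(\theta)/\partial\theta$ denote the gradient vector, and let $\tilde{\mathbf{g}} = \mathbf{g}(\tilde\theta_r) = N^{-1}\sum_i \tilde{\mathbf{s}}_i$. The LM test statistic is based on the closeness of $\tilde{\mathbf{g}}$ to $\mathbf{0}$ and is given by

$$\mathrm{LM} = N\tilde{\mathbf{g}}'\tilde{\mathbf{A}}^{-1}\tilde{\mathbf{R}}'\left[\tilde{\mathbf{R}}\tilde{\mathbf{A}}^{-1}\tilde{\mathbf{B}}\tilde{\mathbf{A}}^{-1}\tilde{\mathbf{R}}'\right]^{-1}\tilde{\mathbf{R}}\tilde{\mathbf{A}}^{-1}\tilde{\mathbf{g}}, \tag{7.38}$$

a result obtained by forming a chi-square test statistic based on (7.29), where $N\tilde{\mathbf{g}}$ replaces $\partial \ln L/\partial\theta|_{\tilde\theta_r}$. This test is clearly not as simple to implement as a robust Wald test. Some examples of computation of the robust form of LM tests are given in Section 8.4. The standard implementations of LM tests in computer packages are often not robust versions of the LM test.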

The LR test does not generalize easily. It does generalize to m-estimators if $\mathbf{B}_0 = -\alpha\mathbf{A}_0$ for some scalar $\alpha$, a weaker version of the IM equality. In such special cases the quasi-likelihood ratio (QLR) test statistic is

$$\mathrm{QLR} = -2N\left[Q_N(\tilde\theta_r) - Q_N(\hat\theta_u)\right]/\hat\alpha_u, \tag{7.39}$$

where $\hat\alpha_u$ is a consistent estimate of $\alpha$ obtained from unrestricted estimation (see Wooldridge, 2002, p. 370). The condition $\mathbf{B}_0 = -\alpha\mathbf{A}_0$ holds for generalized linear models (see Section 5.7.4). Then the statistic QLR is equivalent to the difference of deviances for the restricted and unrestricted models, a generalization of the F-test based on the difference between restricted and unrestricted sum of squared residuals for OLS and NLS estimation with homoskedastic errors. For general quasi-ML estimation, with $\mathbf{B}_0 \neq -\alpha\mathbf{A}_0$, the LR test statistic can be distributed as a weighted sum of chi-squares (see Section 8.5.3).

7.5.2. Tests Based on Efficient GMM Estimators

For GMM the various test statistics are simplest for efficient GMM, meaning GMM estimation using the optimal weighting matrix. This poses no great practical restriction as the optimal weighting matrix can always be estimated, as detailed in Section 6.3.5.

Consider GMM estimation based on the moment condition $\mathrm{E}[\mathbf{m}_i(\theta)] = \mathbf{0}$. (Note the change in notation from Chapter 6: $\mathbf{h}(\theta)$ is being used in the current chapter to denote the restrictions under $H_0$.) Using the notation introduced in Section 6.3.5, the efficient unrestricted GMM estimator $\hat\theta_u$ minimizes $Q_N(\theta) = \mathbf{g}_N(\theta)'\mathbf{S}_N^{-1}\mathbf{g}_N(\theta)$, where $\mathbf{g}_N(\theta) = N^{-1}\sum_i \mathbf{m}_i(\theta)$ and $\mathbf{S}_N$ is consistent for $\mathbf{S}_0 = \mathrm{V}[\mathbf{g}_N(\theta)]$. The restricted GMM estimator $\tilde\theta_r$ is assumed to minimize $Q_N(\theta)$ with the same weighting matrix $\mathbf{S}_N^{-1}$, subject to the restriction $\mathbf{h}(\theta) = \mathbf{0}$.

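The three following test statistics, summarized by Newey and West (1987a), are asymptotically $\chi^2(h)$ distributed under $H_0: \mathbf{h}(\theta) = \mathbf{0}$ and have the same noncentral chi-square distribution under local alternatives.

The Wald test statistic as usual is based on closeness of $\hat{\mathbf{h}}$ to zero. This yields

$$\mathrm{W} = \hat{\mathbf{h}}'\left[\hat{\mathbf{R}}N^{-1}(\hat{\mathbf{G}}'\hat{\mathbf{S}}^{-1}\hat{\mathbf{G}})^{-1}\hat{\mathbf{R}}'\right]^{-1}\hat{\mathbf{h}}, \tag{7.40}$$

since the variance of the efficient GMM estimator is $N^{-1}(\hat{\mathbf{G}}'\hat{\mathbf{S}}^{-1}\hat{\mathbf{G}})^{-1}$ from Section 6.3.5, where $\mathbf{G}_N(\theta) = \partial\mathbf{g}_N(\theta)/\partial\theta'$ and the caret denotes evaluation at $\hat\theta_u$.

The first-order conditions of efficient GMM are $\hat{\mathbf{G}}'\hat{\mathbf{S}}^{-1}\hat{\mathbf{g}} = \mathbf{0}$. The LM statistic tests whether this gradient vector is close to zero when instead evaluated at $\tilde\theta_r$, leading to

$$\mathrm{LM} = N\tilde{\mathbf{g}}'\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{G}}(\tilde{\mathbf{G}}'\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{G}})^{-1}\tilde{\mathbf{G}}'\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{g}}, \tag{7.41}$$

where the tilde denotes evaluation at $\tilde\theta_r$ and we use the Section 6.3.3 assumption that $\sqrt{N}\mathbf{g}_N(\theta_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{S}_0]$, so $\sqrt{N}\tilde{\mathbf{G}}'\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{g}} \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathrm{plim}\,\tilde{\mathbf{G}}'\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{G}}]$.

For the efficient GMM estimator the difference in minimized values of the objective function can also be compared, leading to the difference test statistic

$$\mathrm{D} = N\left[Q_N(\tilde\theta_r) - Q_N(\hat\theta_u)\right]. \tag{7.42}$$

Like W and LM, the statistic D is asymptotically $\chi^2(h)$ distributed under $H_0: \mathbf{h}(\theta) = \mathbf{0}$.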

Even in the likelihood case, this last statistic differs from the LR statistic because it uses a different objective function. The MLE minimizes $Q_N(\theta) = -N^{-1}\sum_i \ln f(y_i|\theta)$. From Section 6.3.7, the asymptotically equivalent efficient GMM estimator instead minimizes the quadratic form $Q_N(\theta) = N^{-1}\left(\sum_i \mathbf{s}_i(\theta)\right)'\left(\sum_i \mathbf{s}_i(\theta)\right)$, where $\mathbf{s}_i(\theta) = \partial \ln f(y_i|\theta)/\partial\theta$. The statistic D can be used in general, provided the GMM estimator used is the efficient GMM estimator, whereas the LR test can only be generalized for some special cases of m-estimators mentioned after (7.39).

For MM estimators, that is, in the just-identified GMM model, $\mathrm{D} = \mathrm{LM} = N Q_N(\tilde\theta_r)$, so the LM and difference tests are equivalent. For D this simplification occurs because $\mathbf{g}_N(\hat\theta_u) = \mathbf{0}$ and so $Q_N(\hat\theta_u) = 0$. For LM simplification occurs in (7.41) as then $\tilde{\mathbf{G}}_N$ is invertible.


7.6. Power and Size of Tests

The remaining sections of this chapter study two limitations in using the usual computer output to test hypotheses.

First,  a  test  can  have  little  ability  to  discriminate  between  the  null  and  alternative 
hypotheses.  Then  the  test  has  low  power,  meaning  there  is  a  low  probability  of  rejecting 
the  null  hypothesis  when  it  is  false.  Standard  computer  output  does  not  calculate  test 
power, but it can be evaluated using asymptotic methods (see this section) or finite-sample
Monte Carlo methods (see Section 7.7). If a major contribution of an empirical
paper  is  the  rejection  or  nonrejection  of  a  particular  hypothesis,  there  is  no  reason  for 
the  paper  not  to  additionally  present  the  power  of  the  test  against  some  meaningful 
alternative  hypothesis. 

Second,  the  true  size  of  the  test  may  differ  substantially  from  the  nominal  size  of 
the  test  obtained  from  asymptotic  theory.  The  rule  of  thumb  that  sample  size  N  >  30 
is  sufficient  for  asymptotic  theory  to  provide  a  good  approximation  for  inference  on  a 
single  variable  does  not  extend  to  models  with  regressors.  Poor  approximation  is  most 
likely  in  the  tails  of  the  approximating  distribution,  but  the  tails  are  used  to  obtain 
critical  values  of  tests  at  common  significance  levels  such  as  5%.  In  practice  the  critical 
value  for  a  test  statistic  obtained  from  large-sample  approximation  is  often  smaller 
than  the  correct  critical  value  based  on  the  unknown  true  distribution.  Small-sample 
refinements  are  attempts  to  get  closer  to  the  exact  critical  value.  For  linear  regression 
under  normality  exact  critical  values  can  be  obtained,  using  the  t  rather  than  z  and  the 
$F$ rather than $\chi^2$ distribution, but similar results are not exact for nonlinear regression.
Instead,  small-sample  refinements  may  be  obtained  through  Monte  Carlo  methods  (see 
Section  7.7)  or  by  use  of  the  bootstrap  (see  Section  7.8  and  Chapter  11). 

With  modern  computers  it  is  relatively  easy  to  correct  the  size  and  investigate  the 
power  of  tests  used  in  an  applied  study.  We  present  this  neglected  topic  in  some 
detail. 


7.6.1.  Test  Size  and  Power 

Hypothesis tests lead to either rejection or nonrejection of the null hypothesis. Correct decisions are made if $H_0$ is rejected when $H_0$ is false or if $H_0$ is not rejected when $H_0$ is true.

There are also two possible incorrect decisions: (1) rejecting $H_0$ when $H_0$ is true, called a type I error, and (2) nonrejection of $H_0$ when $H_0$ is false, called a type II error. Ideally the probabilities of both errors will be low, but in practice decreasing the probability of one type of error comes at the expense of increasing the probability of the other. The classical hypothesis testing solution is to fix the probability of a type I error at a particular level, usually 0.05, while leaving the probability of a type II error unspecified.

Define the size of a test or significance level

$$\alpha = \Pr[\text{type I error}] = \Pr[\text{reject } H_0 \mid H_0 \text{ true}], \tag{7.43}$$

with common choices of $\alpha$ being 0.01, 0.05, or 0.10. A hypothesis is rejected if the test statistic falls into a rejection region defined so that the test significance level equals the specified value of $\alpha$. A closely related equivalent method computes the p-value of a test, the marginal significance level at which the null hypothesis is just rejected, and rejects $H_0$ if the p-value is less than the specified value of $\alpha$. Both methods require only knowledge of the distribution of the test statistic under the null hypothesis, presented in Section 7.2 for the Wald test statistic.

Consideration should also be given to the probability of a type II error. The power of a test is defined to be

$$\begin{aligned}\text{Power} &= \Pr[\text{reject } H_0 \mid H_a \text{ true}] \\ &= 1 - \Pr[\text{accept } H_0 \mid H_a \text{ true}] \\ &= 1 - \Pr[\text{type II error}].\end{aligned} \tag{7.44}$$

Ideally, test power is close to one since then the probability of a type II error is close to zero. Determining the power requires knowledge of the distribution of the test statistic under $H_a$.

Analysis of test power is typically ignored in empirical work, except that test procedures
are usually chosen to be ones that are known theoretically to have power that, for
given  level  a,  is  high  relative  to  other  alternative  test  statistics.  Ideally,  the  uniformly 
most  powerful  (UMP)  test  is  used.  This  is  the  test  that  has  the  greatest  power,  for  given 
level  a,  for  all  alternative  hypotheses.  UMP  tests  do  exist  when  testing  a  simple  null 
hypothesis  against  a  simple  alternative  hypothesis.  Then  the  Neyman-Pearson  lemma 
gives the result that the UMP test is a function of the likelihood ratio. For more general
testing situations involving composite hypotheses there is usually no UMP test,
and  further  restrictions  are  placed  such  as  UMP  one-sided  tests.  In  practice,  power 
considerations  are  left  to  theoretical  econometricians  who  use  theory  and  simulations 
applied  to  various  testing  procedures  to  suggest  which  testing  procedures  are  the  most 
powerful. 

It  is  nonetheless  possible  to  determine  test  power  in  any  given  application.  In  the 
following  we  detail  how  to  compute  the  asymptotic  power  of  the  Wald  test,  which 
equals  that  of  the  LR  and  LM  tests  in  the  fully  parametric  case. 


7.6.2.  Local  Alternative  Hypotheses 

Since power is the probability of rejecting $H_0$ when $H_a$ is true, the computation of power requires obtaining the distribution of the test statistic under the alternative hypothesis. For a Wald chi-square test at significance level $\alpha$ the power equals $\Pr[\mathrm{W} > \chi^2_\alpha(h) \mid H_a]$. Calculation of this probability requires specification of a particular alternative hypothesis, because $H_a: \mathbf{h}(\theta) \neq \mathbf{0}$ is very broad.

The obvious choice is the fixed alternative $\mathbf{h}(\theta) = \delta$, where $\delta$ is an $h \times 1$ finite vector of nonzero constants. The quantity $\delta$ is sometimes referred to as the hypothesis error, and larger hypothesis errors lead to greater power. For a fixed alternative the Wald test statistic asymptotically has power one as it rejects the null hypothesis all the time. To see this note that if $\mathbf{h}(\theta) = \delta$ then the Wald test statistic becomes infinite, since

$$\mathrm{W} = \hat{\mathbf{h}}'(\hat{\mathbf{R}}N^{-1}\hat{\mathbf{C}}\hat{\mathbf{R}}')^{-1}\hat{\mathbf{h}} \xrightarrow{p} \delta'(\mathbf{R}_0 N^{-1}\mathbf{C}_0\mathbf{R}_0')^{-1}\delta,$$

using $\hat\theta \xrightarrow{p} \theta_0$, so $\hat{\mathbf{h}} = \mathbf{h}(\hat\theta) \xrightarrow{p} \mathbf{h}(\theta) = \delta$, and $\hat{\mathbf{C}} \xrightarrow{p} \mathbf{C}_0$. It follows that $\mathrm{W} \xrightarrow{p} \infty$ since all the terms except $N$ are finite and nonzero. This infinite value leads to $H_0$ being always rejected, as it should be, and hence having perfect power of one.

The Wald test statistic is therefore a consistent test statistic, that is, one whose power goes to one as $N \to \infty$. Many test statistics are consistent, just as many estimators are consistent. More stringent criteria are needed to discriminate among the test statistics, just as relative efficiency is used to choose among estimators.

For estimators that are root-$N$ consistent, we consider a sequence of local alternatives

$$H_a: \mathbf{h}(\theta) = \delta/\sqrt{N}, \tag{7.45}$$

where $\delta$ is a vector of fixed constants with $\delta \neq \mathbf{0}$. This sequence of alternative hypotheses, called Pitman drift, gets closer to the null hypothesis value of zero as the sample size gets larger, at the same rate $\sqrt{N}$ as used to scale up $\hat\theta$ to get a nondegenerate distribution for the consistent estimator. The alternative hypothesis value of $\mathbf{h}(\theta)$ therefore moves toward zero at a rate that negates any improved efficiency with increased sample size. For a much more detailed account of local alternatives and related literatures see McManus (1991).


7.6.3.  Asymptotic  Power  of  the  Wald  Test 

Under the sequence of local alternatives (7.45) the Wald test statistic has a nondegenerate distribution, the noncentral chi-square distribution. This permits determination of the power of the Wald test.

Specifically, as is shown in Section 7.6.4, under $H_a$ the Wald statistic W defined in (7.6) is asymptotically $\chi^2(h; \lambda)$ distributed, where $\chi^2(h; \lambda)$ denotes the noncentral chi-square distribution with noncentrality parameter

$$\lambda = \tfrac{1}{2}\delta'(\mathbf{R}_0\mathbf{C}_0\mathbf{R}_0')^{-1}\delta, \tag{7.46}$$

and $\mathbf{R}_0$ and $\mathbf{C}_0$ are defined in (7.4) and (7.5). The power of the Wald test, the probability of rejecting $H_0$ given the local alternative $H_a$ is true, is therefore

$$\text{Power} = \Pr[\mathrm{W} > \chi^2_\alpha(h) \mid \mathrm{W} \sim \chi^2(h; \lambda)]. \tag{7.47}$$

Figure 7.1 plots power against $\lambda$ for tests of a scalar hypothesis ($h = 1$) at the commonly used sizes or significance levels of 10%, 5%, and 1%. For $\lambda$ close to zero the power equals the size, and for large $\lambda$ the power goes to one.
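
For the scalar case the power in (7.47) can be evaluated without special software, because a one-degree-of-freedom noncentral chi-square variate is the square of a normal variate with unit variance. The sketch below rests on one labeled assumption about conventions: with $\lambda$ defined as in (7.46), the squared mean of that normal variate is taken to be $2\lambda$. Software packages differ in how they scale the noncentrality parameter, so numerical values should be checked against the convention in use. The function name `wald_power` is ours.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def wald_power(lam, alpha=0.05):
    """Asymptotic power of a one-degree-of-freedom Wald chi-square test.

    ASSUMPTION: lam follows (7.46), so that W = z^2 with z ~ N[sqrt(2*lam), 1].
    """
    shift = (2.0 * lam) ** 0.5                      # mean of z under the local alternative
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)   # square root of the chi-square(1) critical value
    # Pr[z^2 > crit^2] = Pr[z > crit] + Pr[z < -crit]
    return _STD_NORMAL.cdf(shift - crit) + _STD_NORMAL.cdf(-shift - crit)

for alpha in (0.10, 0.05, 0.01):
    curve = [round(wald_power(lam, alpha), 3) for lam in (0.0, 1.0, 2.0, 5.0, 10.0)]
    print(f"size {alpha:.2f}: power at lambda = 0, 1, 2, 5, 10 -> {curve}")
```

At $\lambda = 0$ the function returns the size of the test, and the returned power rises toward one as $\lambda$ grows.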

These features hold also for $h > 1$. In particular power is monotonically increasing in the noncentrality parameter $\lambda$ defined in (7.46). Several general results follow.

First, power is increasing in the distance between the null and alternative hypotheses, as then $\delta$ and hence $\lambda$ increase.

Figure 7.1: Power of Wald chi-square test with one degree of freedom for three different test sizes as the noncentrality parameter ranges from 0 to 20. (Plot title: Test power as a function of the ncp.)

Second, for given alternative $\delta$ power increases with efficiency of the estimator $\hat\theta$, as then $\mathbf{C}_0$ is smaller and hence $\lambda$ is larger.

Third, as the size of the test increases power increases and the probability of a type II error decreases.

Fourth, if several different test statistics are all $\chi^2(h)$ under the null hypothesis and noncentral $\chi^2(h)$ under the alternative, the preferred test statistic is that with the highest noncentrality parameter $\lambda$ since then power is the highest. Furthermore, two tests that have the same noncentrality parameter are asymptotically equivalent under local alternatives.

Finally, in actual applications one can calculate the power as a function of $\delta$. Specifically, for a specified alternative $\delta$, an estimated noncentrality parameter $\hat\lambda$ can be computed using (7.46) with parameter estimate $\hat\theta$ and associated estimates $\hat{\mathbf{R}}$ and $\hat{\mathbf{C}}$. Such power calculations are illustrated in Section 7.6.5.

7.6.4.  Derivation  of  Asymptotic  Power 

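To obtain the distribution of W under $H_a$, begin with the Taylor series expansion result (7.9). This simplifies to

$$\sqrt{N}\mathbf{h}(\hat\theta) \xrightarrow{d} \mathcal{N}[\delta, \mathbf{R}_0\mathbf{C}_0\mathbf{R}_0'], \tag{7.48}$$

under $H_a$, since then $\sqrt{N}\mathbf{h}(\theta) = \delta$. Thus a quadratic form centered at $\delta$ would be chi-square distributed under $H_a$.

The Wald test statistic W defined in (7.6) instead forms a quadratic form centered at $\mathbf{0}$ and is no longer chi-square distributed under $H_a$. In general if $\mathbf{z} \sim \mathcal{N}[\mu, \Omega]$, where $\mathrm{rank}(\Omega) = h$, then $\mathbf{z}'\Omega^{-1}\mathbf{z} \sim \chi^2(h; \lambda)$, where $\chi^2(h; \lambda)$ denotes the noncentral chi-square distribution with noncentrality parameter $\lambda = \frac{1}{2}\mu'\Omega^{-1}\mu$. Applying this result to (7.48) yields

$$N\mathbf{h}(\hat\theta)'(\mathbf{R}_0\mathbf{C}_0\mathbf{R}_0')^{-1}\mathbf{h}(\hat\theta) \xrightarrow{d} \chi^2(h; \lambda), \tag{7.49}$$

under $H_a$, where $\lambda$ is defined in (7.46).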
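
The quadratic-form result can be checked by simulation. The sketch below is illustrative only: the mean vector and the diagonal variance matrix are hypothetical, and the check uses the standard fact that a noncentral chi-square variate built from $\mathbf{z} \sim \mathcal{N}[\mu, \Omega]$ has mean $h + \mu'\Omega^{-1}\mu$, which equals $h + 2\lambda$ in the notation used here.

```python
import random

def simulate_quadratic_form_mean(mu, omega_diag, reps=100_000, seed=2024):
    """Average of z' Omega^{-1} z over draws z ~ N[mu, diag(omega_diag)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # With diagonal Omega the quadratic form is a sum of squared draws,
        # each divided by its own variance
        total += sum(rng.gauss(m, v ** 0.5) ** 2 / v for m, v in zip(mu, omega_diag))
    return total / reps

mu = [1.0, 0.5]           # hypothetical mean vector, so h = 2
omega_diag = [4.0, 1.0]   # hypothetical diagonal variance matrix

# lambda = (1/2) mu' Omega^{-1} mu
lam = 0.5 * sum(m * m / v for m, v in zip(mu, omega_diag))

simulated_mean = simulate_quadratic_form_mean(mu, omega_diag)
theoretical_mean = len(mu) + 2.0 * lam   # h + mu' Omega^{-1} mu

print(f"lambda = {lam:.3f}, simulated mean = {simulated_mean:.3f}, "
      f"theoretical mean = {theoretical_mean:.3f}")
```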

7.6.5. Calculation of Asymptotic Power

To shed light on how power changes with $\delta$, consider tests of coefficient significance in the scalar case. Then the noncentrality parameter defined in (7.46) is

$$\lambda = \frac{\delta^2}{2c} \simeq \frac{(\delta/\sqrt{N})^2}{2(\mathrm{se}[\hat\theta])^2}, \tag{7.50}$$

where the approximation arises because of estimation of $c$, the limit variance of $\sqrt{N}(\hat\theta - \theta)$, by $N(\mathrm{se}[\hat\theta])^2$, where $\mathrm{se}[\hat\theta]$ is the standard error of $\hat\theta$.

Consider a Wald chi-square test of $H_0: \theta = 0$ against the alternative hypothesis that $\theta$ is within $a$ standard errors of zero, that is, against

$$H_a: \theta = a \times \mathrm{se}[\hat\theta],$$

where $\mathrm{se}[\hat\theta]$ is treated here as a constant. Then $\delta/\sqrt{N}$ in (7.45) equals $a \times \mathrm{se}[\hat\theta]$, so that (7.50) simplifies to $\lambda = a^2/2$. Thus the Wald test is asymptotically $\chi^2(1; \lambda)$ under $H_a$ where $\lambda = a^2/2$.

From Figure 7.1 it is clear for the common case of significance level tests at 5% that if $a = 2$ the power is well below 0.5, if $a = 4$ the power is around 0.5, and if $a = 6$ the power is still below 0.9. A borderline test of statistical significance can therefore have low power against alternatives that are many standard errors from zero. Intuitively, if $\hat\theta = 2\,\mathrm{se}[\hat\theta]$ then a test of $\theta = 0$ against $\theta = 4\,\mathrm{se}[\hat\theta]$ has power of approximately 0.5, because a 95% confidence interval for $\theta$ is approximately $(0, 4\,\mathrm{se}[\hat\theta])$, implying that values of $\theta = 0$ or $\theta = 4\,\mathrm{se}[\hat\theta]$ are just as likely.

As a more concrete example, suppose $\theta$ measures the percentage increase in wage resulting from a training program, and that a study finds $\hat\theta = 6$ with $\mathrm{se}[\hat\theta] = 4$. Then the Wald test at 5% significance level leads to nonrejection of $H_0$, since $\mathrm{W} = (6/4)^2 = 2.25 < \chi^2_{.05}(1) = 3.84$. The conclusion of such a study will often state that the training program is not statistically significant. One should not interpret this as meaning that there is a high probability that the training program has no effect, however, as this test has low power. For example, the preceding analysis indicates that a test of $H_0: \theta = 0$ against $H_a: \theta = 16$, a relatively large training effect, has power of only 0.5, since $4 \times \mathrm{se}[\hat\theta] = 16$. Reasons for low power include small sample size, large model error variance, and small spread in the regressors.

In  simple  cases,  solving  the  inverse  problem  of  estimating  the  minimum  sample  size 
needed  to  achieve  a  given  desired  level  of  power  is  possible.  This  is  especially  popular 
in  medical  studies. 

Andrews  (1989)  gives  a  more  formal  treatment  of  using  the  noncentrality  parameter 
to  determine  regions  of  the  parameter  space  against  which  a  test  in  an  empirical  setting 
is likely to have low power. He provides many applied examples where it is easy to
determine  that  tests  have  low  power  against  meaningful  alternatives. 

7.7.  Monte  Carlo  Studies 

Our  discussion  of  statistical  inference  has  so  far  relied  on  asymptotic  results.  For  small 
samples  analytical  results  are  rarely  available,  aside  from  tests  of  linear  restrictions  in 
the linear regression model under normality. Small-sample results can nonetheless be
obtained  by  performing  a  Monte  Carlo  study. 

7.7.1.  Overview 

An  example  of  a  Monte  Carlo  study  of  the  small-sample  properties  of  a  test  statistic  is 
the  following.  Set  the  sample  size  N  to  40,  say,  and  randomly  generate  10,000  samples 
of size 40 under the $H_0$ model. For each replication (sample) form the test statistic of
interest and test $H_0$, rejecting $H_0$ if the test statistic falls in the rejection region, usually
determined  by  asymptotic  results. 

The  true  size  or  actual  size  of  the  test  statistic  is  simply  the  fraction  of  replications 
for  which  the  test  statistic  falls  in  the  rejection  region.  Ideally,  this  is  close  to  the 
nominal  size,  which  is  the  chosen  significance  level  of  the  test.  For  example,  if  testing 
at  5%  the  nominal  test  size  is  0.05  and  the  true  size  is  hopefully  close  to  0.05. 

Determining test power in small samples requires additional simulation, with samples
generated under one or more particular specifications of the possible models that
lie in the composite alternative hypothesis $H_a$. The power is calculated as the fraction
of replications for which the null hypothesis is rejected, using either the same test as used
in determining the true size, or a size-corrected version of the test that uses a rejection
region such that the nominal size equals the true size.

Monte  Carlo  studies  are  simple  to  implement,  but  there  are  many  subtleties  involved 
in  designing  a  good  Monte  Carlo  study.  For  an  excellent  discussion  see  Davidson  and 
MacKinnon  (1993). 


7.7.2.  Monte  Carlo  Details 

As  an  example  of  a  Monte  Carlo  study  we  consider  statistical  inference  on  the  slope 
coefficient  in  a  probit  model.  The  following  analysis  does  not  rely  on  knowledge  of 
the  probit  model. 

The data-generating process is a probit model, with binary dependent variable y equal to one
with probability

    $\Pr[y = 1 \mid x] = \Phi(\beta_1 + \beta_2 x)$,

where $\Phi(\cdot)$ is the standard normal cdf, $x \sim \mathcal{N}[0, 1]$, and $(\beta_1, \beta_2) = (0, 1)$.

The  data  (y,  x)  are  easily  generated  for  this  dgp.  The  regressor  x  is  first  obtained  as 
a  random  draw  from  the  standard  normal  distribution.  Then,  from  Section  14.4.2  the 
dependent  variable  y  is  set  equal  to  1  if  x  +  u  >  0  and  is  set  to  0  otherwise,  where  u 
is  a  random  draw  from  the  standard  normal.  For  this  dgp  y  =  1  roughly  half  the  time 
and  y  =  0  the  other  half. 

In  each  simulation  N  new  observations  of  both  x  and  y  are  drawn,  and  the  MLE 
from  probit  regression  of  y  on  x  is  obtained.  An  alternative  is  to  use  the  same  N  draws 
of  the  regressor  x  in  each  simulation  and  only  redraw  y.  The  former  setup  corresponds 
to  simple  random  sampling  and  the  latter  corresponds  to  analysis  conditional  on  x  or 
“fixed  in  repeated  trials”;  see  Section  4.4.7. 

Monte  Carlo  studies  often  consider  a  range  of  sample  sizes.  Here  we  simply 
set  N  =  40.  Programs  can  be  checked  by  also  setting  a  very  large  value  of  N, 
say N = 10,000, as then Monte Carlo results should be very close to asymptotic
results. 

Numerous simulations are needed to determine actual test size, because this depends
on behavior in the tails of the distribution rather than the center. If S simulations
are run for a test of true size $\alpha$, then the proportion of times the null hypothesis is
rejected is an outcome from S binomial trials, with mean $\alpha$ and variance
$\alpha(1 - \alpha)/S$. So 95% of Monte Carlos will estimate the test size to be in the interval
$\alpha \pm 1.96\sqrt{\alpha(1 - \alpha)/S}$. A mere 100 simulations is not enough since, for example,
this interval is (0.007, 0.093) when $\alpha = 0.05$. For 10,000 simulations the 95% interval
is much more precise, equalling (0.008, 0.012), (0.046, 0.054), (0.094, 0.106), and
(0.192, 0.208) for $\alpha$ equal to, respectively, 0.01, 0.05, 0.10, and 0.20. Here S = 10,000
simulations are used.
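
The intervals quoted above follow directly from the binomial variance formula. A minimal sketch, with the helper name `simulation_interval` chosen here purely for illustration:

```python
import math


def simulation_interval(alpha, n_sims, z=1.96):
    """95% interval for the estimated size of a test with true size alpha
    when the size is estimated from n_sims independent replications."""
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_sims)
    return alpha - half_width, alpha + half_width


# S = 100 is too few replications to pin down a 5% test size
print("S = 100, alpha = 0.05:", simulation_interval(0.05, 100))

# S = 10,000 gives much tighter intervals
for alpha in (0.01, 0.05, 0.10, 0.20):
    low, high = simulation_interval(alpha, 10_000)
    print(f"S = 10,000, alpha = {alpha:.2f}: ({low:.3f}, {high:.3f})")
```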

A  problem  that  can  arise  in  Monte  Carlo  simulations  is  that  for  some  simulation 
samples  the  model  may  not  be  estimable.  For  example,  consider  linear  regression  on 
just  an  intercept  and  an  indicator  variable.  If  the  indicator  variable  happens  to  always 
take the same value, say 0, in a simulation sample then its coefficient cannot be separately
identified from that for the intercept. A similar problem arises in the probit and
other  binary  outcome  models,  if  all  ys  are  0  or  all  ys  are  1  in  a  simulation  sample.  The 
standard  procedure,  which  can  be  criticized,  is  to  drop  such  simulation  samples,  and  to 
write  computer  code  that  permits  the  simulation  loop  to  continue  when  such  a  problem 
arises.  In  this  example  the  problem  did  not  arise  with  N  =  40,  but  it  did  for  N  =  30. 
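
The design described in this section can be sketched in a few dozen lines. The code below is an illustrative sketch, not the program used for the results reported here: the function names, the seed, and the use of Fisher scoring for the probit MLE are assumptions, and the standard error is taken from the inverse of the expected information. Samples in which the model cannot be estimated are dropped so that the simulation loop continues.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_sample(n, beta1, beta2, rng):
    """x ~ N[0,1]; y = 1 if beta1 + beta2*x + u > 0 with u ~ N[0,1]."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if beta1 + beta2 * x + rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]
    return ys, xs


def probit_mle(ys, xs, max_iter=50, tol=1e-8):
    """Probit MLE of the slope by Fisher scoring.
    Returns (b2, se_b2), or None if the sample is not estimable."""
    if len(set(ys)) < 2:  # all y equal 0 or all equal 1
        return None
    b1 = b2 = 0.0
    for _ in range(max_iter):
        g1 = g2 = a11 = a12 = a22 = 0.0
        for y, x in zip(ys, xs):
            index = b1 + b2 * x
            p = min(max(STD_NORMAL.cdf(index), 1e-10), 1 - 1e-10)
            f = STD_NORMAL.pdf(index)
            score = (y - p) * f / (p * (1 - p))
            weight = f * f / (p * (1 - p))  # expected-information weight
            g1 += score
            g2 += score * x
            a11 += weight
            a12 += weight * x
            a22 += weight * x * x
        det = a11 * a22 - a12 * a12
        if det <= 1e-12:  # e.g. perfect separation
            return None
        d1 = (a22 * g1 - a12 * g2) / det
        d2 = (a11 * g2 - a12 * g1) / det
        b1 += d1
        b2 += d2
        if max(abs(d1), abs(d2)) < tol:
            # information is from one (negligible) step before the final update
            return b2, math.sqrt(a11 / det)
    return None  # no convergence: treat as not estimable


def monte_carlo(n=40, sims=10_000, beta2_true=1.0, beta2_null=1.0, seed=12345):
    """Return a list of (b2, se, z) for the estimable simulation samples."""
    rng = random.Random(seed)
    results = []
    for _ in range(sims):
        ys, xs = draw_sample(n, 0.0, beta2_true, rng)
        fit = probit_mle(ys, xs)
        if fit is None:
            continue  # drop the sample and keep looping
        b2, se = fit
        results.append((b2, se, (b2 - beta2_null) / se))
    return results


def rejection_rate(results, alpha):
    """Fraction of samples with |z| above the asymptotic critical value."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return sum(abs(z) > crit for _, _, z in results) / len(results)


if __name__ == "__main__":
    res = monte_carlo(sims=2_000)  # fewer replications than the text, for speed
    print("samples kept:", len(res))
    print("mean of slope estimates:", sum(r[0] for r in res) / len(res))
    print("estimated size at 5%:", rejection_rate(res, 0.05))
```

Calling `monte_carlo` with `beta2_true=2.0` and `beta2_null=1.0` generates data under an alternative, so `rejection_rate` then estimates power rather than size. Results will differ from those reported in the text because the random draws differ.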

7.7.3.  Small-Sample  Bias 

Before moving to testing we look at the small-sample properties of the MLE $\hat{\beta}_2$ and
its estimated standard error $\mathrm{se}[\hat{\beta}_2]$.

Across the 10,000 simulations $\hat{\beta}_2$ had mean 1.201 and standard deviation 0.452,
whereas $\mathrm{se}[\hat{\beta}_2]$ had mean 0.359. The MLE is therefore biased upward in small samples,
as the average of $\hat{\beta}_2$ is considerably greater than $\beta_2 = 1$. The standard errors are
biased downward in small samples since the average of $\mathrm{se}[\hat{\beta}_2]$ is considerably smaller
than the standard deviation of $\hat{\beta}_2$.


7.7.4. Test Size


We consider a two-sided test of $H_0: \beta_2 = 1$ against $H_a: \beta_2 \neq 1$, using the Wald test

    $z = \dfrac{\hat{\beta}_2 - 1}{\mathrm{se}[\hat{\beta}_2]}$,

where $\mathrm{se}[\hat{\beta}_2]$ is the standard error of the MLE estimated using the variance matrix
given in Section 14.3.2, which is minus the inverse of the expected Hessian. Given the
dgp, asymptotically z is standard normal distributed and $z^2$ is chi-squared distributed.
The goal is to find how well this approximates the small-sample distribution.

Figure 7.2 gives the density for the S = 10,000 computed values of z, where the density
is plotted using the kernel density estimate of Chapter 9 rather than a histogram.
This is superimposed on the standard normal density. Clearly the asymptotic result is
not exact, especially in the upper tail where the difference is clearly large enough to
lead to size distortions when testing at, say, 5%. Also, across the simulations z has
mean $0.114 \neq 0$ and standard deviation $0.956 \neq 1$.

Table 7.2. Wald Test Size and Power for Probit Regression Example (a)

Nominal Size (alpha)   Actual Size   Actual Power   Asymptotic Power
0.01                   0.005         0.007          0.272
0.05                   0.029         0.226          0.504
0.10                   0.081         0.608          0.628
0.20                   0.192         0.858          0.755

(a) The dgp for y is the probit with $\Pr[y = 1] = \Phi(0 + \beta_2 x)$ and sample size N = 40. The test is a
two-sided Wald test of whether or not the slope coefficient equals 1. Actual size is calculated from
S = 10,000 simulations with $\beta_2 = 1$ and power is calculated from 10,000 simulations with $\beta_2 = 2$.

The first two columns of Table 7.2 give the nominal size and the actual size of
the Wald test for nominal sizes $\alpha = 0.01$, 0.05, 0.10, and 0.20. The actual size is the
proportion of the 10,000 simulations in which $|z| > z_{\alpha/2}$, or equivalently that $z^2 >
\chi^2_{\alpha}(1)$. Clearly the actual size of the test is much less than the nominal size for $\alpha \leq
0.10$. An ad hoc small-sample correction is to instead assume that z is t distributed
with 38 degrees of freedom, and reject if $|z| > t_{\alpha/2}(38)$. However, this leads to even
smaller actual size, since $t_{\alpha/2}(38) > z_{\alpha/2}$.

The Monte Carlo simulations can also be used to obtain size-corrected critical values.
Thus the lower and upper 2.5 percentiles of the 10,000 simulated values of z are
-1.905 and 2.003. It follows that an asymmetric rejection region with actual size 0.05
is $z < -1.905$ or $z > 2.003$, a larger rejection region than $|z| > 1.960$.
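
Size-corrected critical values are simply empirical percentiles of the simulated statistics. The sketch below assumes the simulated z values are held in a list; the helper names are illustrative, and the standard normal draws are placeholders rather than the simulated values discussed in the text.

```python
import random


def empirical_quantile(values, prob):
    """Quantile of a list of numbers using linear interpolation."""
    ordered = sorted(values)
    position = prob * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    frac = position - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def size_corrected_critical_values(z_values, alpha=0.05):
    """Lower and upper critical values giving actual size alpha
    in the simulated distribution of the test statistic."""
    return (empirical_quantile(z_values, alpha / 2),
            empirical_quantile(z_values, 1 - alpha / 2))


# Placeholder input: in practice z_sim would hold the S simulated Wald statistics
rng = random.Random(0)
z_sim = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
lower_cv, upper_cv = size_corrected_critical_values(z_sim)
print(f"reject H0 if z < {lower_cv:.3f} or z > {upper_cv:.3f}")
```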


7.7.5.  Test  Power 

We consider power of the Wald test under $H_a: \beta_2 = 2$. We would expect the power to
be reasonable because this value of $\beta_2$ lies two to three standard errors away from the
null hypothesis value of $\beta_2 = 1$, given that $\mathrm{se}[\hat{\beta}_2]$ has average value 0.359. The actual
and nominal power of the Wald test are given in the last two columns of Table 7.2.

Figure 7.2: Density of Wald test statistic that slope coefficient equals one computed by
Monte Carlo simulation with standard normal density also plotted for comparison. Data are
generated from a probit regression model. [Plot title: "Monte Carlo Simulations of Wald Test";
horizontal axis: Wald test statistic.]

The actual power is obtained in the same way as actual size, being the proportion
of the 10,000 simulations in which $|z| > z_{\alpha/2}$. The only change is that, in generating y
in the simulation, $\beta_2 = 2$ rather than 1. The actual power is very low for $\alpha = 0.01$ and
0.05, cases where the actual size is much less than the nominal size.

The nominal power of the Wald test is determined using the asymptotic noncentral
$\chi^2(1, \lambda)$ distribution under $H_a$, where from (7.50) $\lambda = \tfrac{1}{2}(\delta/\sqrt{N})^2 / \mathrm{se}[\hat{\beta}_2]^2 =
\tfrac{1}{2} \times 1^2 / 0.359^2 \simeq 3.88$, since the local alternative is that $H_a: \beta_2 - 1 = \delta/\sqrt{N}$, so
$\delta/\sqrt{N} = 1$ for $\beta_2 = 2$. The asymptotic result is not exact, but it does provide a useful
estimate of the power for $\alpha = 0.10$ and 0.20, cases where the true size closely matches
the nominal size.
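
For a single restriction the asymptotic power can be computed without special functions. The sketch below is hedged on one assumption: the value 3.88 is treated as the noncentrality parameter in the convention where a noncentral chi-square(1) variate is $(Z + \sqrt{\lambda})^2$ with Z standard normal. Under that assumption the results are close to the last column of Table 7.2. The function name is illustrative.

```python
from statistics import NormalDist


def asymptotic_wald_power(noncentrality, alpha):
    """P[(Z + sqrt(noncentrality))^2 > chi-square(1) critical value],
    with Z standard normal (assumed noncentrality convention)."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = noncentrality ** 0.5
    upper_tail = 1 - std_normal.cdf(crit - shift)
    lower_tail = std_normal.cdf(-crit - shift)
    return upper_tail + lower_tail


lam = 0.5 * 1.0 ** 2 / 0.359 ** 2  # about 3.88, as computed in the text
for alpha in (0.01, 0.05, 0.10, 0.20):
    print(f"alpha = {alpha:.2f}: power = {asymptotic_wald_power(lam, alpha):.3f}")
```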


7.7.6.  Monte  Carlo  in  Practice 

The  preceding  discussion  has  emphasized  use  of  the  Monte  Carlo  analysis  to  calculate 
test  power  and  size.  A  Monte  Carlo  analysis  can  also  be  very  useful  for  determining 
small-sample  bias  in  an  estimator  and,  by  setting  N  large,  for  determining  that  an 
estimator  is  actually  consistent.  Such  Monte  Carlo  routines  are  very  simple  to  run 
using  current  computer  packages. 

A  Monte  Carlo  analysis  can  be  applied  to  real  data  if  the  conditional  distribution 
of  y  given  x  is  fully  parametrized.  For  example,  consider  a  probit  model  estimated 
with  real  data.  In  each  simulation  the  regressors  are  set  at  their  sample  values,  if  the 
sampling  framework  is  one  of  fixed  regressors  in  repeated  samples,  while  a  new  set  of 
values  for  the  binary  dependent  variable  y  needs  to  be  generated.  This  will  depend  on 
what values of the parameters $\boldsymbol{\beta}$ are used. Let $\hat{\beta}_1, \ldots, \hat{\beta}_K$ denote the probit estimates
from the original sample and consider a Wald test of $H_0: \beta_j = 0$. To calculate test size,
generate S simulation samples by setting $\beta_k = \hat{\beta}_k$ for $k \neq j$ and setting $\beta_j = 0$, and
then calculate the proportion of simulations in which $H_0: \beta_j = 0$ is rejected. To estimate
the power of the Wald test against a specific alternative $H_a: \beta_j = 1$, say, generate
y with $\beta_k = \hat{\beta}_k$ for $k \neq j$ and $\beta_j = 1$, and calculate the proportion of
simulations in which $H_0: \beta_j = 0$ is rejected.

In  practice  much  microeconometric  analysis  is  based  on  estimators  that  are  not 
based  on  fully  parametric  models.  Then  additional  distributional  assumptions  are 
needed  to  perform  a  Monte  Carlo  analysis. 

Alternatively,  power  can  be  calculated  using  asymptotic  methods  rather  than  finite- 
sample  methods.  Additionally  the  bootstrap,  presented  next,  can  be  used  to  obtain  size 
using  a  more  refined  asymptotic  theory. 


7.8.  Bootstrap  Example 

The  bootstrap  is  a  variant  of  Monte  Carlo  simulation  that  has  the  attraction  of  being 
implementable  with  fewer  parametric  assumptions  and  with  little  additional  program 
code beyond that required to estimate the model in the first place. Essential ingredients
for  the  bootstrap  to  be  valid  are  that  the  estimator  actually  has  a  limit  distribution  and 
that  the  bootstrap  resamples  quantities  that  are  iid. 

The  bootstrap  has  two  general  uses.  First,  it  can  be  used  as  an  alternative  way  to 
compute statistics without asymptotic refinement. This is particularly useful for computing
standard errors when analytical formulas are complex. Second, it can be used
to  implement  a  refinement  of  the  usual  asymptotic  theory  that  may  provide  a  better 
finite-sample  approximation  to  the  distribution  of  test  statistics. 

We illustrate the bootstrap to implement a Wald test, ahead of a complete treatment
in Chapter 11.


7.8.1.  Inference  Using  Standard  Asymptotics 

Consider again a probit example with binary dependent variable y equal to one with probability
$p = \Phi(\gamma + \beta x)$, where $\Phi(\cdot)$ is the standard normal cdf. Interest lies in testing $H_0:
\beta = 1$ against $H_a: \beta \neq 1$ at significance level 0.05. The analysis here does not require
knowledge of the probit model.

One sample of size N = 30 is generated. Probit ML estimation yields $\hat{\beta} = 0.817$
and $s_{\hat{\beta}} = 0.294$, where the standard error is based on $-\hat{A}^{-1}$, so the test statistic $z =
(0.817 - 1)/0.294 = -0.623$.

Using standard asymptotic theory we obtain 5% critical values of -1.96 and 1.96,
since $z_{.025} = 1.96$, and $H_0$ is not rejected.


7.8.2.  Bootstrap  without  Asymptotic  Refinement 

The  departure  point  of  the  bootstrap  method  is  to  resample  from  an  approximation  to 
the  population;  see  Section  11.2.1.  The  paired  bootstrap  does  so  by  resampling  from 
the  original  sample. 

Thus form B pseudo-samples of size N by drawing with replacement from the original
data $\{(y_i, x_i),\ i = 1, \ldots, N\}$. For example, the first pseudo-sample of size 30 may
have $(y_1, x_1)$ once, $(y_2, x_2)$ not at all, $(y_3, x_3)$ twice, and so on. Estimating the model in
each pseudo-sample yields B estimates $\hat{\beta}^*_1, \ldots, \hat{\beta}^*_B$ of the parameter of interest $\beta$, which
can be used to estimate features of the distribution of the original estimate $\hat{\beta}$.

For example, suppose the computer program used to estimate a probit model reports
$\hat{\beta}$ but not the standard error $s_{\hat{\beta}}$. The bootstrap solves this problem since we can use
the estimated standard deviation $s_{\hat{\beta},\mathrm{boot}}$ of $\hat{\beta}^*_1, \ldots, \hat{\beta}^*_B$ from the B bootstrap
pseudo-samples. Given this standard error estimate it is possible to perform a Wald hypothesis
test on $\beta$.

For the probit Wald test example, the resulting bootstrap estimate of the standard
error of $\hat{\beta}$ is 0.376, leading to $z = (0.817 - 1)/0.376 = -0.487$. Since -0.487 lies in
(-1.96, 1.96) we do not reject $H_0$ at 5%.

This  use  of  the  bootstrap  to  test  hypotheses  does  not  lead  to  size  improvements  in 
small  samples.  However,  it  can  lead  to  great  time  savings  in  many  applications  if  it  is 
difficult  to  otherwise  obtain  the  standard  errors  for  an  estimator. 
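
A paired bootstrap standard error needs only a resampling loop around the estimator. The sketch below is generic: `paired_bootstrap_se` accepts any function that maps a list of (y, x) pairs to an estimate. To stay self-contained it is demonstrated with an OLS slope as a stand-in estimator, not the probit MLE of the example, and with made-up data; all names and the choice of B = 999 are assumptions.

```python
import random
import statistics


def ols_slope(pairs):
    """Slope from OLS of y on an intercept and x (stand-in estimator)."""
    x_bar = statistics.fmean(x for _, x in pairs)
    y_bar = statistics.fmean(y for y, _ in pairs)
    sxx = sum((x - x_bar) ** 2 for _, x in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for y, x in pairs)
    return sxy / sxx


def paired_bootstrap_se(pairs, estimator, n_boot=999, seed=0):
    """Standard deviation of the estimator across pseudo-samples
    drawn with replacement from the original (y, x) pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_boot):
        pseudo_sample = [pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(pseudo_sample))
    return statistics.stdev(estimates)


# Illustrative data only: N = 30 draws from y = 1 + x + u
data_rng = random.Random(42)
sample = []
for _ in range(30):
    x = data_rng.gauss(0.0, 1.0)
    sample.append((1.0 + x + data_rng.gauss(0.0, 1.0), x))

beta_hat = ols_slope(sample)
se_boot = paired_bootstrap_se(sample, ols_slope)
z = (beta_hat - 1.0) / se_boot  # Wald statistic for H0: slope = 1
print(f"estimate = {beta_hat:.3f}, bootstrap s.e. = {se_boot:.3f}, z = {z:.3f}")
```

Replacing `ols_slope` with a function that returns the probit slope estimate would give the procedure described in this section, provided pseudo-samples in which the probit model is not estimable are handled.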

7.8.3. Bootstrap with Asymptotic Refinement

Some  bootstraps  can  lead  to  a  better  asymptotic  approximation  to  the  distribution  of 
z.  This  is  likely  to  lead  to  finite-sample  critical  values  that  are  better  in  the  sense  that 
the actual size is likely to be closer to the nominal size of 0.05. Details are provided in
Chapter 11. Here we illustrate the method.

Again form B pseudo-samples of size N by drawing with replacement from the
original data. Estimate the probit model in each pseudo-sample and for the bth
pseudo-sample compute $z^*_b = (\hat{\beta}^*_b - \hat{\beta})/s_{\hat{\beta}^*_b}$, where $\hat{\beta}$ is the original estimate. The
bootstrap distribution for the original test statistic z is then the empirical distribution
of $z^*_1, \ldots, z^*_B$ rather than the standard normal. The lower and upper 2.5 percentiles of
this empirical distribution give the bootstrap critical values.

For the example here with B = 1,000 the lower and upper 2.5 percentiles of the
empirical bootstrap distribution of z were found to be -2.62 and 1.83. The bootstrap
critical values for testing at 5% are then -2.62 and 1.83, rather than the usual ±1.96.
Since the initial sample test statistic z = -0.623 lies in (-2.62, 1.83) we do not reject
$H_0: \beta = 1$. A bootstrap p-value can also be computed.

Unlike  the  bootstrap  in  the  previous  section,  an  asymptotic  improvement  occurs 
here because the studentized test statistic z is asymptotically pivotal (see Section
11.2.3) whereas the estimator $\hat{\beta}$ is not.

7.9.  Practical  Considerations 

Microeconometrics research places emphasis on statistical inference based on minimal
distributional assumptions, using robust estimates of the variance matrix of an
estimator.  There  is  no  sense  in  robust  inference,  however,  if  failure  of  distributional 
assumptions  leads  to  the  more  serious  complication  of  estimator  inconsistency  as  can 
happen  for  some  though  not  all  ML  estimators. 

Many  packages  provide  a  “robust”  standard  errors  option  in  estimator  commands. 
In microeconometrics packages robust often means heteroskedastic consistent and does
not guard against other complications such as clustering (see Section 24.5) that can
also lead to invalid statistical inference.

Robust inference is usually implemented using a Wald test. The Wald test has the
weakness of lack of invariance to reparametrization of nonlinear hypotheses, though this may
be diminished by performing an appropriate bootstrap. Standard auxiliary regressions
for the LM test and implementations of LM tests on computer packages are usually
not robustified, though in some cases relatively simple robustification of the LM test
is possible (see Section 8.4).

The  power  of  tests  can  be  weak.  Ideally,  power  against  some  meaningful  alternative 
would  be  reported.  Failing  this,  as  Section  7.6  indicates,  one  should  be  careful  about 
overstating  the  conclusions  from  a  hypothesis  test  unless  parameters  are  very  precisely 
estimated. 

The  finite  sample  size  of  tests  derived  from  asymptotic  theory  is  also  an  issue.  The 
bootstrap method, detailed in Chapter 11, has the potential to yield hypothesis tests
and confidence intervals with much better finite-sample properties.

Statistical inference can be quite fragile, so these issues are of importance to the
practitioner. Consider a two-tailed Wald test of statistical significance when $\hat{\theta} = 1.96$,
and assume the test statistic is indeed standard normal distributed. If $s_{\hat{\theta}} = 1.0$ then
t = 1.96 and the p-value is 0.050. However, the true p-value is a much higher 0.117
if the standard error was underestimated by 20% (so correct t = 1.57), and a much
lower 0.019 if the standard error was overestimated by 20% (so t = 2.35).
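
The sensitivity of the p-value to the standard error can be recomputed directly. The sketch assumes that "underestimated by 20%" means the reported standard error is 0.8 times the correct one, and "overestimated by 20%" means it is 1.2 times the correct one; these readings are inferred from the t values quoted above.

```python
from statistics import NormalDist


def two_sided_p_value(t_stat):
    """Two-sided p-value when the statistic is standard normal under H0."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


theta_hat = 1.96
reported_se = 1.0
cases = (
    ("reported s.e. is correct", reported_se),
    ("reported s.e. is 0.8 x correct s.e.", reported_se / 0.8),
    ("reported s.e. is 1.2 x correct s.e.", reported_se / 1.2),
)
for label, correct_se in cases:
    t = theta_hat / correct_se
    print(f"{label}: t = {t:.2f}, p-value = {two_sided_p_value(t):.3f}")
```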

7.10.  Bibliographic  Notes 

The  econometrics  texts  by  Gourieroux  and  Monfort  (1989)  and  Davidson  and  MacKinnon 
(1993)  give  quite  lengthy  treatment  of  hypothesis  testing.  The  presentation  here  considers  only 
equality  restrictions.  For  tests  of  inequality  restrictions  see  Gourieroux,  Holly,  and  Monfort 
(1982)  for  the  linear  case  and  Wolak  (1991)  for  the  nonlinear  case.  For  hypothesis  testing  when 
the  parameters  are  at  the  boundary  of  the  parameter  space  under  the  null  hypothesis  the  tests 
can  break  down;  see  Andrews  (2001). 

7.3  A  useful  graphical  treatment  of  the  three  classical  test  procedures  is  given  by  Buse  (1982). 

7.5  Newey  and  West  (1987a)  present  extension  of  the  classical  tests  to  GMM  estimation. 

7.6  Davidson  and  MacKinnon  (1993)  give  considerable  discussion  of  power  and  explain  the 
distinction  between  explicit  and  implicit  null  and  alternative  hypotheses. 

7.7  For  Monte  Carlo  studies  see  Davidson  and  MacKinnon  (1993)  and  Hendry  (1984). 

7.8  The bootstrap method due to Efron (1979) is detailed in Chapter 11.


- Exercises - 

7-1  Suppose a sample yields estimates $\hat{\theta}_1 = 5$, $\hat{\theta}_2 = 3$ with asymptotic variance
estimates 4 and 2 and the correlation coefficient between $\hat{\theta}_1$ and $\hat{\theta}_2$ equals 0.5.
Assume asymptotic normality of the parameter estimates.

(a)  Test $H_0: \theta_1 e^{\theta_2} = 100$ against $H_a: \theta_1 e^{\theta_2} \neq 100$ at level 0.05.

(b)  Obtain a 95% confidence interval for $\gamma = \theta_1 e^{\theta_2}$.

7-2  Consider NLS regression for the model $y = \exp(\alpha + \beta x) + \varepsilon$, where $\alpha$, $\beta$, and
x are scalars and $\varepsilon \sim \mathcal{N}[0, 1]$. Note that for simplicity $\sigma^2_{\varepsilon} = 1$ and need not be
estimated. We want to test $H_0: \beta = 0$ against $H_a: \beta \neq 0$.

(a)  Give the first-order conditions for the unrestricted MLE of $\alpha$ and $\beta$.

(b)  Give the asymptotic variance matrix for the unrestricted MLE of $\alpha$ and $\beta$.

(c)  Give the explicit solution for the restricted MLE of $\alpha$ and $\beta$.

(d)  Give the auxiliary regression to compute the OPG form of the LM test.

(e)  Give the complete expression for the original form of the LM test. Note that
it involves derivatives of the unrestricted log-likelihood evaluated at the restricted
MLE of $\alpha$ and $\beta$. [This is more difficult than parts (a)-(d).]

7-3  Suppose we wish to choose between two nested parametric models. The relationship
between the densities of the two models is that $g(y|x, \beta, \alpha = 0) = f(y|x, \beta)$,
where for simplicity both $\beta$ and $\alpha$ are scalars. If g is the correct density then the
MLE of $\beta$ based on density f is inconsistent. A test of model f against model
g is a test of $H_0: \alpha = 0$ against $H_a: \alpha \neq 0$. Suppose ML estimation yields the
following results: (1) model f: $\hat{\beta} = 5.0$, $\mathrm{se}[\hat{\beta}] = 0.5$, and $\ln L = -106$; (2) model
g: $\hat{\beta} = 3.0$, $\mathrm{se}[\hat{\beta}] = 1.0$, $\hat{\alpha} = 2.5$, $\mathrm{se}[\hat{\alpha}] = 1.0$, and $\ln L = -103$. Not all of the
following tests are possible given the preceding information. If there is enough
information, perform the tests and state your conclusions. If there is not enough
information, then state this.

(a)  Perform  a  Wald  test  of  H0  at  level  0.05. 

(b)  Perform  a  Lagrange  multiplier  test  of  H0  at  level  0.05. 

(c)  Perform  a  likelihood  ratio  test  of  H0  at  level  0.05. 

(d)  Perform  a  Hausman  test  of  H0  at  level  0.05. 

7-4  Consider test of $H_0: \mu = 0$ against $H_a: \mu \neq 0$ at nominal size 0.05 when the
dgp is $y \sim \mathcal{N}[\mu, 100]$, so the standard deviation is 10, and the sample size is
N = 10. The test statistic is the usual t-test statistic $t = \hat{\mu}/(s/\sqrt{10})$, where $s^2 =
(1/9)\sum_i (y_i - \bar{y})^2$. Perform 1,000 simulations to answer the following.

(a)  Obtain the actual size of the t-test if the correct finite-sample critical values
$\pm t_{.025}(8) = \pm 2.306$ are used. Is there size distortion?

(b)  Obtain the actual size of the t-test if the asymptotic approximation critical
values $\pm z_{.025} = \pm 1.960$ are used. Is there size distortion?

(c)  Obtain the power of the t-test against the alternative $H_a: \mu = 1$, when the
critical values $\pm t_{.025}(8) = \pm 2.306$ are used. Is the test powerful against this
particular alternative?

7-5  Use  the  health  expenditure  data  of  Section  16.6.  The  model  is  a  probit  regression 
of  DMED,  an  indicator  variable  for  positive  health  expenditures,  against  the  17 
regressors listed in the second paragraph of Section 16.6. You should obtain the
estimates given in the first column of Table 16.1. Consider joint test of the statistical
significance of the self-rated health indicators HLTHG, HLTHF, and HLTHP at
level 0.05.

(a)  Perform  a  Wald  test. 

(b)  Perform  a  likelihood  ratio  test. 

(c)  Perform  an  auxiliary  regression  to  implement  an  LM  test.  [This  will  require 
some  additional  coding.] 

Specification Tests and Model Selection


8.1.  Introduction 

Two  important  practical  aspects  of  microeconometric  modeling  are  determining 
whether  a  model  is  correctly  specified  and  selecting  from  alternative  models.  For  these 
purposes it is often possible to use the hypothesis testing methods presented in the previous
chapter, especially when models are nested. In this chapter we present several
other  methods. 

First, m-tests such as conditional moment tests are tests of whether moment conditions
imposed by a model are satisfied. The approach is similar in spirit to GMM,
except  that  the  moment  conditions  are  not  imposed  in  estimation  and  are  instead  used 
for  testing.  Such  tests  are  conceptually  very  different  from  the  hypothesis  tests  of 
Chapter  7,  as  there  is  no  explicit  statement  of  an  alternative  hypothesis  model. 

Second,  Hausman  tests  are  tests  of  the  difference  between  two  estimators  that  are 
both  consistent  if  the  model  is  correctly  specified  but  diverge  if  the  model  is  incorrectly 
specified. 

Third, tests of nonnested models require special methods because the usual hypothesis
testing approach can only be applied when one model is nested within another.

Finally,  it  can  be  useful  to  compute  and  report  statistics  of  model  adequacy  that  are 
not test statistics. For example, an analogue of $R^2$ may be used to measure the goodness
of fit of a nonlinear model.

Ideally,  these  methods  are  used  in  a  cycle  of  model  specification,  estimating,  testing, 
and  evaluation.  This  cycle  can  move  from  a  general  model  toward  a  specific  model,  or 
from  a  specific  model  to  a  more  general  one  that  is  felt  to  capture  the  most  important 
features  of  the  data. 

Section  8.2  presents  m-tests,  including  conditional  moment  tests,  the  information 
matrix  test,  and  chi-square  goodness  of  fit  tests.  The  Hausman  test  is  presented  in 
Section  8.3.  Tests  for  several  common  misspecifications  are  discussed  in  Section  8.4. 
Discrimination between nonnested models is the focus of Section 8.5. Commonly used
convenient implementations of the tests of Sections 8.2-8.5 can rely on strong distributional
assumptions and/or perform poorly in finite samples. These concerns have discouraged use
of some of these tests, but such concerns are outdated because in many cases the bootstrap
methods presented in Chapter 11 can correct for these weaknesses. Section 8.6
considers the consequences of testing a model on subsequent inference. Model diagnostics
are presented in the stand-alone Section 8.7.


8.2.  m-Tests

m-Tests, such as conditional moment tests, are a general specification testing procedure
that encompasses many common specification tests. The tests are easily implemented
using auxiliary regressions when estimation is by ML, a situation where tests
of model assumptions are especially desirable. Implementation is usually more difficult
when estimators are instead based on minimal distributional assumptions.

We  first  introduce  the  test  statistic  and  computational  methods,  followed  by  leading 
examples  and  an  illustration  of  the  tests. 

8.2.1.  m-Test  Statistic 

Suppose  a  model  implies  the  population  moment  condition 

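    $H_0: \mathrm{E}[\mathbf{m}_i(\mathbf{w}_i, \boldsymbol{\theta})] = \mathbf{0}$,    (8.1)

where $\mathbf{w}$ is a vector of observables, usually the dependent variable y and regressors
$\mathbf{x}$ and sometimes additional variables $\mathbf{z}$, $\boldsymbol{\theta}$ is a $q \times 1$ vector of parameters, and $\mathbf{m}_i(\cdot)$
is an $h \times 1$ vector. A simple example is that $\mathrm{E}[(y - \mathbf{x}'\boldsymbol{\beta})\mathbf{z}] = \mathbf{0}$ if $\mathbf{z}$ can be omitted in
the linear model $y = \mathbf{x}'\boldsymbol{\beta} + u$. Especially for fully parametric models there are many
candidates for $\mathbf{m}_i(\cdot)$.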

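An m-test is a test of the closeness to zero of the corresponding sample moment

    $\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = N^{-1} \sum_{i=1}^{N} \mathbf{m}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$.    (8.2)

This approach is similar to that for the Wald test, where $\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}$ is tested by testing
the closeness to zero of $\mathbf{h}(\hat{\boldsymbol{\theta}})$.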

A  test  statistic  is  obtained  by  a  method  similar  to  that  detailed  in  Section  7.2.4  for 
the  Wald  test.  In  Section  8.2.3  it  is  shown  that  if  (8.1)  holds  then 

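    $\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_m]$,    (8.3)

where $\mathbf{V}_m$ defined later in (8.10) is more complicated than in the case of the Wald test
because $\mathbf{m}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$ has two sources of stochastic variation as both $\mathbf{w}_i$ and $\hat{\boldsymbol{\theta}}$ are random.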

A  chi-square  test  statistic  can  then  be  obtained  by  taking  the  corresponding 
quadratic  form.  Thus  the  m-test  statistic  for  (8.1)  is 

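    $M = N\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})'\,\hat{\mathbf{V}}_m^{-1}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})$,    (8.4)

which is asymptotically $\chi^2(\mathrm{rank}[\mathbf{V}_m])$ distributed if the moment conditions (8.1) are
correct. An m-test rejects the moment conditions (8.1) at significance level $\alpha$ if $M >
\chi^2_{\alpha}(h)$ and does not reject otherwise.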
A complication is that $V_m$ may not be of full rank h. For example, this is the case
if the estimator $\hat{\theta}$ itself sets a linear combination of components of $\hat{m}_N(\hat{\theta})$ to 0. In
some cases, such as the OIR test, $\hat{V}_m$ is still of full rank and M can be computed but
the chi-square test statistic has only $\mathrm{rank}[V_m]$ degrees of freedom. In other cases $\hat{V}_m$
itself is not of full rank. Then it is simplest to drop $(h - \mathrm{rank}[V_m])$ of the moment
conditions and perform an m-test using just this subset of the moment conditions.
Alternatively, the full set of moment conditions can be used, but $\hat{V}_m^{-1}$ in (8.4) is replaced
by $\hat{V}_m^{-}$, the generalized inverse of $\hat{V}_m$. The Moore-Penrose generalized inverse $V^{-}$
of a matrix $V$ satisfies $VV^{-}V = V$, $V^{-}VV^{-} = V^{-}$, $(VV^{-})' = VV^{-}$, and $(V^{-}V)' = V^{-}V$.
When $V_m$ is less than full rank then strictly speaking (8.3) no longer holds,
since the multivariate normal requires full rank $V_m$, but (8.4) still holds given these
adjustments.
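The generalized-inverse adjustment can be illustrated numerically. The following sketch is not from the text: it assumes NumPy, uses `numpy.linalg.pinv` for the Moore-Penrose inverse, and the matrix `V`, the sample moment `m_bar`, the sample size `N`, and the helper name `satisfies_penrose` are all made up for illustration.

```python
import numpy as np

def satisfies_penrose(V, G, atol=1e-10):
    """Check the four Moore-Penrose conditions for a candidate inverse G of V."""
    return bool(
        np.allclose(V @ G @ V, V, atol=atol)
        and np.allclose(G @ V @ G, G, atol=atol)
        and np.allclose((V @ G).T, V @ G, atol=atol)
        and np.allclose((G @ V).T, G @ V, atol=atol)
    )

# Hypothetical rank-deficient variance estimate (h = 2, rank 1).
V = np.array([[2.0, 2.0],
              [2.0, 2.0]])
V_ginv = np.linalg.pinv(V)            # Moore-Penrose generalized inverse

# Hypothetical sample moment vector and sample size.
m_bar = np.array([0.1, 0.1])
N = 100

# Quadratic form (8.4) with the generalized inverse replacing the inverse.
M = N * m_bar @ V_ginv @ m_bar
dof = np.linalg.matrix_rank(V)        # degrees of freedom = rank of V
```

Here the chi-square reference distribution would have `dof` degrees of freedom rather than h.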

The  m-test  approach  is  conceptually  very  simple.  The  moment  restriction  (8.1)  is 
rejected  if  a  quadratic  form  in  the  sample  estimate  (8.2)  is  far  enough  from  zero.  The 
challenges  are  in  calculating  M  since  Vm  can  be  quite  complex  (see  Section  8.2.2), 
selecting moments $m(\cdot)$ to test (see Sections 8.2.3-8.2.6 for leading examples), and
interpreting  reasons  for  rejection  of  (8.1)  (see  Section  8.2.8). 


8.2.2.  Computation  of  the  m-Statistic 

There  are  several  ways  to  compute  the  m-statistic. 

First, one can always directly compute $\hat{V}_m$, and hence M, using the consistent
estimates of the components of $V_m$ given in Section 8.2.3. Most practitioners shy away
from this approach as it entails matrix computations.

Second, the bootstrap can always be used (see Section 11.6.3), since the bootstrap
can provide an estimate of $V_m$ that controls for all sources of variation in
$\hat{m}_N(\hat{\theta}) = N^{-1} \sum_i m_i(w_i, \hat{\theta})$.

Third, in some cases auxiliary regressions similar to those for the LM test given
in Section 7.3.5 can be run to compute asymptotically equivalent versions of M that
do not require computation of $V_m$. These auxiliary regressions may in turn be
bootstrapped to obtain an asymptotic refinement (see Section 11.6.3). We present several
leading auxiliary regressions.


Auxiliary  Regressions  Using  the  ML  Estimator 

Model specification tests are especially desirable when inference is done within the
likelihood framework, as in general any misspecification of the density can lead to
inconsistency of the MLE. Fortunately, an m-test is easily implemented when estimation
is by maximum likelihood.

Specifically, when $\hat{\theta}$ is the MLE, generalizing the LM test result of Section 7.3.5
(see Section 8.2.3) shows that an asymptotically equivalent version of the m-test is obtained
from the auxiliary regression

$$1 = \hat{m}_i'\delta + \hat{s}_i'\gamma + u_i, \tag{8.5}$$

where $\hat{m}_i = m_i(y_i, x_i, \hat{\theta}_{ML})$, $\hat{s}_i = \partial \ln f(y_i|x_i, \theta)/\partial\theta\,|_{\hat{\theta}_{ML}}$ is the contribution of the
ith observation to the score, and $f(y_i|x_i, \theta)$ is the conditional density function, by
calculating

$$M^* = N R_u^2, \tag{8.6}$$

where $R_u^2$ is the uncentered $R^2$ defined at the end of Section 7.3.5. Equivalently, $M^*$
equals $\mathrm{ESS}_u$, the uncentered explained sum of squares (the sum of squares of the fitted
values) from regression (8.5), or $M^*$ equals $N - \mathrm{RSS}$, where RSS is the residual sum
of squares from regression (8.5). $M^*$ is asymptotically $\chi^2(h)$ under $H_0$.

The test statistic $M^*$ is called the outer product of the gradient (OPG) form of the m-test,
and it is a generalization of the auxiliary regression for the LM test (see Section 7.3.5).
Although the OPG form can be easily computed, it has poor small-sample properties
with large size distortions. Similar to the LM test, however, these small-sample
problems can be greatly reduced by using bootstrap methods (see Section 11.6.3).
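The computation in (8.5) and (8.6) can be sketched in a few lines. This is an illustrative sketch assuming NumPy, not code from the text; the function names `uncentered_nr2` and `opg_m_test` are hypothetical, and the inputs are assumed to be an N x h array of moments and an N x q array of score contributions, both evaluated at the MLE.

```python
import numpy as np

def uncentered_nr2(Z):
    """N times the uncentered R^2 from regressing a column of ones on Z."""
    Z = np.asarray(Z, dtype=float)
    ones = np.ones(Z.shape[0])
    coef, *_ = np.linalg.lstsq(Z, ones, rcond=None)
    fitted = Z @ coef
    # The dependent variable is 1, so the uncentered total sum of squares is N
    # and N * R_u^2 equals the uncentered explained sum of squares.
    return float(fitted @ fitted)

def opg_m_test(m_hat, s_hat):
    """OPG statistic M* from auxiliary regression (8.5): 1 on [m_hat, s_hat]."""
    return uncentered_nr2(np.column_stack([m_hat, s_hat]))
```

The returned value is compared with a chi-square critical value with h degrees of freedom; because of the size distortions noted above, bootstrap critical values may be preferable in small samples.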

The test statistic $M^*$ may also be appropriate in some non-ML settings. The auxiliary
regression is applicable whenever $E[\partial m/\partial\theta'] = -E[ms']$ (see Section 8.2.3). By
the generalized IM equality (see Section 5.6.3), this condition holds for the MLE when
expectation is with respect to the specified density $f(\cdot)$. It can also hold under weaker
distributional assumptions in some cases.


Auxiliary Regressions When $E[\partial m/\partial\theta'] = 0$

In some applications $m_i(w_i, \theta)$ satisfies

$$E\left[\partial m_i(w_i, \theta)/\partial\theta'\,\big|_{\theta_0}\right] = 0, \tag{8.7}$$

in addition to (8.1).

Then it can be shown that the asymptotic distribution of $\sqrt{N}\hat{m}_N(\hat{\theta})$ is the same
as that of $\sqrt{N}\hat{m}_N(\theta_0)$, so $V_m = \mathrm{plim}\, N^{-1} \sum_i m_{i0} m_{i0}'$, which can be consistently
estimated by $\hat{V}_m = N^{-1} \sum_i \hat{m}_i \hat{m}_i'$. The test statistic can be computed in a similar manner
to (8.5), except the auxiliary regression is more simply

$$1 = \hat{m}_i'\delta + u_i, \tag{8.8}$$

with test statistic $M^{**}$ equal to N times the uncentered $R^2$.

This auxiliary regression is valid for any root-N consistent estimator $\hat{\theta}$, not just
the MLE, provided (8.7) holds. The condition (8.7) is met in a few examples; see
Section 8.2.9 for an example.

Even  if  (8.7)  does  not  hold  the  simpler  regression  (8.8)  might  still  be  run  as  a  guide, 
as  it  places  a  lower  bound  on  the  correct  value  of  M,  the  m-test  statistic.  If  this  simpler 
regression  leads  to  rejection  then  (8.1)  is  certainly  rejected. 


Other  Auxiliary  Regressions 

Alternative auxiliary regressions to (8.5) and (8.8) are possible if $m(y, x, \theta)$ and
$s(y, x, \theta)$ can be appropriately factorized.

First, if $s(y, x, \theta) = g(x, \theta)\,r(y, x, \theta)$ and $m(y, x, \theta) = h(x, \theta)\,r(y, x, \theta)$ for some
common scalar function $r(\cdot)$ with $V[r(y, x, \theta)] = 1$ and estimation is by ML, then an
asymptotically equivalent regression to (8.5) is $N R_u^2$ from regression of $\hat{r}_i$ on $\hat{g}_i$ and $\hat{h}_i$.

Second, if $m(y, x, \theta) = h(x, \theta)\,v(y, x, \theta)$ for some scalar function $v(\cdot)$ with
$V[v(y, x, \theta)] = 1$ and $E[\partial m/\partial\theta'] = 0$, then an asymptotically equivalent regression
to (8.8) is $N R_u^2$ from regression of $\hat{v}_i$ on $\hat{h}_i$. For further details see Wooldridge (1991).

Additional  auxiliary  regressions  exist  in  special  settings.  Examples  are  given  in 
Section  8.4,  and  White  (1994)  gives  a  quite  general  treatment. 


8.2.3.  Derivations  for  the  m-Test  Statistic 


To  avoid  the  need  to  compute  Vm,  the  variance  matrix  in  (8.3),  m-tests  are  usually 
implemented  using  auxiliary  regressions  or  bootstrap  methods.  For  completeness  this 
section  derives  the  actual  expression  for  Vm  and  provides  justification  for  the  auxiliary 
regressions  (8.5)  and  (8.8). 

The key is obtaining the distribution of $\hat{m}_N(\hat{\theta})$ defined in (8.2). This is complicated
because $\hat{m}_N(\hat{\theta})$ is stochastic for two reasons: the random variables $w_i$ and evaluation
at the estimator $\hat{\theta}$.

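Assume that $\hat{\theta}$ is an m-estimator or estimating equations estimator that solves

$$\frac{1}{N} \sum_{i=1}^{N} s_i(w_i, \hat{\theta}) = 0, \tag{8.9}$$

for some function $s(\cdot)$, here not necessarily $\partial \ln f(y|x, \theta)/\partial\theta$, and make the usual
cross-section assumption that data are independent over i. Then we shall show that
$\sqrt{N}\hat{m}_N(\hat{\theta}) \xrightarrow{d} \mathcal{N}[0, V_m]$, as in (8.3), where

$$V_m = H_0 J_0 H_0', \tag{8.10}$$

the $h \times (h + q)$ matrix

$$H_0 = \begin{bmatrix} I_h & -C_0 A_0^{-1} \end{bmatrix}, \tag{8.11}$$

where $C_0 = \mathrm{plim}\, N^{-1} \sum_i \partial m_{i0}/\partial\theta'$ and $A_0 = \mathrm{plim}\, N^{-1} \sum_i \partial s_{i0}/\partial\theta'$, and the
$(h + q) \times (h + q)$ matrix

$$J_0 = \mathrm{plim}\, N^{-1} \begin{bmatrix} \sum_{i=1}^{N} m_{i0} m_{i0}' & \sum_{i=1}^{N} m_{i0} s_{i0}' \\ \sum_{i=1}^{N} s_{i0} m_{i0}' & \sum_{i=1}^{N} s_{i0} s_{i0}' \end{bmatrix}, \tag{8.12}$$

where $m_{i0} = m_i(w_i, \theta_0)$ and $s_{i0} = s_i(w_i, \theta_0)$.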

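To derive (8.10), take a first-order Taylor series expansion around $\theta_0$ to obtain

$$\sqrt{N}\hat{m}_N(\hat{\theta}) = \sqrt{N}\hat{m}_N(\theta_0) + \frac{\partial \hat{m}_N(\theta_0)}{\partial\theta'} \sqrt{N}(\hat{\theta} - \theta_0) + o_p(1). \tag{8.13}$$

For $\hat{\theta}$ defined in (8.9) this implies that

$$\sqrt{N}\hat{m}_N(\hat{\theta}) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} m_i(\theta_0) - C_0 A_0^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_{i0} + o_p(1), \tag{8.14}$$

where we use $\hat{m}_N(\theta_0) = N^{-1} \sum_i m_i(\theta_0)$, $\partial\hat{m}_N/\partial\theta' = N^{-1} \sum_i \partial m_i/\partial\theta' \xrightarrow{p} C_0$, and
$\sqrt{N}(\hat{\theta} - \theta_0)$ has the same limit distribution as $-A_0^{-1} N^{-1/2} \sum_i s_{i0}$ by applying the usual
first-order Taylor series expansion to (8.9). Equation (8.14) can be written as

$$\sqrt{N}\hat{m}_N(\hat{\theta}) = \begin{bmatrix} I_h & -C_0 A_0^{-1} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} m_{i0} \\ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_{i0} \end{bmatrix} + o_p(1). \tag{8.15}$$

Equation (8.10) follows by application of the limit normal product rule
(Theorem A.17) as the second term in the product in (8.15) has limit normal distribution
under $H_0$ with mean 0 and variance $J_0$.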

To compute M in (8.4), a consistent estimate $\hat{V}_m$ for $V_m$ can be obtained by replacing
each component of $V_m$ by a consistent estimate. For example, $C_0$ can be consistently
estimated by $\hat{C} = N^{-1} \sum_i \partial m_i/\partial\theta'\,|_{\hat{\theta}}$, and so on. Although this can always be
done, using auxiliary regressions is easier when they are available.

First, consider the auxiliary regression (8.5) when $\hat{\theta}$ is the MLE. By the generalized
IM equality (see Section 5.6.3) $E[\partial m_{i0}/\partial\theta'] = -E[m_{i0} s_{i0}']$, where for the MLE we
specialize to $s_i = \partial \ln f(y_i|x_i, \theta)/\partial\theta$. Considerable simplification occurs since then
$C_0 = -\mathrm{plim}\, N^{-1} \sum_i m_{i0} s_{i0}'$ and $A_0 = -\mathrm{plim}\, N^{-1} \sum_i s_{i0} s_{i0}'$, which also appear in the
$J_0$ matrix. This leads to the OPG form of the test. For further details see Newey (1985)
or Pagan and Vella (1989).

Second, for the auxiliary regression (8.8), note that if $E[\partial m_{i0}/\partial\theta'] = 0$ then $C_0 = 0$,
so $H_0 = [I_h \;\; 0]$ and hence $H_0 J_0 H_0' = \mathrm{plim}\, N^{-1} \sum_i m_{i0} m_{i0}'$.
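For readers who prefer the direct route, the sample analogue of (8.10)-(8.12) can be assembled as follows. This is a sketch under stated assumptions rather than the authors' code: it assumes NumPy, that the user supplies estimates `C_hat` (h x q) and `A_hat` (q x q) of $C_0$ and $A_0$, and that `m_hat` and `s_hat` hold the moments and estimating-equation contributions evaluated at the estimate. The function name `m_test_variance` is hypothetical.

```python
import numpy as np

def m_test_variance(m_hat, s_hat, C_hat, A_hat):
    """Estimate V_m = H J H' with H = [I_h, -C A^{-1}], as in (8.10)-(8.12)."""
    m_hat = np.asarray(m_hat, dtype=float)   # N x h moments
    s_hat = np.asarray(s_hat, dtype=float)   # N x q estimating-equation terms
    C_hat = np.asarray(C_hat, dtype=float)   # h x q
    A_hat = np.asarray(A_hat, dtype=float)   # q x q
    N, h = m_hat.shape
    stacked = np.column_stack([m_hat, s_hat])          # N x (h + q)
    J_hat = stacked.T @ stacked / N                    # outer-product matrix (8.12)
    H_hat = np.column_stack([np.eye(h), -C_hat @ np.linalg.inv(A_hat)])  # (8.11)
    return H_hat @ J_hat @ H_hat.T
```

When `C_hat` is a zero matrix the result reduces to the average outer product of the moments, which is the case underlying regression (8.8).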


8.2.4.  Conditional  Moment  Tests 

Conditional  moment  tests,  due  to  Newey  (1985)  and  Tauchen  (1985),  are  m-tests  of 
unconditional  moment  restrictions  that  are  obtained  from  an  underlying  conditional 
moment  restriction. 

As an example, consider the linear regression model $y = x'\beta + u$. A standard
assumption for consistency of the OLS estimator is that the error has conditional mean
zero, or equivalently the conditional moment restriction

$$E[y - x'\beta\,|\,x] = 0. \tag{8.16}$$

In Chapter 6 we considered using some of the implied unconditional moment
restrictions as the basis of MM or GMM estimation. In particular (8.16) implies that
$E[x(y - x'\beta)] = 0$. Solving the corresponding sample moment condition $\sum_i x_i(y_i - x_i'\beta) = 0$
leads to the OLS estimator for $\beta$. However, (8.16) implies many other moment
conditions that are not used in estimation. Consider the unconditional moment restriction

$$E[g(x)(y - x'\beta)] = 0,$$

where the vector $g(x)$ should differ from x, already used in OLS estimation. For
example, $g(x)$ may contain the squares and cross-products of the components of the
regressor vector x. This suggests a test based on whether or not the corresponding sample
moment $\hat{m}_N(\hat{\beta}) = N^{-1} \sum_i g(x_i)(y_i - x_i'\hat{\beta})$ is close to zero.
More generally, consider the conditional moment restriction

$$E[r(y, x, \theta)\,|\,x] = 0, \tag{8.17}$$

for some scalar function $r(\cdot)$. The conditional moment (CM) test is an m-test based
on the implied unconditional moment restrictions

$$E[g(x)\,r(y, x, \theta)] = 0, \tag{8.18}$$

where $g(x)$ and/or $r(y, x, \theta)$ are chosen so that these restrictions are not already used
in estimation.

Likelihood-based models lead to many potential restrictions. For less than fully
parametric models examples of $r(y, x, \theta)$ include $y - \mu(x, \theta)$, where $\mu(\cdot)$ is the
specified conditional mean function, and $(y - \mu(x, \theta))^2 - \sigma^2(x, \theta)$, where $\sigma^2(x, \theta)$ is a
specified conditional variance function.


8.2.5.  White’s  Information  Matrix  Test 


For  ML  estimation  the  information  matrix  equality  implies  moment  restrictions  that 
may  be  used  in  an  m-test,  as  they  are  usually  not  imposed  in  obtaining  the  MLE. 
Specifically, from Section 5.6.3 the IM equality implies

$$E[\mathrm{Vech}[D_i(y_i, x_i, \theta_0)]] = 0, \tag{8.19}$$

where the $q \times q$ matrix $D_i$ is given by

$$D_i(y_i, x_i, \theta_0) = \frac{\partial^2 \ln f_i}{\partial\theta\,\partial\theta'} + \frac{\partial \ln f_i}{\partial\theta} \frac{\partial \ln f_i}{\partial\theta'}, \tag{8.20}$$

and the expectation is taken with respect to the assumed conditional density
$f_i = f(y_i|x_i, \theta)$. Here Vech is the vector-half operator that stacks the columns of the
matrix $D_i$ in the same way as the Vec operator, except that only the $q(q + 1)/2$ unique
elements of the symmetric matrix $D_i$ are stacked.

White (1982) proposed the information matrix test of whether the corresponding
sample moment

$$\hat{d}_N(\hat{\theta}) = N^{-1} \sum_{i=1}^{N} \mathrm{Vech}[D_i(y_i, x_i, \hat{\theta}_{ML})] \tag{8.21}$$

is close to zero. Using (8.4) the IM test statistic is

$$IM = N\,\hat{d}_N(\hat{\theta})'\,\hat{V}^{-1}\,\hat{d}_N(\hat{\theta}), \tag{8.22}$$

where the expression for $\hat{V}$ given in White (1982) is quite complicated. A much easier
way to implement the test, due to Lancaster (1984) and Chesher (1984), is to use the
auxiliary regression (8.5), which is applicable since the MLE is used in (8.21).

The IM test can also be applied to a subset of the restrictions in (8.19). This should
be done if q is large as then the number of restrictions $q(q + 1)/2$ being tested is very
large.

Large values of the IM test statistic lead to rejection of the restrictions of the
IM equality and the conclusion that the density is incorrectly specified. In general
this means that the ML estimator is inconsistent. In some special cases, detailed in
Section 5.7, the MLE may still be consistent though standard errors need then to be
based on the sandwich form of the variance matrix.


8.2.6.  Chi-Square  Goodness-of-Fit  Test 


A useful specification test for fully parametric models is to compare predicted
probabilities with sample relative frequencies. The model is a poor one if these differ
considerably.

Begin with discrete iid random variable y that can take one of J possible values
with probabilities $p_1, p_2, \ldots, p_J$, $\sum_{j=1}^{J} p_j = 1$. The correct specification of the
probabilities can be tested by testing the equality of theoretical frequencies $N p_j$ to the
observed frequencies $N \bar{p}_j$, where $\bar{p}_j$ is the fraction of the sample that takes the jth
possible value. The Pearson chi-square goodness-of-fit test (PCGF) statistic is

$$PCGF = \sum_{j=1}^{J} \frac{(N \bar{p}_j - N p_j)^2}{N p_j}. \tag{8.23}$$

This statistic is asymptotically $\chi^2(J - 1)$ distributed under the null hypothesis that the
probabilities $p_1, p_2, \ldots, p_J$ are correct. The test can be extended to permit the
probabilities to be predicted from regression (see Exercise 8.2). Consider a multinomial
model for discrete y with probabilities $p_{ij} = F_j(x_i, \theta)$. Then $p_j$ in (8.23) is replaced
by $\hat{p}_j = N^{-1} \sum_i F_j(x_i, \hat{\theta})$ and if $\hat{\theta}$ is the multinomial MLE we again get a chi-square
distribution, but with reduced number of degrees of freedom $(J - \dim(\theta) - 1)$ resulting
from the estimation of $\theta$ (see Andrews, 1988a).
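As a small illustration of (8.23) for the iid case with known probabilities, the statistic can be computed from cell counts as below. The sketch assumes NumPy; the function name `pearson_gof` and the example counts are hypothetical and are not taken from the text.

```python
import numpy as np

def pearson_gof(counts, probs):
    """Pearson statistic (8.23) from observed cell counts and null probabilities."""
    counts = np.asarray(counts, dtype=float)   # observed frequencies N * p_bar_j
    probs = np.asarray(probs, dtype=float)     # hypothesized probabilities p_j
    N = counts.sum()
    expected = N * probs                       # theoretical frequencies N * p_j
    stat = float(np.sum((counts - expected) ** 2 / expected))
    dof = len(probs) - 1                       # J - 1 when the p_j are known
    return stat, dof

# Hypothetical example: 100 draws over two equally likely values.
stat, dof = pearson_gof([60, 40], [0.5, 0.5])
```

When the probabilities are instead predicted from an estimated regression model, this simple degrees-of-freedom count no longer applies, as discussed above.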

For regression models other than multinomial models, the statistic PCGF in (8.23)
can be computed by grouping y into cells, but the statistic PCGF is then no longer
chi-square distributed. Instead, a closely related m-test statistic is used. To derive this
statistic, break the range of y into J mutually exclusive cells, where the J cells span
all possible values of y. Let $d_{ij}(y_i)$ be an indicator variable equal to one if $y_i \in$ cell
j and equal to zero otherwise. Let $p_{ij}(x_i, \theta) = \int_{y_i \in \text{cell } j} f(y_i|x_i, \theta)\,dy_i$ be the predicted
probability that observation i falls in cell j, where $f(y|x, \theta)$ is the conditional density
of y and to begin with we assume the parameter vector $\theta$ is known. If the conditional
density is correctly specified, then

$$E[d_{ij}(y_i) - p_{ij}(x_i, \theta)] = 0, \quad j = 1, \ldots, J. \tag{8.24}$$

Stacking all J moments in obvious vector notation, we have

$$E[d_i(y_i) - p_i(x_i, \theta)] = 0, \tag{8.25}$$

where $d_i$ and $p_i$ are $J \times 1$ vectors with jth entries $d_{ij}$ and $p_{ij}$. This suggests an m-test
of the closeness to zero of the corresponding sample moment

$$\widehat{dp}_N(\hat{\theta}) = N^{-1} \sum_{i=1}^{N} (d_i(y_i) - p_i(x_i, \hat{\theta})), \tag{8.26}$$

which is the difference between the vector of sample relative frequencies $N^{-1} \sum_i d_i$
and the vector of predicted frequencies $N^{-1} \sum_i \hat{p}_i$. Using (8.4) we obtain the
chi-square goodness-of-fit (CGF) test statistic of Andrews (1988a, 1988b):

$$CGF = N\,\widehat{dp}_N(\hat{\theta})'\,\hat{V}^{-1}\,\widehat{dp}_N(\hat{\theta}), \tag{8.27}$$

where the expression for $\hat{V}$ is quite complicated. The CGF test statistic is easily
computed using the auxiliary regression (8.5), with $\hat{m}_i = d_i - \hat{p}_i$. This auxiliary regression
is appropriate here because a fully parametric model is being tested and so $\hat{\theta}$ will be
the MLE.

One of the categories needs to be dropped because of the restriction that
probabilities sum to one, yielding a test statistic that is asymptotically $\chi^2(J - 1)$ under the
null hypothesis that $f(y|x, \theta)$ is correctly specified. Further categories may need to
be dropped in some special cases, such as the multinomial example already discussed
after (8.23). In addition to reporting the calculated test statistic it can be informative to
report the components of $N^{-1} \sum_i d_i$ and $N^{-1} \sum_i \hat{p}_i$.

The  relevant  asymptotic  theory  is  provided  by  Andrews  (1988a),  with  a  simpler 
presentation  and  several  applications  given  in  Andrews  (1988b).  For  simplicity  we 
presented  cells  determined  by  the  range  of  y,  but  the  partitioning  can  be  on  both  y 
and  x.  Cells  should  be  chosen  so  that  no  cell  has  only  a  few  observations.  For  further 
details  and  a  history  of  this  test  see  these  articles. 

For continuous random variable y in the iid case a more general test than the CGF
test  is  the  Kolmogorov  test;  this  uses  the  entire  distribution  of  y,  not  just  cells  formed 
from  y.  Andrews  (1997)  presents  a  regression  version  of  the  Kolmogorov  test,  but  it  is 
much  more  difficult  to  implement  than  the  CGF  test. 


8.2.7.  Test  of  Overidentifying  Restrictions 

Tests  of  overidentifying  assumptions  (see  Section  6.3.8)  are  examples  of  m-tests. 

In the notation of Chapter 6, the GMM estimator is based on the assumption that
$E[h(w_i, \theta_0)] = 0$. If the model is overidentified, then only q of these moment
restrictions are used in estimation, leading to $(r - q)$ linearly independent orthogonality
conditions, where $r = \dim[h(\cdot)]$, that can be used to form an m-test. Then we
use M in (8.4), where $\hat{m}_N = N^{-1} \sum_i h(w_i, \hat{\theta})$. As shown in Section 6.3.9, if $\hat{\theta}$ is
the optimal GMM estimator then $N \hat{m}_N(\hat{\theta})' \hat{S}_N^{-1} \hat{m}_N(\hat{\theta})$, where $\hat{S}_N = N^{-1} \sum_i \hat{h}_i \hat{h}_i'$, is
asymptotically $\chi^2(r - q)$ distributed. A more intuitive linear IV example is given in
Section 8.4.4.


8.2.8.  Power  and  Consistency  of  Conditional  Moment  Tests 

Because  there  is  no  explicit  alternative  hypothesis,  m-tests  differ  from  the  tests  of 
Chapter  7. 

Several authors have given examples where the IM test can be shown to be
equivalent to a conventional LM test of null against alternative hypotheses. Chesher (1984)
interpreted the IM test as a test for random parameter heterogeneity. For the linear
model under normality, A. Hall (1987) showed that subcomponents of the IM test
correspond to LM tests of heteroskedasticity, symmetry, and kurtosis. Cameron and
Trivedi (1998) give some additional examples and reference to results for the linear
exponential family.

More generally, m-tests can be interpreted in a conditional moment framework
as follows. Begin with an added variable test in a linear regression model. Suppose
we want to test whether $\beta_2 = 0$ in the model $y = x_1'\beta_1 + x_2'\beta_2 + u$. This is a test of
$H_0: E[y - x_1'\beta_1|x] = 0$ against $H_a: E[y - x_1'\beta_1|x] = x_2'\beta_2$. The most powerful test
of $H_0: \beta_2 = 0$ in regression of $y - x_1'\beta_1$ on $x_2$ is based on the efficient GLS estimator

$$\hat{\beta}_2 = \left[ \sum_{i=1}^{N} \frac{x_{2i} x_{2i}'}{\sigma_i^2} \right]^{-1} \sum_{i=1}^{N} \frac{x_{2i}(y_i - x_{1i}'\beta_1)}{\sigma_i^2},$$

where $\sigma_i^2 = V[y_i|x_i]$ under $H_0$ and independence over i is assumed. This test is
equivalent to a test based on the second sum alone, which is an m-test of

$$E\left[ \frac{x_{2i}(y_i - x_{1i}'\beta_1)}{\sigma_i^2} \right] = 0. \tag{8.28}$$

Reversing the process, we can interpret an m-test based on (8.28) as a CM test of
$H_0: E[y - x_1'\beta_1|x] = 0$ against $H_a: E[y - x_1'\beta_1|x] = x_2'\beta_2$. Also, an m-test based
on $E[x_2(y - x_1'\beta_1)] = 0$ can be interpreted as a CM test of $H_0: E[y - x_1'\beta_1|x] = 0$
against $H_a: E[y - x_1'\beta_1|x] = \sigma^2_{y|x} x_2'\beta_2$, where $\sigma^2_{y|x} = V[y|x]$ under $H_0$.

More generally, suppose we start with the conditional moment restriction

$$E[r(y_i, x_i, \theta)\,|\,x_i] = 0, \tag{8.29}$$

for some scalar function $r(\cdot)$. Then an m-test based on the unconditional moment
restriction

$$E[g(x_i)\,r(y_i, x_i, \theta)] = 0 \tag{8.30}$$

can be interpreted as a CM test with null and alternative hypotheses

$$H_0: E[r(y_i, x_i, \theta)\,|\,x_i] = 0, \qquad H_a: E[r(y_i, x_i, \theta)\,|\,x_i] = \sigma_i^2\, g(x_i)'\gamma, \tag{8.31}$$

where $\sigma_i^2 = V[r(y_i, x_i, \theta)\,|\,x_i]$ under $H_0$.

This approach gives a guide to the directions in which a CM test has power.
Although (8.30) suggests power is in the general direction of $g(x)$, from (8.31) a more
precise statement is that it is instead the direction of $g(x)$ multiplied by the variance
of $r(y, x, \theta)$. The distinction is important because in many cross-section applications this
variance is not constant across observations. For further details and references see
Cameron and Trivedi (1998), who call this a regression-based CM test. The approach
generalizes to vector $r(\cdot)$, though with more cumbersome algebra.

An m-test is a test of a finite number of moment conditions. It is therefore possible to
construct a dgp for which the underlying conditional moment condition, such as that in
(8.29), is false yet the moment conditions are satisfied. Then the CM test is inconsistent
as it fails to reject with probability one as $N \to \infty$. Bierens (1990) proposed a way
to specify $g(x)$ in (8.30) that ensures a consistent conditional moment test, for tests
of functional form in the nonlinear regression model where $r(y, x, \theta) = y - f(x, \theta)$.
Ensuring the consistency of the test does not, however, ensure that it will have high
power against particular alternatives.


8.2.9. m-Tests Example

To illustrate various m-tests we consider the Poisson regression model introduced in
Section 5.2, with Poisson density $f(y) = e^{-\mu}\mu^{y}/y!$ and $\mu = \exp(x'\beta)$.

We wish to test

$$H_0: E[m(y, x, \beta)] = 0,$$

for various choices of $m(\cdot)$. This test will be conducted under the assumption that the
dgp is indeed the specified Poisson density.


Auxiliary Regressions

Since estimation is by ML we can use the m-test statistic $M^*$ computed as N times the
uncentered $R^2$ from auxiliary regression (8.5), where

$$1 = m(y_i, x_i, \hat{\beta})'\delta + (y_i - \exp(x_i'\hat{\beta}))\,x_i'\gamma + u_i, \tag{8.32}$$

since $\hat{s} = \partial \ln f(y)/\partial\beta\,|_{\hat{\beta}} = (y - \exp(x'\hat{\beta}))x$ and $\hat{\beta}$ is the MLE. Under $H_0$ the test is
$\chi^2(\dim(m))$ distributed.

An alternative is the $M^{**}$ statistic from auxiliary regression

$$1 = m(y_i, x_i, \hat{\beta})'\delta + u_i. \tag{8.33}$$

This test is asymptotically equivalent to $M^*$ if $m(\cdot)$ is such that $E[\partial m/\partial\beta'] = 0$, but
otherwise it is not chi-squared distributed.


Moments  Tested 

Correct specification of the conditional mean function, that is, $E[y - \exp(x'\beta)|x] = 0$,
can be tested by an m-test of

$$E[(y - \exp(x'\beta))z] = 0,$$

where z may be a function of x. For the Poisson and other LEF models, z cannot
equal x because the first-order conditions for $\hat{\beta}_{ML}$ impose the restriction that
$\sum_i (y_i - \exp(x_i'\hat{\beta}))x_i = 0$, leading to M = 0 if z = x. Instead, z could include squares and
cross-products of the regressors.

Correct specification of the variance may also be tested, as the Poisson distribution
implies conditional mean-variance equality. Since $V[y|x] - E[y|x] = 0$, with $E[y|x] = \exp(x'\beta)$,
this suggests an m-test of

$$E[\{(y - \exp(x'\beta))^2 - \exp(x'\beta)\}x] = 0.$$

A variation instead tests

$$E[\{(y - \exp(x'\beta))^2 - y\}x] = 0,$$

as $E[y|x] = \exp(x'\beta)$. Then $m(\beta) = \{(y - \exp(x'\beta))^2 - y\}x$ has the property that
$E[\partial m/\partial\beta'] = 0$, so (8.7) holds and the alternative regression (8.33) yields an
asymptotically equivalent test to the regression (8.32).

A standard specification test for parametric models is the IM test. For the Poisson
density, D defined in (8.20) becomes $D(y, x, \beta) = \{(y - \exp(x'\beta))^2 - y\}xx'$, and we
test

$$E[\{(y - \exp(x'\beta))^2 - y\}\mathrm{Vech}[xx']] = 0.$$

Clearly for the Poisson example the IM test is a test of the first and second moment
conditions implied by the Poisson model, a result that holds more generally for LEF
models. The test statistic $M^{**}$ is asymptotically equivalent to $M^*$ since here $E[\partial m/\partial\beta'] = 0$.

The Poisson assumption can also be tested using a chi-square goodness-of-fit test.
For example, since few counts exceed three in the subsequent simulation example,
form four cells corresponding to y = 0, 1, 2, and 3 or more, where in implementing
the test the cell with y = 3 or more is dropped because probabilities sum to one.
So for $j = 0, \ldots, 2$ compute indicator $d_{ij} = 1$ if $y_i = j$ and $d_{ij} = 0$ otherwise and
compute predicted probability $\hat{p}_{ij} = e^{-\hat{\mu}_i}\hat{\mu}_i^{j}/j!$, where $\hat{\mu}_i = \exp(x_i'\hat{\beta})$. Then test

$$E[(d - p)] = 0,$$

where $d_i = [d_{i0}, d_{i1}, d_{i2}]'$ and $p_i = [p_{i0}, p_{i1}, p_{i2}]'$, by the auxiliary regression (8.32)
where $\hat{m}_i = d_i - \hat{p}_i$.
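The construction of the cell indicators and predicted Poisson probabilities for the goodness-of-fit test can be sketched as follows. This is an illustrative sketch assuming NumPy; the function name `poisson_cell_moments` and its argument `max_cell` are hypothetical, and the fitted means `mu_hat` are assumed to have been obtained separately from Poisson ML estimation.

```python
import numpy as np
from math import factorial

def poisson_cell_moments(y, mu_hat, max_cell=2):
    """Return m_i = d_i - p_i for cells y = 0, ..., max_cell, plus cell averages.

    The open-ended cell (y > max_cell) is omitted, since probabilities sum to one.
    """
    y = np.asarray(y)
    mu_hat = np.asarray(mu_hat, dtype=float)
    cells = np.arange(max_cell + 1)
    # Indicators d_ij = 1 if y_i = j.
    d = (y[:, None] == cells[None, :]).astype(float)
    # Predicted Poisson probabilities exp(-mu_i) * mu_i^j / j!.
    fact = np.array([factorial(int(j)) for j in cells], dtype=float)
    p = np.exp(-mu_hat)[:, None] * mu_hat[:, None] ** cells[None, :] / fact[None, :]
    return d - p, d.mean(axis=0), p.mean(axis=0)
```

The first returned array supplies the moment regressors for the auxiliary regression; the two vectors of averages are the actual and predicted frequencies that it can be informative to report alongside the test statistic.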


Simulation  Results 

Data were generated from a Poisson model with mean $E[y|x] = \exp(\beta_1 + \beta_2 x_2)$,
where $x_2 \sim \mathcal{N}[0, 1]$ and $(\beta_1, \beta_2) = (0, 1)$. Poisson ML regression of y on x for a
sample of size 200 yielded

$$\hat{E}[y|x] = \exp(\underset{(0.089)}{-0.165} + \underset{(0.069)}{1.124}\,x_2),$$

where associated standard errors are in parentheses.

The results of the various m-tests are given in Table 8.1.


Table 8.1. Specification m-Tests for Poisson Regression Example (a)

Test Type                H0 where $\mu = \exp(x'\beta)$                   M*     dof   p-value   M**
1. Correct mean          $E[(y - \mu)x_2^2] = 0$                          3.27   1     0.07      0.44
2. Variance = mean       $E[\{(y - \mu)^2 - \mu\}x] = 0$                  2.43   2     0.30      1.89
3. Variance = mean       $E[\{(y - \mu)^2 - y\}x] = 0$                    2.43   2     0.30      2.41
4. Information Matrix    $E[\{(y - \mu)^2 - y\}\mathrm{Vech}[xx']] = 0$   2.95   3     0.40      2.73
5. Chi-square GOF        $E[d - p] = 0$                                   2.50   3     0.48      0.75

(a) The dgp for y is the Poisson distribution with mean parameter $\exp(0 + x_2)$ and sample size N = 200. The
m-test statistic M* is chi-squared with degrees of freedom given in the dof column and p-value given in the
p-value column. The alternative test statistic M** is valid for tests 3 and 4 only.
As an example of computation of $M^*$ using (8.32) consider the IM test. Since
$x = [1, x_2]'$ and $\mathrm{Vech}[xx'] = [1, x_2, x_2^2]'$, the auxiliary regression is of 1 on $\{(y - \hat{\mu})^2 - y\}$,
$\{(y - \hat{\mu})^2 - y\}x_2$, $\{(y - \hat{\mu})^2 - y\}x_2^2$, $(y - \hat{\mu})$, and $(y - \hat{\mu})x_2$ and yields uncentered
$R^2 = 0.01473$; with N = 200 this gives $M^* = 200 \times 0.01473 = 2.95$. The same value of $M^*$ is obtained
directly from the uncentered explained sum of squares of 2.95, and indirectly as N
minus 197.05, the residual sum of squares from this regression. The test statistic is
$\chi^2(3)$ distributed with p = 0.40, so the null hypothesis is not rejected at significance
level 0.05.
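The IM test computation just described can be reproduced in outline on freshly simulated data. The sketch below assumes NumPy and is not the authors' code: the seed, the Newton-Raphson routine `poisson_mle`, and its starting value are choices made here for illustration, so the resulting value of M* will differ from the 2.95 reported in Table 8.1.

```python
import numpy as np

def poisson_mle(y, X, max_iter=100, tol=1e-10):
    """Newton-Raphson for the Poisson MLE; assumes column 0 of X is an intercept."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start from the intercept-only fit
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)               # score vector
        hess = (X * mu[:, None]).T @ X      # minus the Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulate from the dgp of the example: mean exp(0 + x2), N = 200.
rng = np.random.default_rng(12345)          # arbitrary seed (assumption)
N = 200
x2 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x2])
y = rng.poisson(np.exp(0.0 + 1.0 * x2))

beta_hat = poisson_mle(y, X)
mu_hat = np.exp(X @ beta_hat)
resid = y - mu_hat

score = resid[:, None] * X                              # (y - mu) x, N x 2
vech_xx = np.column_stack([np.ones(N), x2, x2 ** 2])    # Vech[xx'] = [1, x2, x2^2]'
m_im = (resid ** 2 - y)[:, None] * vech_xx              # IM moments, N x 3

# Auxiliary regression (8.32): 1 on the IM moments and the scores.
Z = np.column_stack([m_im, score])
coef, *_ = np.linalg.lstsq(Z, np.ones(N), rcond=None)
fitted = Z @ coef
M_star = float(fitted @ fitted)             # N * uncentered R^2 = ESS
```

The value `M_star` would be compared with a chi-square critical value with three degrees of freedom.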

For the chi-square goodness-of-fit test the actual frequencies are, respectively,
0.435, 0.255, and 0.110; and the corresponding predicted frequencies are 0.429, 0.241,
and 0.124. This yields PCGF = 0.47 using (8.23), but this statistic is not chi-squared
as it does not control for error in estimating $\beta$. The auxiliary regression for the correct
statistic CGF in (8.27) leads to $M^* = 2.50$, which is chi-square distributed.

In this simulation none of the five moment conditions is rejected at level 0.05, since
the p-value for $M^*$ exceeds 0.05 in each case. This is as expected, as the data in this simulation
example are generated from the specified density so that tests at level 0.05 should
reject only 5% of the time. The alternative statistic $M^{**}$ is valid only for tests 3 and
4 since only then does $E[\partial m/\partial\beta'] = 0$; otherwise, it only provides a lower bound
for M.


8.3.  Hausman  Test 

Tests  based  on  comparisons  between  two  different  estimators  are  called  Hausman  tests, 
after  Hausman  (1978),  or  Wu-Hausman  tests  or  even  Durbin-Wu-Hausman  tests  after 
Wu  (1973)  and  Durbin  (1954)  who  proposed  similar  tests. 


8.3.1.  Hausman  Test 

Consider a test for endogeneity of a regressor in a single equation. Two alternative
estimators are the OLS and 2SLS estimators, where the 2SLS estimator uses instruments
to control for possible endogeneity of the regressor. If there is endogeneity then OLS
is inconsistent, so the two estimators will have different probability limits. If there is no
endogeneity both estimators are consistent, so the two estimators have the same
probability limit. This suggests testing for endogeneity by testing for a difference between
the OLS and 2SLS estimators; see Section 8.4.3 for further discussion.

More generally, consider two estimators $\hat{\theta}$ and $\tilde{\theta}$. We consider the testing situation
where

$$H_0: \mathrm{plim}(\hat{\theta} - \tilde{\theta}) = 0, \qquad H_a: \mathrm{plim}(\hat{\theta} - \tilde{\theta}) \neq 0. \tag{8.34}$$

Assume the difference between the two root-N consistent estimators is also root-N
consistent under $H_0$ with mean 0 and a limit normal distribution, so that

$$\sqrt{N}(\hat{\theta} - \tilde{\theta}) \xrightarrow{d} \mathcal{N}[0, V_H],$$

where $V_H$ denotes the variance matrix in the limiting distribution. Then the Hausman
test statistic

$$H = (\hat{\theta} - \tilde{\theta})'\,(N^{-1}\hat{V}_H)^{-1}\,(\hat{\theta} - \tilde{\theta}) \tag{8.35}$$

is asymptotically $\chi^2(q)$ distributed under $H_0$. We reject $H_0$ at level $\alpha$ if $H > \chi^2_\alpha(q)$.
In some applications, such as tests of endogeneity, $V[\hat{\theta} - \tilde{\theta}]$ is of less than full rank.
Then the generalized inverse is used in (8.35) and the chi-square test has degrees of
freedom equal to the rank of $V[\hat{\theta} - \tilde{\theta}]$.

The Hausman test can be applied to just a subset of the parameters. For example,
interest may lie solely in the coefficient of the possibly endogenous regressor and
whether it changes in moving from OLS to 2SLS. Then just one component of $\theta$ is
used and the test statistic is $\chi^2(1)$ distributed. As in other settings, this test on a subset
of parameters can lead to a conclusion different from that of a test on all parameters.


8.3.2.  Computation  of  the  Hausman  Test 

Computing the Hausman test is easy in principle but difficult in practice owing to the
need to obtain a consistent estimate of $V_H$, the limit variance matrix of $\sqrt{N}(\hat{\theta} - \tilde{\theta})$. In
general

$$N^{-1}V_H = V[\hat{\theta} - \tilde{\theta}] = V[\hat{\theta}] + V[\tilde{\theta}] - 2\,\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]. \tag{8.36}$$

The first two quantities are readily computed from the usual output, but the third is
not.


Computation  for  Fully  Efficient  Estimator  under  the  Null  Hypothesis 

Although the essential null and alternative hypotheses of the Hausman test are as in
(8.34), in applications there is usually a specific null hypothesis model and alternative
hypothesis in mind. For example, in comparing OLS and 2SLS estimators the null
hypothesis model has all regressors exogenous whereas the alternative hypothesis model
permits some regressors to be endogenous.

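If $\hat{\theta}$ is the efficient estimator in the null hypothesis model, then $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}] = V[\hat{\theta}]$.
For proof see Exercise 8.3. This implies $V[\hat{\theta} - \tilde{\theta}] = V[\tilde{\theta}] - V[\hat{\theta}]$, so

$$H = (\tilde{\theta} - \hat{\theta})' \left(\hat{V}[\tilde{\theta}] - \hat{V}[\hat{\theta}]\right)^{-1} (\tilde{\theta} - \hat{\theta}). \tag{8.37}$$

This statistic has the considerable advantage of requiring only the estimated asymptotic
variance matrices of the parameter estimates $\hat{\theta}$ and $\tilde{\theta}$. It is helpful to use a program
that permits saving parameter and variance matrix estimates and computation using
matrix commands.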

For example, this simplification can be applied to endogeneity tests in a linear
regression model if the errors are assumed to be homoskedastic. Then $\hat{\theta}$ is the OLS
estimator that is fully efficient under the null hypothesis of no endogeneity, and $\tilde{\theta}$ is
the 2SLS estimator. Care is needed, however, to ensure the consistent estimates of the
variance matrices are such that $\hat{V}[\tilde{\theta}] - \hat{V}[\hat{\theta}]$ is positive definite (see Ruud, 1984). In
the OLS-2SLS comparison the variance matrix estimators $\hat{V}[\hat{\theta}]$ and $\hat{V}[\tilde{\theta}]$ should use
the same estimate of the error variance $\sigma^2$.

Version (8.37) of the Hausman test is especially easy to calculate by hand if $\theta$ is a
scalar, or if only one component of the parameter vector is tested. Then

$$H = (\tilde{\theta} - \hat{\theta})^2 / (\tilde{s}^2 - \hat{s}^2)$$

is $\chi^2(1)$ distributed, where $\hat{s}$ and $\tilde{s}$ are the reported standard errors of $\hat{\theta}$ and $\tilde{\theta}$.
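
The scalar calculation can be sketched in a few lines of Python. This is only an illustration under the assumption that the first estimator is fully efficient under the null, so that the variance difference is positive; the function name `hausman_scalar` and the numbers in the example are hypothetical and are not taken from any application in the text.

```python
import math

def hausman_scalar(theta_eff, se_eff, theta_alt, se_alt):
    """Scalar Hausman statistic H = (theta_alt - theta_eff)^2 / (se_alt^2 - se_eff^2).

    theta_eff, se_eff: estimate and standard error of the estimator assumed
    efficient under the null (e.g. OLS); theta_alt, se_alt: the alternative
    consistent estimator (e.g. 2SLS).
    """
    var_diff = se_alt ** 2 - se_eff ** 2
    if var_diff <= 0:
        raise ValueError("estimated variance difference is not positive")
    h = (theta_alt - theta_eff) ** 2 / var_diff
    # For a chi-square(1) variable, P(X > h) = erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value

# Hypothetical numbers: OLS 0.50 (s.e. 0.10), 2SLS 0.80 (s.e. 0.20).
h, p = hausman_scalar(0.50, 0.10, 0.80, 0.20)
```

With these made-up inputs H is 3.0, which falls short of the 5% critical value of 3.84. The explicit check on the sign of the variance difference reflects the caution above that the estimated difference need not be positive in finite samples.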

Auxiliary  Regressions 

In some leading cases the Hausman test can be more simply computed as a standard
test for the significance of a subset of regressors in an augmented OLS regression,
derived under the assumption that $\hat{\theta}$ is fully efficient. Examples are given in Section
8.4.3 and in Section 21.4.3.


Robust  Hausman  Tests 

The simpler version (8.37) of the Hausman test, and standard auxiliary regressions,
require the strong distributional assumption that $\hat{\theta}$ is fully efficient. This is counter
to the approach of performing robust inference under relatively weak distributional
assumptions.

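Direct estimation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$ and hence $V_H$ is in principle possible. Suppose $\hat{\theta}$ and
$\tilde{\theta}$ are m-estimators that solve $\sum_i \mathbf{h}_{1i}(\hat{\theta}) = 0$ and $\sum_i \mathbf{h}_{2i}(\tilde{\theta}) = 0$. Define $\hat{\delta} = [\hat{\theta}', \tilde{\theta}']'$.

Then $V[\hat{\delta}] = \mathbf{G}_0^{-1}\mathbf{S}_0(\mathbf{G}_0^{-1})'$, where $\mathbf{G}_0$ and $\mathbf{S}_0$ are defined in Section 6.6, with the
simplification that here $\mathbf{G}_{12} = 0$. The desired $V[\hat{\theta} - \tilde{\theta}] = \mathbf{R}V[\hat{\delta}]\mathbf{R}'$, where $\mathbf{R} = [\mathbf{I}_q, -\mathbf{I}_q]$.
Implementation can require additional coding that may be application specific.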

A  simpler  approach  is  to  bootstrap  (see  Section  11.6.3),  though  care  is  needed  in 
some  applications  to  ensure  use  of  the  correct  degrees  of  freedom  in  the  chi-square 
test. 

Another possible approach for less than fully efficient $\hat{\theta}$ is to use an auxiliary
regression that is appropriate in the efficient case but to perform the subsets of
regressors test using robust standard errors. This robust test is simple to implement and will
have power in testing the misspecification of interest, though it may not necessarily be
equivalent to the Hausman test that uses the more general form of H given in (8.35).
An example is given in Section 21.4.3.

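Finally, bounds can be calculated that do not require computation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$. For
scalar random variables, $\mathrm{Cov}[x, y] \leq s_x s_y$. For the scalar case this suggests an upper
bound for H of $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2 - 2\hat{s}\tilde{s})$, where $\hat{s}^2 = \hat{V}[\hat{\theta}]$ and $\tilde{s}^2 = \hat{V}[\tilde{\theta}]$. A lower
bound for H is $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2)$, under the assumption that $\hat{\theta}$ and $\tilde{\theta}$ are positively
correlated. In practice, however, these bounds are quite wide.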


8.3.3.  Power  of  the  Hausman  Test 

The Hausman test is a quite general procedure that does not explicitly state an
alternative hypothesis and therefore need not have high power against particular alternatives.
For example, consider tests of exclusion restrictions in fully parametric models.
Denote the null hypothesis $H_0 : \theta_2 = 0$, where $\theta$ is partitioned as $(\theta_1', \theta_2')'$. An obvious
specification test is a Hausman test of the difference $\hat{\theta}_1 - \tilde{\theta}_1$, where $(\hat{\theta}_1, \hat{\theta}_2)$ is the
unrestricted MLE and $(\tilde{\theta}_1, 0)$ is the restricted MLE of $\theta$. Holly (1982) showed that this
Hausman test coincides with a classical test (Wald, LR, or LM) of $H_0 : \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = 0$,
where $\mathcal{I}_{ij} = \mathrm{E}[\partial^2 \mathcal{L}(\theta_1, \theta_2)/\partial\theta_i \partial\theta_j']$, rather than of $H_0 : \theta_2 = 0$. The two tests
coincide if $\mathcal{I}_{12}$ is of full column rank and $\dim(\theta_1) \geq \dim(\theta_2)$, as then $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = 0$
iff $\theta_2 = 0$. Otherwise, they can differ. Clearly, the Hausman test will have no power
against departures from $H_0$ if the information matrix is block diagonal, as then $\mathcal{I}_{12} = 0$. Holly (1987)
extended analysis to nonlinear hypotheses.


8.4.  Tests  for  Some  Common  Misspecifications 

In  this  section  we  present  tests  for  some  common  model  misspecifications.  Attention 
is  focused  on  test  statistics  that  can  be  computed  using  auxiliary  regressions,  using 
minimal  assumptions  to  permit  inference  robust  to  heteroskedastic  errors. 


8.4.1.  Tests  for  Omitted  Variables 


Omitted  variables  usually  lead  to  inconsistent  parameter  estimates,  except  for  special 
cases  such  as  an  omitted  regressor  in  the  linear  model  that  is  uncorrelated  with  the 
other  regressors.  It  is  therefore  important  to  test  for  potential  omitted  variables. 

The Wald test is most often used as it is usually no more difficult to estimate the
model with omitted variables included than to estimate the restricted model with
omitted variables excluded. Furthermore, this test can use robust sandwich standard errors,
though this really only makes sense if the estimator retains consistency in situations
where robust sandwich errors are necessary.

If  attention  is  restricted  to  ML  estimation  an  alternative  is  to  estimate  models  with 
and  without  the  potentially  irrelevant  regressors  and  perform  an  LR  test. 

Robust forms of the LM test can be easily computed in some settings. For example,
consider a test of $H_0 : \beta_2 = 0$ in the Poisson model with mean $\exp(\mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2)$. The
LM test statistic is based on the score statistic $\sum_i \mathbf{x}_i \tilde{u}_i$, where $\tilde{u}_i = y_i - \exp(\mathbf{x}_{1i}'\tilde{\beta}_1)$
(see Section 7.3.2). Now a heteroskedastic robust estimate for the variance of
$N^{-1/2}\sum_i \mathbf{x}_i u_i$, where $u_i = y_i - \mathrm{E}[y_i|\mathbf{x}_i]$, is $N^{-1}\sum_i \tilde{u}_i^2 \mathbf{x}_i\mathbf{x}_i'$, and it can be shown that

$$\mathrm{LM}^{+} = \left[\sum_{i=1}^{n} \mathbf{x}_i \tilde{u}_i\right]' \left[\sum_{i=1}^{n} \tilde{u}_i^2\, \mathbf{x}_i \mathbf{x}_i'\right]^{-1} \left[\sum_{i=1}^{n} \mathbf{x}_i \tilde{u}_i\right]$$

is a robust LM test statistic that does not require the Poisson restriction that $V[u_i|\mathbf{x}_i] =
\exp(\mathbf{x}_{1i}'\beta_1)$ under $H_0$. This can be computed as N times the uncentered $R^2$ from
regression of 1 on $\mathbf{x}_{1i}\tilde{u}_i$ and $\mathbf{x}_{2i}\tilde{u}_i$. Such robust LM tests are possible more generally for
assumed models in the linear exponential family, as the score statistic in such models is
again a weighted average of a residual $u_i$ (see Wooldridge, 1991). This class includes
OLS, and adaptations are also possible when estimation is by 2SLS or by NLS; see
Wooldridge (2002).
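
To make the robust LM calculation concrete, the following Python sketch treats the simplest special case, which is an assumption made here for illustration: the restricted Poisson model contains only an intercept and a single scalar regressor is tested. The restricted MLE then sets the fitted mean equal to the sample mean of y, and the quadratic form can be evaluated with the closed-form inverse of a 2 x 2 matrix. The function name and the data are hypothetical.

```python
import math

def robust_lm_poisson(y, x2):
    """Robust LM statistic for H0: beta2 = 0 in a Poisson model with mean
    exp(beta1 + beta2 * x2), i.e. an intercept plus one scalar regressor."""
    n = len(y)
    ybar = sum(y) / n                      # restricted MLE: exp(beta1) = ybar
    u = [yi - ybar for yi in y]            # residuals under H0
    # Score vector sum_i x_i u_i with x_i = (1, x2_i)'.
    s1 = sum(u)
    s2 = sum(xi * ui for xi, ui in zip(x2, u))
    # Matrix sum_i u_i^2 x_i x_i' (2 x 2, symmetric).
    m11 = sum(ui ** 2 for ui in u)
    m12 = sum(ui ** 2 * xi for xi, ui in zip(x2, u))
    m22 = sum(ui ** 2 * xi ** 2 for xi, ui in zip(x2, u))
    det = m11 * m22 - m12 ** 2
    # Quadratic form s' M^{-1} s using the closed-form 2 x 2 inverse.
    lm = (m22 * s1 ** 2 - 2.0 * m12 * s1 * s2 + m11 * s2 ** 2) / det
    p_value = math.erfc(math.sqrt(lm / 2.0))   # chi-square(1) upper tail
    return lm, p_value

# Made-up data in which y rises with x2.
y = [0, 1, 2, 3, 4, 5]
x2 = [0, 1, 2, 3, 4, 5]
lm, p = robust_lm_poisson(y, x2)
```

The same number is obtained, algebraically, as N times the uncentered R-squared from regressing 1 on the two products of regressor and residual; the quadratic form is used here only because it avoids a regression routine.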

8.4.2.  Tests  for  Heteroskedasticity

Parameter estimates in linear or nonlinear regression models of the conditional mean
estimated by LS or IV methods retain their consistency in the presence of
heteroskedasticity. The only correction needed is to the standard errors of these estimates.
This does not require modeling heteroskedasticity, as heteroskedastic-robust standard
errors can be computed under minimal distributional assumptions using the result of
White (1980). So there is little need to test for heteroskedasticity, unless estimator
efficiency is of great concern. Nonetheless, we summarize some results on tests for
heteroskedasticity.

We begin with LS estimation of the linear regression model $y = \mathbf{x}'\beta + u$. Suppose
heteroskedasticity is modeled by $V[u|\mathbf{x}] = g(\alpha_1 + \mathbf{z}'\alpha_2)$, where $\mathbf{z}$ is usually a
subset of $\mathbf{x}$ and $g(\cdot)$ is often the exponential function. The literature focuses on tests of
$H_0 : \alpha_2 = 0$ using the LM approach because, unlike Wald and LR tests, these require
only OLS estimation of $\beta$. The standard LM test of Breusch and Pagan (1979) depends
heavily on the assumption of normally distributed errors, as it uses the restriction that
$\mathrm{E}[u^4|\mathbf{x}] = 3\sigma^4$ under $H_0$. Koenker (1981) proposed a more robust version of the LM
test, $NR^2$ from regression of $\hat{u}_i^2$ on 1 and $\mathbf{z}_i$, where $\hat{u}_i$ is the OLS residual. This test
requires the weaker assumption that $\mathrm{E}[u^4|\mathbf{x}]$ is constant. Like the Breusch-Pagan test it
is invariant to choice of the function $g(\cdot)$. The White (1980a) test for heteroskedasticity
is equivalent to this LM test, with $\mathbf{z} = \mathrm{Vech}[\mathbf{x}\mathbf{x}']$. The test can be further generalized
to let $\mathrm{E}[u^4|\mathbf{x}]$ vary with $\mathbf{x}$, though constancy may be a reasonable assumption for the
test since $H_0$ already specifies that $\mathrm{E}[u^2|\mathbf{x}]$ is constant.
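
The Koenker version of the test is simple enough to sketch directly. The Python code below is an illustration only, assuming a single regressor plus intercept in the mean equation and a single scalar variable z in the variance equation, so that the statistic is compared with a chi-square with one degree of freedom. The helper names and the artificial data are hypothetical.

```python
import math

def simple_ols(y, x):
    """OLS of y on an intercept and scalar x; returns (intercept, slope)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def koenker_test(y, x, z):
    """N * R^2 from regressing squared OLS residuals on 1 and scalar z."""
    n = len(y)
    a, b = simple_ols(y, x)
    u2 = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    c, d = simple_ols(u2, z)
    u2bar = sum(u2) / n
    tss = sum((v - u2bar) ** 2 for v in u2)
    rss = sum((v - c - d * zi) ** 2 for v, zi in zip(u2, z))
    stat = n * (1.0 - rss / tss)            # N times the centered R^2
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))  # chi-square(1)
    return stat, p_value

# Artificial data in which the error magnitude grows with x.
x = [float(i) for i in range(1, 21)]
y = [xi + 0.5 * xi * (-1) ** i for i, xi in enumerate(x, start=1)]
stat, p = koenker_test(y, x, x)
```

Here z is set equal to x, in line with the remark that z is usually a subset of the regressors. With several variables in z the auxiliary regression would need a multiple-regression routine and the degrees of freedom would equal the number of variables in z.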

Qualitatively similar results carry over to nonlinear models of the conditional mean
that assume a particular form of heteroskedasticity that may be tested for
misspecification. For example, the Poisson regression model sets $V[y|\mathbf{x}] = \exp(\mathbf{x}'\beta)$. More
generally, for models in the linear exponential family, the quasi-MLE is consistent
despite misspecified heteroskedasticity and qualitatively similar results to those here
apply. Then valid inference is possible even if the model for heteroskedasticity is
misspecified, provided the robust standard errors presented in Section 5.7.4 are used. If
one still wishes to test for correct specification of heteroskedasticity then robust LM
tests are possible (see Wooldridge, 1991).

Heteroskedasticity can lead to the more serious consequence of inconsistency of
parameter estimates in some nonlinear models. A leading example is the Tobit model (see
Chapter 16), a linear regression model with normal homoskedastic errors that becomes
nonlinear as the result of censoring or truncation. Then testing for heteroskedasticity
becomes more important. A model for $V[u|\mathbf{x}]$ can be specified and Wald, LR, or LM
tests can be performed, or m-tests for heteroskedasticity can be used (see Pagan and
Vella, 1989).


8.4.3.  Hausman  Tests  for  Endogeneity 

Instrumental variables estimators should only be used where there is a need for them,
since LS estimators are more efficient if all regressors are exogenous and from
Section 4.9 this loss of efficiency can be substantial. It can therefore be useful to test
whether IV methods are needed. A test for endogeneity of regressors compares IV
estimates with LS estimates. If regressors are endogenous then in the limit these
estimates will differ, whereas if regressors are exogenous the two estimators will not differ.
Thus large differences between LS and IV estimates can be interpreted as evidence of
endogeneity.

This example provides the original motivation for the Hausman test. Consider the
linear regression model

$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + u, \tag{8.38}$$

where $\mathbf{x}_1$ is potentially endogenous and $\mathbf{x}_2$ is exogenous. Let $\hat{\beta}$ be the OLS estimator
and $\tilde{\beta}$ be the 2SLS estimator in (8.38). Assuming homoskedastic errors so that OLS is
efficient under the null hypothesis of no endogeneity, a Hausman test of endogeneity of
$\mathbf{x}_1$ can be calculated using the test statistic H defined in (8.37). Because $V[\tilde{\beta}] - V[\hat{\beta}]$
can be shown to be not of full rank, however, a generalized inverse is needed and the
degrees of freedom are $\dim(\beta_1)$ rather than $\dim(\beta)$.

Hausman (1978) showed that the test can more simply be implemented by a test of
$\gamma = 0$ in the augmented OLS regression

$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + \hat{\mathbf{x}}_1'\gamma + u,$$

where $\hat{\mathbf{x}}_1$ is the predicted value of the endogenous regressors $\mathbf{x}_1$ from reduced form
multivariate regression of $\mathbf{x}_1$ on the instruments $\mathbf{z}$. Equivalently, we can test $\gamma = 0$ in
the augmented OLS regression

$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + \hat{\mathbf{v}}_1'\gamma + u,$$

where $\hat{\mathbf{v}}_1$ is the residual from the reduced form multivariate regression of $\mathbf{x}_1$ on the
instruments $\mathbf{z}$. Intuition for these tests is that if $u$ in (8.38) is uncorrelated with $\mathbf{x}_1$
and $\mathbf{x}_2$, then $\gamma = 0$. If instead $u$ is correlated with $\mathbf{x}_1$, then this will be picked up by
significance of additional transformations of $\mathbf{x}_1$ such as $\hat{\mathbf{x}}_1$ and $\hat{\mathbf{v}}_1$.
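
The residual-augmented regression can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not in the text: one potentially endogenous regressor, an intercept as the only exogenous regressor, a single instrument, and the classical homoskedastic variance formula for the t-statistic. The small linear-algebra helpers, the function names, and the simulated data are all hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))   # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Return OLS coefficients, residuals, and X'X for regressor rows X."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(xtx, xty)
    resid = [yi - sum(bj * rj for bj, rj in zip(b, r)) for r, yi in zip(X, y)]
    return b, resid, xtx

def endogeneity_t_test(y, x1, z):
    """t-statistic on the first-stage residual in the augmented regression."""
    n = len(y)
    # Reduced form: regress x1 on an intercept and the instrument z.
    _, v_hat, _ = ols([[1.0, zi] for zi in z], x1)
    # Augmented regression: y on 1, x1, and v_hat.
    X = [[1.0, xi, vi] for xi, vi in zip(x1, v_hat)]
    b, resid, xtx = ols(X, y)
    s2 = sum(e * e for e in resid) / (n - 3)
    inv_col = solve(xtx, [0.0, 0.0, 1.0])   # last column of (X'X)^{-1}
    t_gamma = b[2] / math.sqrt(s2 * inv_col[2])
    return b, t_gamma

# Simulated data with a common shock e that makes x1 endogenous.
rng = random.Random(0)
n = 500
z = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]
x1 = [zi + ei + 0.5 * rng.gauss(0, 1) for zi, ei in zip(z, e)]
y = [1.0 + 2.0 * xi + ei + 0.5 * rng.gauss(0, 1) for xi, ei in zip(x1, e)]
b, t_gamma = endogeneity_t_test(y, x1, z)
```

A large absolute value of `t_gamma` is evidence of endogeneity. In this just-identified setup the coefficient on x1 in the augmented regression coincides numerically with the IV estimate, which is a convenient check on the code.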

For cross-section data it is customary to presume heteroskedastic errors. Then the
OLS estimator $\hat{\beta}$ is inefficient in (8.38) and the simpler version (8.37) of the
Hausman test cannot be used. However, the preceding augmented OLS regressions can
still be used, provided $\gamma = 0$ is tested using the heteroskedastic-consistent estimate of
the variance matrix. This should actually be equivalent to the Hausman test, as from
Davidson and MacKinnon (1993, p. 239) $\hat{\gamma}_{OLS}$ in these augmented regressions equals
$\mathbf{A}_N(\hat{\beta} - \tilde{\beta})$, where $\mathbf{A}_N$ is a full-rank matrix with finite probability limit.

Additional Hausman tests for endogeneity are possible. Suppose $y = \mathbf{x}_1'\beta_1 +
\mathbf{x}_2'\beta_2 + \mathbf{x}_3'\beta_3 + u$, where $\mathbf{x}_1$ is potentially endogenous, $\mathbf{x}_2$ is assumed to be
endogenous, and $\mathbf{x}_3$ is assumed to be exogenous. Then endogeneity of $\mathbf{x}_1$ can be tested
by comparing the 2SLS estimator with just $\mathbf{x}_2$ instrumented to the 2SLS
estimator with both $\mathbf{x}_1$ and $\mathbf{x}_2$ instrumented. The Hausman test can also be generalized
to nonlinear regression models, with OLS replaced by NLS and 2SLS replaced
by NL2SLS. Davidson and MacKinnon (1993) present augmented regressions that
can be used to compute the relevant Hausman test, assuming homoskedastic errors.
Mroz (1987) provides a good application of endogeneity tests including examples of
computation of $V[\hat{\theta} - \tilde{\theta}]$ when $\hat{\theta}$ is not efficient.


8.4.4.  OIR  Tests  for  Exogeneity

If  an  IV  estimator  is  used  then  the  instruments  must  be  exogenous  for  the  IV  estimator 
to  be  consistent.  For  just-identified  models  it  is  not  possible  to  test  for  instrument 
exogeneity.  Instead,  a  priori  arguments  need  to  be  used  to  justify  instrument  validity. 
Some  examples  are  given  in  Section  4.8.2.  For  overidentified  models,  however,  a  test 
for  exogeneity  of  instruments  is  possible. 

We begin with linear regression. Then $y = \mathbf{x}'\beta + u$ and instruments $\mathbf{z}$ are valid
if $\mathrm{E}[u|\mathbf{z}] = 0$ or if $\mathrm{E}[\mathbf{z}u] = 0$. An obvious test of $H_0 : \mathrm{E}[\mathbf{z}u] = 0$ is based on
departures of $N^{-1}\sum_i \mathbf{z}_i\hat{u}_i$ from zero. In the just-identified case the IV estimator solves
$N^{-1}\sum_i \mathbf{z}_i\hat{u}_i = 0$ so this test is not useful. In the overidentified case the
overidentifying restrictions test presented in Section 6.3.8 is

$$\mathrm{OIR} = \hat{\mathbf{u}}'\mathbf{Z}\hat{\mathbf{S}}^{-1}\mathbf{Z}'\hat{\mathbf{u}}, \tag{8.39}$$

where $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\beta}$, $\hat{\beta}$ is the optimal GMM estimator that minimizes $\mathbf{u}'\mathbf{Z}\hat{\mathbf{S}}^{-1}\mathbf{Z}'\mathbf{u}$, and
$\hat{\mathbf{S}}$ is consistent for plim $N^{-1}\sum_i u_i^2\mathbf{z}_i\mathbf{z}_i'$. The OIR test of Hansen (1982) is an extension
of a test proposed by Sargan (1958) for linear IV, and the test statistic (8.39) is often
called a Sargan test. If OIR is large then the moment conditions are rejected and the
IV estimator is inconsistent. Rejection of $H_0$ is usually interpreted as evidence that the
instruments $\mathbf{z}$ are endogenous, but it could also be evidence of model
misspecification so that in fact $y \neq \mathbf{x}'\beta + u$. In either case rejection indicates problems for the IV
estimator.

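As formally derived in Section 6.3.9, OIR is distributed as $\chi^2(r - K)$ under $H_0$,
where $(r - K)$ is the number of overidentifying restrictions. To gain some intuition for
this result it is useful to specialize to homoskedastic errors. Then $\hat{\mathbf{S}} = \hat{\sigma}^2\mathbf{Z}'\mathbf{Z}$, where
$\hat{\sigma}^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)$, so

$$\mathrm{OIR} = \frac{\hat{\mathbf{u}}'\mathbf{P}_Z\hat{\mathbf{u}}}{\hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)},$$

where $\mathbf{P}_Z = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$. Thus OIR is a ratio of quadratic forms in $\hat{\mathbf{u}}$. Under $H_0$ the
numerator has probability limit $\sigma^2(r - K)$ and the denominator has plim $\hat{\sigma}^2 = \sigma^2$, so
the ratio is centered on $r - K$, which is the mean of a $\chi^2(r - K)$ random variable.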

The test statistic in (8.39) extends immediately to nonlinear regression, by simply
defining $u_i = y_i - g(\mathbf{x}_i, \beta)$ or $u = r(y, \mathbf{x}, \beta)$ as in Section 6.5, and to linear systems
and panel estimators by appropriate definition of $u$ (see Sections 6.9 and 6.10).

For linear IV with homoskedastic errors alternative OIR tests to (8.39) have been
proposed. Magdalinos (1988) contrasts a number of these tests. One can also use
incremental OIR tests of a subset of overidentifying restrictions.


8.4.5.  RESET  Test 

A common functional form misspecification may involve neglected nonlinearity in
some of the regressors. Consider the regression $y = \mathbf{x}'\beta + u$, where we assume that the
regressors enter linearly and are asymptotically uncorrelated with the error $u$. To test
for nonlinearity one straightforward approach is to enter power functions of exogenous
variables, most commonly squares, as additional independent regressors and test the
statistical significance of these additional variables using a Wald test or an F-test.
This requires the investigator to have specific reasons for considering nonlinearity, and
clearly the technique will not work for categorical x variables.

Ramsey (1969) suggested a test of omitted variables from the regression that can
be formulated as a test of functional form. The proposal is to fit the initial
regression and generate new regressors that are functions of fitted values $\hat{y} = \mathbf{x}'\hat{\beta}$, such
as $\mathbf{w} = [(\mathbf{x}'\hat{\beta})^2, (\mathbf{x}'\hat{\beta})^3, \ldots, (\mathbf{x}'\hat{\beta})^p]$. Then estimate the model $y = \mathbf{x}'\beta + \mathbf{w}'\gamma + u$,
and the test of nonlinearity is the Wald test of p restrictions, $H_0 : \gamma = 0$ against
$H_a : \gamma \neq 0$. Typically a low value of p such as 2 or 3 is used. This test can be made
robust to heteroskedasticity.


8.5.  Discriminating  between  Nonnested  Models 

Two  models  are  nested  if  one  is  a  special  case  of  the  other;  they  are  nonnested  if 
neither  can  be  represented  as  a  special  case  of  the  other.  Discriminating  between  nested 
models  is  possible  using  a  standard  hypothesis  test  of  the  parametric  restrictions  that 
reduce  one  model  to  the  other.  In  the  nonnested  case,  however,  alternative  methods 
need  to  be  developed. 

The presentation focuses on nonnested model discrimination within the likelihood
framework, where results are well developed. A brief discussion of the nonlikelihood
case is given in Section 8.5.4. Bayesian methods for model discrimination are
presented in Section 13.8.


8.5.1.  Information  Criteria 

Information  criteria  are  log-likelihood  criteria  with  degrees  of  freedom  adjustment. 
The  model  with  the  smallest  information  criterion  is  preferred. 

The  essential  intuition  is  that  there  exists  a  tension  between  model  fit,  as  measured 
by  the  maximized  log-likelihood  value,  and  the  principle  of  parsimony  that  favors  a 
simple  model.  The  fit  of  the  model  can  be  improved  by  increasing  model  complexity. 
However,  parameters  are  only  added  if  the  resulting  improvement  in  fit  sufficiently 
compensates  for  loss  of  parsimony.  Note  that  in  this  viewpoint  it  is  not  necessary 
that  the  set  of  models  under  consideration  should  include  the  “true  dgp.”  Different 
information  criteria  vary  in  how  steeply  they  penalize  model  complexity. 

Akaike (1973) originally proposed the Akaike information criterion

$$\mathrm{AIC} = -2\ln L + 2q, \tag{8.40}$$

where q is the number of parameters, with the model with lowest AIC preferred. The
term information criterion is used because the underlying theory, presented more
simply in Amemiya (1980), discriminates among models using the Kullback-Leibler
information criterion (KLIC).

A considerable number of modifications to AIC have been proposed, all of the form
$-2\ln L + g(q, N)$ for specified penalty function $g(\cdot)$ that exceeds $2q$. The most popular
variation is the Bayesian information criterion

$$\mathrm{BIC} = -2\ln L + (\ln N)q, \tag{8.41}$$

proposed by Schwarz (1978). Schwarz assumed y has density in the exponential family
with parameter $\theta$, the jth model has parameter $\theta_j$ with $\dim[\theta_j] = q_j < \dim[\theta]$, and
the prior across models is a weighted sum of the prior for each $\theta_j$. He showed that
under these assumptions maximizing the posterior probability (see Chapter 13) is
asymptotically equivalent to choosing the model for which $\ln L - (\ln N)q_j/2$ is largest. Since
this is equivalent to minimizing (8.41), the procedure of Schwarz has been labeled the
Bayesian information criterion. A refinement of AIC based on minimization of KLIC
that is similar to BIC is the consistent AIC, $\mathrm{CAIC} = -2\ln L + (1 + \ln N)q$. Some
authors define criteria such as AIC and BIC by additionally dividing by N in the
right-hand sides of (8.40) and (8.41).

If model parsimony is important, then BIC is more widely used as the model-size
penalty for AIC is relatively low. Consider two nested models with $q_1$ and $q_2$
parameters, respectively, where $q_2 = q_1 + h$. An LR test is then possible and favors the larger
model at significance level 5% if $2\ln L$ increases by more than $\chi^2_{.05}(h)$. AIC favors the larger
model if $2\ln L$ increases by more than $2h$, a lesser penalty for model size than the LR
test if $h < 7$. In particular for $h = 1$, that is, one restriction, the LR test uses a 5%
critical value of 3.84 whereas AIC uses a much lower value of 2. The BIC favors the
larger model if $2\ln L$ increases by more than $h \ln N$, a much larger penalty than either AIC or an
LR test of size 0.05 (unless N is exceptionally small).
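
The three criteria and the comparison with the LR test can be illustrated with a short Python sketch. The function names and the two fitted log-likelihood values are hypothetical, chosen so that the criteria disagree; only the formulas (8.40), (8.41), and the CAIC expression come from the text.

```python
import math

def aic(loglik, q):
    """Akaike information criterion, -2 ln L + 2q."""
    return -2.0 * loglik + 2.0 * q

def bic(loglik, q, n):
    """Bayesian information criterion, -2 ln L + (ln N) q."""
    return -2.0 * loglik + math.log(n) * q

def caic(loglik, q, n):
    """Consistent AIC, -2 ln L + (1 + ln N) q."""
    return -2.0 * loglik + (1.0 + math.log(n)) * q

# Hypothetical fitted models: the larger model adds h = 1 parameter and
# raises 2 ln L by 3.0.
n = 1000
small = {"loglik": -520.0, "q": 3}
large = {"loglik": -518.5, "q": 4}

prefers_large_aic = aic(large["loglik"], large["q"]) < aic(small["loglik"], small["q"])
prefers_large_bic = bic(large["loglik"], large["q"], n) < bic(small["loglik"], small["q"], n)
prefers_large_lr = 2.0 * (large["loglik"] - small["loglik"]) > 3.84   # 5% chi-square(1)
```

With these numbers AIC favors the larger model, since the gain of 3.0 exceeds 2, whereas both the LR test at 5% (threshold 3.84) and BIC (threshold ln 1000, about 6.9) favor the smaller model.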

The Bayesian information criterion increases the penalty as sample size increases,
whereas traditional hypothesis tests at a significance level such as 5% do not. For
nested models with $q_2 = q_1 + 1$ choosing the larger model on the basis of lower BIC
is equivalent to using a two-sided t-test critical value of $\sqrt{\ln N}$, which equals 2.15,
3.03, and 3.72, respectively, for $N = 10^2$, $10^4$, and $10^6$. By comparison traditional
hypothesis tests with size 0.05 use an unchanging critical value of 1.96. More generally,
for a $\chi^2(h)$ distributed test statistic the BIC suggests using a critical value of $h \ln N$
rather than the customary $\chi^2_{.05}(h)$.
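
The implied critical values are easy to tabulate. The following Python lines are a small sketch with hypothetical function names that evaluates the two expressions just given for the three sample sizes quoted.

```python
import math

def bic_t_critical(n):
    """Implied two-sided t critical value when one parameter is added."""
    return math.sqrt(math.log(n))

def bic_chi2_critical(n, h):
    """Implied chi-square critical value when h parameters are added."""
    return h * math.log(n)

for n in (10**2, 10**4, 10**6):
    print(n, round(bic_t_critical(n), 2))
```

The printed values are 2.15, 3.03, and 3.72, matching the figures in the text.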

Given  their  simplicity,  penalized  likelihood  criteria  are  often  used  for  selecting  “the 
best  model.”  However,  there  is  no  clear  answer  as  to  which  criterion,  if  any,  should 
be preferred. Considerable approximation is involved in deriving the formulas for AIC
and related measures, and loss functions other than minimization of KLIC, or
maximization of the posterior probability in the case of BIC, might be much more
appropriate. From a decision-theoretic viewpoint, the choice of the model from a set of
models  should  depend  on  the  intended  use  of  that  model.  For  example,  the  purpose  of 
the  model  may  be  to  summarize  the  main  features  of  a  complex  reality,  or  to  predict 
some  outcome,  or  to  test  some  important  hypothesis.  In  applied  work  it  is  quite  rare  to 
see  an  explicit  statement  of  the  intended  use  of  an  econometric  model. 


8.5.2.  Cox  Likelihood  Ratio  Test  of  Nonnested  Models 

Consider choosing between two parametric models. Let model $F_\theta$ have density
$f(y|\mathbf{x}, \theta)$ and model $G_\gamma$ have density $g(y|\mathbf{x}, \gamma)$.

A likelihood ratio test of the model $F_\theta$ against $G_\gamma$ is based on

$$\mathrm{LR}(\hat{\theta}, \hat{\gamma}) = \mathcal{L}_f(\hat{\theta}) - \mathcal{L}_g(\hat{\gamma}) = \sum_{i=1}^{N} \ln \frac{f(y_i|\mathbf{x}_i, \hat{\theta})}{g(y_i|\mathbf{x}_i, \hat{\gamma})}. \tag{8.42}$$

If $G_\gamma$ is nested in $F_\theta$ then, from Section 7.3.1, $2\mathrm{LR}(\hat{\theta}, \hat{\gamma})$ is chi-square distributed
under the null hypothesis that $F_\theta = G_\gamma$. However, this result no longer holds if the
models are nonnested.

Cox (1961, 1962b) proposed solving this problem in the special case that $F_\theta$ is the
true model but the models are not nested, by applying a central limit theorem under
the assumption that $F_\theta$ is the true model.

This approach is computationally awkward to implement if one cannot analytically
obtain $\mathrm{E}_f[\ln(f(y|\mathbf{x}, \theta)/g(y|\mathbf{x}, \gamma))]$, where $\mathrm{E}_f$ denotes expectation with respect to the
density $f(y|\mathbf{x}, \theta)$. Furthermore, if a similar test statistic is obtained with the roles of
$F_\theta$ and $G_\gamma$ reversed it is possible to find both that model $F_\theta$ is rejected in favor of
$G_\gamma$ and that model $G_\gamma$ is rejected in favor of $F_\theta$. The test is therefore not necessarily
one of model selection as it does not necessarily select one or the other; instead it is a
model specification test that zero, one, or two of the models can pass.

The Cox statistic has been obtained analytically in some cases. For nonnested
linear regression models $y = \mathbf{x}'\beta + u$ and $y = \mathbf{z}'\gamma + v$ with homoskedastic
normally distributed errors, see Pesaran (1974). For nonnested transformation models
$h(y) = \mathbf{x}'\beta + u$ and $g(y) = \mathbf{z}'\gamma + v$, where $h(y)$ and $g(y)$ are known
transformations, see Pesaran and Pesaran (1995), who use a simulation-based approach. This
permits, for example, discrimination between linear and log-linear parametric
models, with $h(\cdot)$ the identity transformation and $g(\cdot)$ the log transformation. Pesaran and
Pesaran (1995) apply the idea to choosing between logit and probit models presented in
Chapter 14.


8.5.3.  Vuong  Likelihood  Ratio  Test  of  Nonnested  Models


Vuong (1989) provided a very general distribution theory for the LR test statistic that
covers both nested and nonnested models and more remarkably permits the dgp to be
an unknown density that differs from both $f(\cdot)$ and $g(\cdot)$.

The  asymptotic  results  of  Vuong,  presented  here  to  aid  understanding  of  the  variety 
of  tests  presented  in  Vuong’s  paper,  are  relatively  complex  as  in  some  cases  the  test 
statistic  is  a  weighted  sum  of  chi-squares  with  weights  that  can  be  difficult  to  compute. 

Vuong proposed a test of

$$H_0 : \mathrm{E}_0\left[\ln \frac{f(y|\mathbf{x}, \theta)}{g(y|\mathbf{x}, \gamma)}\right] = 0, \tag{8.43}$$

where $\mathrm{E}_0$ denotes expectation with respect to the true dgp $h(y|\mathbf{x})$, which may be
unknown. This is equivalent to testing $\mathrm{E}_h[\ln(h/g)] - \mathrm{E}_h[\ln(h/f)] = 0$, or testing whether
the two densities $f$ and $g$ have the same Kullback-Leibler information criterion
(see Section 5.7.2). One-sided alternatives are possible with $H_f : \mathrm{E}_0[\ln(f/g)] > 0$ and
$H_g : \mathrm{E}_0[\ln(f/g)] < 0$.

An obvious test of $H_0$ is an m-test of whether the sample analogue $\mathrm{LR}(\hat{\theta}, \hat{\gamma})$ defined
in (8.42) differs from zero. Here the distribution of the test statistic is to be obtained
with possibly unknown dgp. This is possible because from Section 5.7.1 the
quasi-MLE $\hat{\theta}$ converges to the pseudo-true value $\theta_*$ and $\sqrt{N}(\hat{\theta} - \theta_*)$ has a limit normal
distribution, with a similar result for the quasi-MLE $\hat{\gamma}$.


General  Result 


The resulting distribution of $\mathrm{LR}(\hat\theta,\hat\gamma)$ varies according to whether or not the two models,
both possibly incorrect, are equivalent in the sense that $f(y|\mathbf{x},\theta_*) = g(y|\mathbf{x},\gamma_*)$,
where $\theta_*$ and $\gamma_*$ are the pseudo-true values of $\theta$ and $\gamma$.

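If $f(y|\mathbf{x},\theta_*) = g(y|\mathbf{x},\gamma_*)$ then

$$2\,\mathrm{LR}(\hat\theta,\hat\gamma) \xrightarrow{d} M_{p+q}(\lambda_*), \tag{8.44}$$

where $p$ and $q$ are the dimensions of $\theta$ and $\gamma$ and $M_{p+q}(\lambda_*)$ denotes the cdf of the
weighted sum of chi-squared variables $\sum_{j=1}^{p+q} \lambda_{*j} Z_j^2$. The $Z_j^2$ are iid $\chi^2(1)$ and $\lambda_{*j}$ are
the eigenvalues of the $(p+q) \times (p+q)$ matrix

$$W = \begin{bmatrix} -B_f(\theta_*)A_f(\theta_*)^{-1} & -B_{fg}(\theta_*,\gamma_*)A_g(\gamma_*)^{-1} \\ -B_{gf}(\gamma_*,\theta_*)A_f(\theta_*)^{-1} & -B_g(\gamma_*)A_g(\gamma_*)^{-1} \end{bmatrix}, \tag{8.45}$$

where $A_f(\theta_*) = \mathrm{E}_0[\partial^2 \ln f/\partial\theta\,\partial\theta']$, $B_f(\theta_*) = \mathrm{E}_0[(\partial \ln f/\partial\theta)(\partial \ln f/\partial\theta')]$, the matrices
$A_g(\gamma_*)$ and $B_g(\gamma_*)$ are similarly defined for the density $g(\cdot)$, the cross-matrix
$B_{fg}(\theta_*,\gamma_*) = \mathrm{E}_0[(\partial \ln f/\partial\theta)(\partial \ln g/\partial\gamma')]$, and expectations are with respect to the
true dgp. For explanation and derivation of these results see Vuong (1989).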

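If instead $f(y|\mathbf{x},\theta_*) \neq g(y|\mathbf{x},\gamma_*)$, then under $H_0$

$$N^{-1/2}\,\mathrm{LR}(\hat\theta,\hat\gamma) \xrightarrow{d} \mathcal{N}[0, \omega_*^2], \tag{8.46}$$

where

$$\omega_*^2 = \mathrm{V}_0\left[\ln \frac{f(y|\mathbf{x},\theta_*)}{g(y|\mathbf{x},\gamma_*)}\right], \tag{8.47}$$

and the variance is with respect to the true dgp. For derivation again see Vuong (1989).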

Use  of  these  results  varies  with  whether  or  not  one  model  is  assumed  to  be  correctly 
specified  and  with  the  nesting  relationship  between  the  two  models. 

Vuong differentiated among three types of model comparisons. The models $F_\theta$ and
$G_\gamma$ are (1) nested with $G_\gamma$ nested in $F_\theta$ if $G_\gamma \subset F_\theta$; (2) strictly nonnested models
if and only if $F_\theta \cap G_\gamma = \emptyset$ so that neither model can specialize to the other; and
(3) overlapping if $F_\theta \cap G_\gamma \neq \emptyset$ and $F_\theta \not\subset G_\gamma$ and $G_\gamma \not\subset F_\theta$. Similar distinctions are
made by Pesaran and Pesaran (1995).

Both (2) and (3) are nonnested models, but they require different testing procedures.
Examples of strictly nonnested models are linear models with different error distributions
and nonlinear regression models with the same error distributions but different
functional forms for the conditional mean. For overlapping models some specializations
of the two models are equal. An example is linear models with some regressors
in common and some regressors not in common.

Nested Models

For nested models it is necessarily the case that $f(y|\mathbf{x},\theta_*) = g(y|\mathbf{x},\gamma_*)$. For $G_\gamma$
nested in $F_\theta$, $H_0$ is tested against $H_f : \mathrm{E}_0[\ln(f/g)] > 0$.

For density possibly misspecified the weighted chi-square result (8.44) is appropriate,
using the eigenvalues $\hat\lambda_j$ of the sample analogue of $W$ in (8.45). Alternatively, one
can use eigenvalues $\hat\lambda_j$ of the sample analogue of the smaller matrix

$$\underline{W} = B_f(\theta_*)\left[D(\gamma_*)A_g(\gamma_*)^{-1}D(\gamma_*)' - A_f(\theta_*)^{-1}\right],$$

where $D(\gamma_*) = \partial\phi(\gamma_*)/\partial\gamma'$ and the constrained quasi-MLE $\tilde\theta = \phi(\hat\gamma)$; see Vuong
(1989). This result provides a robustified version of the standard LR test for nested
models.

If the density $f(\cdot)$ is actually correctly specified, or more generally satisfies the IM
equality, we get the expected result that $2\,\mathrm{LR}(\hat\theta,\hat\gamma) \xrightarrow{d} \chi^2(p - q)$ as then $(p - q)$ of
the eigenvalues of $W$ or $\underline{W}$ equal one whereas the others equal zero.


Strictly  Nonnested  Models 


For strictly nonnested models it is necessarily the case that $f(y|\mathbf{x},\theta_*) \neq g(y|\mathbf{x},\gamma_*)$.
The normal distribution result (8.46) is applicable, and a consistent estimate of $\omega_*^2$ is

$$\hat\omega^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\ln \frac{f(y_i|\mathbf{x}_i,\hat\theta)}{g(y_i|\mathbf{x}_i,\hat\gamma)}\right)^2 - \left(\frac{1}{N}\sum_{i=1}^{N}\ln \frac{f(y_i|\mathbf{x}_i,\hat\theta)}{g(y_i|\mathbf{x}_i,\hat\gamma)}\right)^2. \tag{8.48}$$

Thus form

$$T_{LR} = N^{-1/2}\,\mathrm{LR}(\hat\theta,\hat\gamma)/\hat\omega \xrightarrow{d} \mathcal{N}[0,1]. \tag{8.49}$$

For tests with critical value $c$, $H_0$ is rejected in favor of $H_f : \mathrm{E}_0[\ln(f/g)] > 0$ if
$T_{LR} > c$, $H_0$ is rejected in favor of $H_g : \mathrm{E}_0[\ln(f/g)] < 0$ if $T_{LR} < -c$, and discrimination
between the two models is not possible if $|T_{LR}| < c$. The test can be modified
to permit log-likelihood penalties similar to AIC and BIC; see Vuong (1989, p. 316).
An asymptotically equivalent statistic to (8.49) replaces $\hat\omega^2$ by $\tilde\omega^2$ equal to just the first
term in the right-hand side of (8.48).

This  test  assumes  that  both  models  are  misspecified.  If  instead  one  of  the  models  is 
assumed  to  be  correctly  specified,  the  Cox  test  approach  of  Section  8.5.2  needs  to  be 
used. 
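
The statistic in (8.48) and (8.49) needs only the per-observation log-densities of the two fitted models. The following is a minimal sketch, assuming that $\mathrm{LR}(\hat\theta,\hat\gamma)$ in (8.42) is the sum over observations of $\ln f - \ln g$; the function names `vuong_statistic` and `vuong_decision` and the default critical value 1.96 are illustrative choices, not part of the text.

```python
import math
import numpy as np

def vuong_statistic(logf, logg):
    """Return (LR, omega2_hat, T_LR) from per-observation log-densities.

    logf[i] = ln f(y_i | x_i, theta_hat), logg[i] = ln g(y_i | x_i, gamma_hat).
    Assumes LR is the sum of the log-density differences.
    """
    m = np.asarray(logf, dtype=float) - np.asarray(logg, dtype=float)
    n = m.size
    lr = m.sum()                                  # LR(theta_hat, gamma_hat)
    omega2 = np.mean(m ** 2) - np.mean(m) ** 2    # estimate in (8.48)
    t_lr = lr / math.sqrt(n * omega2)             # statistic in (8.49)
    return lr, omega2, t_lr

def vuong_decision(t_lr, c=1.96):
    """Three-way decision rule for a test with critical value c."""
    if t_lr > c:
        return "favor f"
    if t_lr < -c:
        return "favor g"
    return "cannot discriminate"
```

A two-sided rule is used here because the procedure has three possible outcomes rather than the usual two; a value such as $T_{LR} = -0.883$ falls in the inconclusive region.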


Overlapping  Models 

For overlapping models it is not clear a priori as to whether or not $f(y|\mathbf{x},\theta_*) =
g(y|\mathbf{x},\gamma_*)$, and one needs to first test this condition.

Vuong (1989) proposes testing whether or not the variance $\omega_*^2$ defined in (8.47)
equals zero, since $\omega_*^2 = 0$ if and only if $f(\cdot) = g(\cdot)$. Thus compute $\hat\omega^2$ in (8.48). Under
$H_0^\omega : \omega_*^2 = 0$

$$N\hat\omega^2 \xrightarrow{d} M_{p+q}(\lambda_*), \tag{8.50}$$

where the $M_{p+q}(\lambda_*)$ distribution is defined after (8.44). The hypothesis $H_0^\omega$ is rejected at
level $\alpha$ if $N\hat\omega^2$ exceeds the upper $\alpha$ percentile of the $M_{p+q}(\hat\lambda)$ distribution, using the
eigenvalues $\hat\lambda_j$ of the sample analogue of $W$ in (8.45). Alternatively, and more simply,
one can test the conditions that $\theta_*$ and $\gamma_*$ must satisfy for $f(\cdot) = g(\cdot)$. Examples are
given in Lien and Vuong (1987).

If $H_0^\omega$ is not rejected, or the conditions for $f(\cdot) = g(\cdot)$ are not rejected, conclude
that it is not possible to discriminate between the two models given the data. If $H_0^\omega$ is
rejected, or the conditions for $f(\cdot) = g(\cdot)$ are rejected, then test $H_0$ against $H_f$ or $H_g$
using $T_{LR}$ as detailed in the strictly nonnested case. In this latter case the significance
level is at most the maximum of the significance levels for each of the two tests.

This test assumes that both models are misspecified. If instead one of the models is
assumed to be correctly specified, then the other model must also be correctly specified
for the two models to be equivalent. Thus $f(y|\mathbf{x},\theta_*) = g(y|\mathbf{x},\gamma_*)$ under $H_0$, and one
can directly move to the LR test using the weighted chi-square result (8.44). Let $c_1$ and
$c_2$ be upper tail and lower tail critical values, respectively. If $2\,\mathrm{LR}(\hat\theta,\hat\gamma) > c_1$ then $H_0$
is rejected in favor of $H_f$; if $2\,\mathrm{LR}(\hat\theta,\hat\gamma) < c_2$ then $H_0$ is rejected in favor of $H_g$; and
the test is otherwise inconclusive.


8.5.4.  Other  Nonnested  Model  Comparisons 

The preceding methods are restricted to fully parametric models. Methods for discriminating
between models that are only partially parameterized, such as linear regression
without the assumption of normality, are less clear-cut.

The  information  criteria  of  Section  8.5.1  can  be  replaced  by  criteria  developed  using 
loss  functions  other  than  KLIC.  A  variety  of  measures  corresponding  to  different  loss 
functions  are  presented  in  Amemiya  (1980).  These  measures  are  often  motivated  for 
nested  models  but  may  also  be  applicable  to  nonnested  models. 

A simple approach is to compare predictive ability, selecting the model with lowest
value of mean-squared error $(N - q)^{-1}\sum_i (y_i - \hat y_i)^2$. For linear regression this is
equivalent to choosing the model with highest adjusted $R^2$, which is generally viewed
as providing too small a penalty for model complexity. An adaptation for nonparametric
regression is leave-one-out cross-validation (see Section 9.5.3).

Formal  tests  to  discriminate  between  nonnested  models  in  the  nonlikelihood  case 
often  take  one  of  two  approaches.  Artificial  nesting,  proposed  by  Davidson  and 
MacKinnon  (1984),  embeds  the  two  nonnested  models  into  a  more  general  artificial 
model  and  leads  to  so-called  J  tests  and  P  tests  and  related  tests.  The  encompassing 
principle,  proposed  by  Mizon  and  Richard  (1986),  leads  to  a  quite  general  framework 
for  testing  one  model  against  a  competing  nonnested  model.  White  (1994)  links  this 
approach  with  CM  tests.  For  a  summary  of  this  literature  see  Davidson  and  MacKinnon 
(1993,  chapter  11). 


8.5.5.  Nonnested  Models  Example 

A sample of 100 observations is generated from a Poisson model with mean $\mathrm{E}[y|\mathbf{x}] =
\exp(\beta_1 + \beta_2 x_2 + \beta_3 x_3)$, where $x_2, x_3 \sim \mathcal{N}[0,1]$, and $(\beta_1, \beta_2, \beta_3) = (0.5, 0.5, 0.5)$.

Table 8.2. Nonnested Model Comparisons for Poisson Regression Example^a

Test Type                                    Model 1    Model 2    Conclusion
$-2\ln L$                                    366.86     352.18     Model 2 preferred
AIC                                          370.86     358.18     Model 2 preferred
BIC                                          376.07     366.00     Model 2 preferred
$N\hat\omega^2$                              7.84 with p = 0.000   Can discriminate
$T_{LR} = N^{-1/2}\mathrm{LR}/\hat\omega$    -0.883 with p = 0.377 No model favored

^a N = 100. Model 1 is Poisson regression of $y$ on intercept and $x_2$. Model 2 is Poisson regression
of $y$ on intercept, $x_3$, and $x_3^2$. The final two rows are for the Vuong test for overlapping models
(see the text).


The dependent variable $y$ has sample mean 1.92 and standard deviation 1.84. Two
incorrect nonnested models were estimated by Poisson regression:

Model 1: $\mathrm{E}[y|\mathbf{x}] = \exp(\underset{(8.08)}{0.608} + \underset{(4.03)}{0.291}\,x_2)$,

Model 2: $\mathrm{E}[y|\mathbf{x}] = \exp(\underset{(5.14)}{0.493} + \underset{(5.10)}{0.359}\,x_3 + \underset{(1.78)}{0.091}\,x_3^2)$,

where t-statistics are given in parentheses.

The  first  three  rows  of  Table  8.2  give  various  information  criteria,  with  the  model 
with  smallest  value  preferred.  The  first  does  not  penalize  number  of  parameters  and 
favors  model  2.  The  second  and  third  measures  defined  in  (8.40)  and  (8.41)  give  larger 
penalty  to  model  2,  which  has  an  additional  parameter,  but  still  lead  to  the  larger  model 
2  being  favored. 
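
The AIC and BIC rows of Table 8.2 can be checked by hand. The sketch below assumes that (8.40) and (8.41) take the standard forms $\mathrm{AIC} = -2\ln L + 2q$ and $\mathrm{BIC} = -2\ln L + q\ln N$, where $q$ is the number of parameters; these definitions are not restated in this section, but they reproduce the tabulated values. The function names are illustrative.

```python
import math

def aic(minus2lnL, q):
    """Akaike criterion, assuming the form -2 ln L + 2q."""
    return minus2lnL + 2 * q

def bic(minus2lnL, q, n):
    """Bayesian criterion, assuming the form -2 ln L + q ln N."""
    return minus2lnL + q * math.log(n)

# -2 ln L and number of parameters for the two models in Table 8.2
models = {"Model 1": (366.86, 2), "Model 2": (352.18, 3)}
for name, (m2ll, q) in models.items():
    print(name, round(aic(m2ll, q), 2), round(bic(m2ll, q, 100), 2))
```

With $N = 100$ the BIC penalty per parameter is $\ln 100 \approx 4.61$, more than twice the AIC penalty of 2, which is why the gap between the two models narrows from 12.68 under AIC to 10.07 under BIC.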

The final two rows of Table 8.2 summarize Vuong's test, here a test of overlapping
models.

First, test the condition of equality of the densities when evaluated at the pseudo-true
values. The statistic $\hat\omega^2$ in (8.48) is easily computed given expressions for the
densities. The difficult part is computing an estimate of the matrix $W$ in (8.45). For
the Poisson density we can use $\hat A$ and $\hat B$ defined at the end of Section 5.2.3 and
$\hat B_{fg} = N^{-1}\sum_i (y_i - \hat\mu_{fi})\mathbf{x}_{fi} \times (y_i - \hat\mu_{gi})\mathbf{x}_{gi}'$. The eigenvalues of $\hat W$ are $\hat\lambda_1 = 0.29$,
$\hat\lambda_2 = 1.00$, $\hat\lambda_3 = 1.06$, $\hat\lambda_4 = 1.48$, and $\hat\lambda_5 = 2.75$. The p-value for the test statistic
$N\hat\omega^2$ with distribution given in (8.44) is obtained as the proportion of draws of
$\sum_{j=1}^{5}\hat\lambda_j z_j^2$, say 10,000 draws, which exceed $N\hat\omega^2 = 69.14$. Here $p = 0.000 < 0.05$
and we conclude that it is possible to discriminate between the models. The critical
value at level 0.05 in this example equals 16.10, quite a bit higher than $\chi^2_{.05}(5) =
11.07$.
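
The simulated p-value described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names, the seed, and the use of NumPy's default generator are assumptions, and simulated values vary slightly with the seed and the number of draws.

```python
import numpy as np

def weighted_chi2_draws(eigenvalues, n_draws=10_000, seed=0):
    """Draw from sum_j lambda_j * z_j^2 with z_j iid standard normal."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues, dtype=float)
    z2 = rng.standard_normal((n_draws, lam.size)) ** 2   # iid chi-square(1)
    return z2 @ lam

def simulated_p_value(stat, eigenvalues, n_draws=10_000, seed=0):
    """Proportion of simulated draws that exceed the observed statistic."""
    draws = weighted_chi2_draws(eigenvalues, n_draws, seed)
    return float(np.mean(draws > stat))

def simulated_critical_value(alpha, eigenvalues, n_draws=10_000, seed=0):
    """Upper-alpha percentile of the simulated weighted chi-square."""
    draws = weighted_chi2_draws(eigenvalues, n_draws, seed)
    return float(np.quantile(draws, 1.0 - alpha))

lam_hat = [0.29, 1.00, 1.06, 1.48, 2.75]   # eigenvalues reported in the text
print(simulated_p_value(69.14, lam_hat))
print(simulated_critical_value(0.05, lam_hat))
```

Setting all weights to one recovers the ordinary $\chi^2(5)$ distribution, which gives a quick check of the simulation against the tabulated value 11.07.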

Given discrimination is possible, the second test can be applied. Here $T_{LR} =
-0.883$ favors the second model, since it is negative. However, using a standard normal
two-tail test at 5% the difference is not statistically significant. In this example $\hat\omega^2$ is
quite large, which means the first test statistic $N\hat\omega^2$ is large but the second test statistic
$N^{-1/2}\mathrm{LR}(\hat\theta,\hat\gamma)/\hat\omega$ is small.

8.6. Consequences of Testing

In  practice  more  than  one  test  is  performed  before  one  reaches  a  preferred  model.  This 
leads  to  several  complications  that  practitioners  usually  ignore. 


8.6.1.  Pretest  Estimation 

The use of specification tests to choose a model complicates the distribution of an
estimator. For example, suppose we choose between two estimators $\hat\theta$ and $\tilde\theta$ on the
basis of a statistical test at 5%. For instance, $\hat\theta$ and $\tilde\theta$ may be estimators in unrestricted
and restricted models. Then the actual estimator is $\theta^+ = w\hat\theta + (1 - w)\tilde\theta$, where the
random variable $w$ takes value 1 if the test favors $\hat\theta$ and 0 if the test favors $\tilde\theta$. In short,
the estimator depends on the restricted and unrestricted estimators and on a random
variable $w$, which in turn depends on the significance level of the test. Hence $\theta^+$ is an
estimator with complex properties. This is called a pretest estimator, as the estimator
is based on an initial test. The distribution of $\theta^+$ has been obtained for the linear
regression model under normality and is nonstandard.

In theory statistical inference should be based on the distribution of $\theta^+$. In practice
inference is based on the distribution of $\hat\theta$ if $w = 1$ or of $\tilde\theta$ if $w = 0$, ignoring the
randomness in $w$. This is done for simplicity, as even in the simplest models the distribution
of the estimator becomes intractable when several such tests are performed.
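
A small simulation makes the nonstandard distribution of a pretest estimator concrete. The setup below is a deliberately simple illustration and is not taken from the text: the parameter is a mean with known unit variance, the restricted estimator imposes a value of zero, the unrestricted estimator is the sample mean, and the pretest is a two-sided z-test. The function name and all default settings are assumptions.

```python
import numpy as np

def pretest_estimates(theta, n=50, reps=5000, crit=1.96, seed=0):
    """Simulate theta_plus = w * theta_hat + (1 - w) * theta_tilde.

    theta_hat   : sample mean (unrestricted estimator)
    theta_tilde : 0.0 (restricted estimator under H0: theta = 0)
    w           : 1 if the z-test rejects H0, else 0
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(theta, 1.0, size=(reps, n))
    theta_hat = y.mean(axis=1)
    z_stat = np.sqrt(n) * theta_hat                 # variance known to be 1
    w = (np.abs(z_stat) > crit).astype(float)
    theta_tilde = 0.0
    theta_plus = w * theta_hat + (1.0 - w) * theta_tilde
    return theta_plus, w

theta_plus, w = pretest_estimates(theta=0.2)
print("share of replications using the unrestricted estimator:", w.mean())
print("mean of pretest estimator:", theta_plus.mean())
```

The simulated estimator has a point mass at zero together with two separated tails, so it is neither normal nor centered at the true value, which is the sense in which its properties are complex.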


8.6.2.  Order  of  Testing 

Different conclusions can be drawn according to the order in which tests are conducted.

One possible ordering is from general to specific model. For example, one may
estimate a general model for demand before testing restrictions from consumer demand
theory such as homogeneity and symmetry. Or the cycle may go from specific
to general model, with regressors added as needed and additional complications such
as endogeneity controlled for if present. Such orderings are natural when choosing
which regressors to include in a model, but when specification tests are also being
performed it is not uncommon to use both general to specific and specific to general
orderings in the same study.

A related issue is that of joint versus separate tests. For example, the significance
of two regressors can be tested by either two individual t-tests of significance or a
joint F-test or $\chi^2(2)$ test of significance. A general discussion was given in Section
7.2.7 and an example is given later in Section 18.7.


8.6.3.  Data  Mining 

Taken to its extreme, the extensive use of tests to select a model has been called data
mining (Lovell, 1983). For example, one may search among several hundred possible
predictors of $y$ and choose just those predictors that are significant at 5% on a two-sided
test. Computer programs exist that automate such searches and are commonly
used in some branches of applied statistics. Unfortunately, such broad searches will
lead to discovery of spurious relationships, since a test with size 0.05 leads to erroneous
findings of statistical significance 5% of the time. Lovell pointed out that
the application of such a methodology tends to overestimate the goodness-of-fit measures
(e.g., $R^2$) and underestimate the sampling variances of regression coefficients,
even when it succeeds in uncovering the variables that feature in the data-generating
process. Using standard tests and reporting p-values without taking account of the
model-search procedure is misleading because nominal and actual p-values are not
the same. White (2001b) and Sullivan, Timmermann, and White (2001) show how to
use bootstrap methods to calculate the true statistical significance of regressors. See
also P. Hansen (2003).
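
The 5% false-discovery rate is easy to see in a simulation. In the sketch below the dependent variable and all candidate predictors are independent noise, so every "significant" finding is spurious. The design (100 observations, 200 candidates, one bivariate regression per candidate, normal critical value 1.96) and the function name are illustrative assumptions.

```python
import numpy as np

def count_spurious(n=100, n_candidates=200, crit=1.96, seed=0):
    """Count candidates with |t| > crit when none is related to y."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, n_candidates))     # all independent of y
    count = 0
    for j in range(n_candidates):
        Z = np.column_stack([np.ones(n), X[:, j]])  # intercept + one candidate
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        s2 = resid @ resid / (n - 2)                # error variance estimate
        se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
        if abs(beta[1] / se) > crit:
            count += 1
    return count

print(count_spurious(), "of 200 pure-noise predictors appear significant")
```

With 200 candidates one expects about ten such findings, and a search procedure that reports only those ten would present a model that looks well supported by conventional t-statistics.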

The motivation for data mining is sometimes to conserve degrees of freedom or
to avoid overparameterization ("clutter"). More importantly, many aspects of specification,
such as the functional form of covariates, are left unresolved by underlying
theory. Given specification uncertainty, justification exists for specification searching
(Sargan, 2001). However, care needs to be taken especially if small samples are analyzed
and the number of specification searches is large relative to the sample size.
When the specification search is sequential, with a large number of steps, and with
each step determined by a previous test outcome, the statistical properties of the procedure
as a whole are complex and analytically intractable.


8.6.4.  A  Practical  Approach 

Applied microeconometrics research generally minimizes the problem of pretest estimation
by making judicious use of hypothesis tests. Economic theory is used to guide
the selection of regressors, to greatly reduce the number of potential regressors. If the
sample size is large there is little purpose served by dropping "insignificant" variables.
Final results often use regressions that include statistically insignificant regressors for
control variables, such as region, industry, and occupation dummies in an earnings
regression. Clutter can be avoided by not reporting unimportant coefficients in a full
model specification but noting that fact in an appropriate place. This can lead to some
loss of precision in estimating the key regressors of interest, such as years of schooling
in an earnings regression, but guards against bias caused by erroneously dropping
variables that should be included.

Good practice is to use only part of the sample ("training sample") for specification
searches and model selection, and then report results using the preferred model estimated
using a completely separate part of the sample ("estimation sample"). In such
circumstances pretesting does not affect the distribution of the estimator, if the subsamples
are independent. This procedure is usually only implemented when sample
sizes are very large, because using less than the full sample in final estimation leads to
a loss in estimator precision.

8.7. Model Diagnostics

In this section we discuss goodness-of-fit measures and definitions of residuals in nonlinear
models. Useful measures are those that reveal model deficiency in some particular
dimension.

8.7.1. Pseudo-$R^2$ Measures

Goodness  of  fit  is  interpreted  as  closeness  of  fitted  values  to  sample  values  of  the 
dependent  variable. 

For linear models with $K$ regressors the most direct measure is the standard error
of the regression, which is the estimated standard deviation of the error term,

$$s = \left[(N - K)^{-1}\sum_i (y_i - \hat y_i)^2\right]^{1/2}.$$

For example, a standard error of regression of 0.10 in a log-earnings regression means
that approximately 95% of the fitted values are within 0.20 of the actual value of
log-earnings, or within 22% of actual earnings using $e^{0.2} \simeq 1.22$. This measure is the
same as the in-sample root mean squared error where $\hat y_i$ is viewed as a forecast of
$y_i$, aside from a degrees of freedom correction. Alternatively, one can use the mean
absolute error $(N - K)^{-1}\sum_i |y_i - \hat y_i|$. The same measures can be used for nonlinear
regression models, provided the nonlinear models lead to a predicted value $\hat y_i$ of the
dependent variable.

A related measure in linear models is $R^2$, the coefficient of multiple determination.
This measures the fraction of variation of the dependent variable explained by the
regressors. The statistic $R^2$ is more commonly reported than $s$, even though $s$ may be
more informative in evaluating the goodness of fit.

A pseudo-$R^2$ is an extension of $R^2$ to nonlinear regression models. There are several
interpretations of $R^2$ in the linear model. These lead to several possible pseudo-$R^2$
measures that in nonlinear models differ and do not necessarily have the properties of
lying between zero and one and increasing as regressors are added. We present several
of these measures that, for simplicity, are not adjusted for degrees of freedom.

One approach bases $R^2$ on decomposition of the total sum of squares (TSS), with

$$\sum_i (y_i - \bar y)^2 = \sum_i (y_i - \hat y_i)^2 + \sum_i (\hat y_i - \bar y)^2 + 2\sum_i (y_i - \hat y_i)(\hat y_i - \bar y).$$

The first sum in the right-hand side is the residual sum of squares (RSS) and the second
term is the explained sum of squares (ESS). This leads to two possible measures:

$$R^2_{RES} = 1 - \mathrm{RSS}/\mathrm{TSS},$$

$$R^2_{EXP} = \mathrm{ESS}/\mathrm{TSS}.$$

For OLS regression in the linear model with intercept the third sum equals zero, so
$R^2_{RES} = R^2_{EXP}$. However, this simplification does not occur in other models and in general
$R^2_{RES} \neq R^2_{EXP}$ in nonlinear models. The measure $R^2_{RES}$ can be less than zero, $R^2_{EXP}$
can exceed one, and both measures may decrease as regressors are added though $R^2_{RES}$
will increase for NLS regression of the nonlinear model as then the estimator is minimizing
RSS.

A closely related measure uses

$$R^2_{COR} = \mathrm{Cor}^2[y_i, \hat y_i],$$

the squared correlation between actual and fitted values. The measure $R^2_{COR}$ lies between
zero and one and equals $R^2$ in OLS regression for the linear model with intercept.
In nonlinear models $R^2_{COR}$ can decrease as regressors are added.

A third approach uses weighted sums of squares that control for the intrinsic heteroskedasticity
of cross-section data. Let $\hat\sigma_i^2$ be the fitted conditional variance of $y_i$,
where it is assumed that heteroskedasticity is explicitly modeled as is the case for
FGLS and for models such as logit and Poisson. Then we can use

$$R^2_{WSS} = 1 - \mathrm{WRSS}/\mathrm{WTSS},$$

where the weighted residual sum of squares $\mathrm{WRSS} = \sum_i (y_i - \hat y_i)^2/\hat\sigma_i^2$, $\mathrm{WTSS} =
\sum_i (y_i - \hat\mu)^2/\hat\sigma^2$, and $\hat\mu$ and $\hat\sigma^2$ are the estimated mean and variance in the intercept-only
model. This can be called a Pearson $R^2$ because WRSS equals the Pearson
statistic, which, aside from any finite-sample corrections, should equal $N$ if heteroskedasticity
is correctly modeled. Note that $R^2_{WSS}$ can be less than zero and decrease
as regressors are added.

A fourth approach is a generalization of $R^2$ to objective functions other than the sum
of squared residuals. Let $Q_N(\theta)$ denote the objective function being maximized, $Q_0$
denote its value in the intercept-only model, $Q_{fit}$ denote the value in the fitted model,
and $Q_{max}$ denote the largest possible value of $Q_N(\theta)$. Then the maximum potential
gain in the objective function resulting from inclusion of regressors is $Q_{max} - Q_0$ and
the actual gain is $Q_{fit} - Q_0$. This suggests the measure

$$R^2_{RG} = \frac{Q_{fit} - Q_0}{Q_{max} - Q_0} = 1 - \frac{Q_{max} - Q_{fit}}{Q_{max} - Q_0},$$

where the subscript RG means relative gain. For least-squares estimation the objective
function maximized is minus the residual sum of squares. Then $Q_0 = -\mathrm{TSS}$, $Q_{fit} =
-\mathrm{RSS}$, and $Q_{max} = 0$, so $R^2_{RG} = 1 - \mathrm{RSS}/\mathrm{TSS}$ for OLS or NLS regression, which
equals ESS/TSS for OLS regression with intercept. The measure
$R^2_{RG}$ has the advantage of lying between zero and one and increasing as regressors are
added. For ML estimation the objective function is $Q_N(\theta) = \ln L_N(\theta)$. Then $R^2_{RG}$ cannot
always be used as in some models there may be no bound on $Q_{max}$. For example, for
the linear model under normality $L_N(\beta, \sigma^2) \to \infty$ as $\sigma^2 \to 0$. For ML and quasi-ML
estimation of linear exponential family models, such as logit and Poisson, $Q_{max}$ is
usually known and $R^2_{RG}$ can be shown to be an $R^2$ based on the deviance residuals
defined in the next section.

A related measure to $R^2_{RG}$ is $R^2_Q = 1 - Q_{fit}/Q_0$. This measure increases as regressors
are added. It equals $R^2_{RG}$ if $Q_{max} = 0$, which is the case for OLS regression
and for binary and multinomial models. Otherwise, for discrete data this measure
may have upper bound less than one, whereas for continuous data the measure
may not be bounded between zero and one as the log-likelihood can be negative or
positive. For example, for ML estimation with continuous density it is possible that
$Q_0 = 1$ and $Q_{fit} = 4$, leading to $R^2_Q = -3$, or that $Q_0 = -1$ and $Q_{fit} = 4$, leading
to $R^2_Q = 5$.

For nonlinear models there is therefore no universal pseudo-$R^2$. The most useful
measures may be $R^2_{COR}$, as correlation coefficients are easily interpreted, and $R^2_{RG}$ in
special cases that $Q_{max}$ is known. Cameron and Windmeijer (1997) analyze many of
the measures and Cameron and Windmeijer (1996) apply these measures to count data
models.
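
The measures in this section are short to compute once fitted values and objective-function values are available. The sketch below is illustrative only: the function names are assumptions, and `poisson_loglik` is a helper written for this example that uses the convention $y \ln y = 0$ when $y = 0$, so that calling it with the fitted mean set equal to $y$ gives $Q_{max}$ for the Poisson.

```python
import math
import numpy as np

def r2_res(y, yhat):
    """1 - RSS/TSS."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_exp(y, yhat):
    """ESS/TSS, with ESS measured around the sample mean of y."""
    return np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_cor(y, yhat):
    """Squared correlation between actual and fitted values."""
    return np.corrcoef(y, yhat)[0, 1] ** 2

def r2_rg(q_fit, q0, q_max):
    """Relative gain in the objective function."""
    return (q_fit - q0) / (q_max - q0)

def r2_q(q_fit, q0):
    """1 - Q_fit/Q_0."""
    return 1.0 - q_fit / q0

def poisson_loglik(y, mu):
    """Poisson log-likelihood; a term with y = 0 contributes -mu only."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    safe_mu = np.where(y > 0, mu, 1.0)             # avoid log(0) when y = 0
    ylogmu = np.where(y > 0, y * np.log(safe_mu), 0.0)
    lfact = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.sum(ylogmu - mu - lfact))

# The two continuous-density cases mentioned in the text
print(r2_q(4.0, 1.0), r2_q(4.0, -1.0))
```

For a Poisson fit one would pass `poisson_loglik(y, mu_hat)` as `q_fit`, the value at the intercept-only fit as `q0`, and `poisson_loglik(y, y)` as `q_max`.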


8.7.2.  Residual  Analysis 

Microeconometrics analysis actually places little emphasis on residual analysis, compared
to some other areas of statistics. If data sets are small then there is concern that
residual analysis may lead to overfitting of the model. If the data set is large then
there is a belief that residual analysis may be unnecessary as a single observation will
have little impact on the analysis. We therefore give a brief summary. A more extensive
discussion is given in, for example, McCullagh and Nelder (1989) and Cameron
and Trivedi (1998, chapter 5). Econometricians have had particular interest in defining
residuals in censored and truncated models.

A wide range of residuals have been proposed for nonlinear regression models.
Consider a scalar dependent variable $y_i$ with fitted value $\hat y_i = \hat\mu_i = \mu(\mathbf{x}_i, \hat\theta)$. The raw
residual is $r_i = y_i - \hat\mu_i$. The Pearson residual is the obvious correction for heteroskedasticity
$p_i = (y_i - \hat\mu_i)/\hat\sigma_i$, where $\hat\sigma_i^2$ is an estimate of the conditional variance
of $y_i$. This requires a specification of the variance for $y_i$, which is done for models
such as the Poisson. For an LEF density (see Section 5.7.3) the deviance residual is
$d_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{2[l(y_i) - l(\hat\mu_i)]}$, where $l(y)$ denotes the log-density of $y|\mu$ evaluated
at $\mu = y$ and $l(\hat\mu)$ denotes evaluation at $\mu = \hat\mu$. A motivation for the deviance
residual is that the sum of squares of these residuals is the deviance statistic that is
the generalization for LEF models of the sum of squared raw residuals in the linear model. The
Anscombe residual is defined to be the transformation of $y$ that is closest to normality,
then standardized to mean zero and variance 1. This transformation has been obtained
for LEF densities.
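
For the Poisson case the three residuals have closed forms. The sketch below assumes the Poisson variance specification (conditional variance equal to the conditional mean), so that $\hat\sigma_i = \sqrt{\hat\mu_i}$, and uses $l(y_i) - l(\hat\mu_i) = y_i \ln(y_i/\hat\mu_i) - (y_i - \hat\mu_i)$ with $y \ln y = 0$ at $y = 0$. The function name is illustrative.

```python
import numpy as np

def poisson_residuals(y, mu):
    """Raw, Pearson, and deviance residuals for a fitted Poisson mean mu > 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    raw = y - mu
    pearson = raw / np.sqrt(mu)                     # variance equals mean
    safe_y = np.where(y > 0, y, 1.0)                # avoid log(0) when y = 0
    ylog = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    dev_sq = 2.0 * (ylog - raw)                     # 2[l(y) - l(mu)] >= 0
    deviance = np.sign(raw) * np.sqrt(np.maximum(dev_sq, 0.0))
    return raw, pearson, deviance

raw, pearson, deviance = poisson_residuals([0, 1, 4], [1.0, 1.0, 2.0])
print(raw, pearson, deviance)
```

The sum of the squared deviance residuals is the deviance statistic, which is why these residuals play the role that raw residuals play in least squares.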

Small-sample corrections to residuals have been proposed to account for estimation
error in $\hat\mu_i$. For the linear model this entails division of residuals by $\sqrt{1 - h_{ii}}$,
where $h_{ii}$ is the $i$th diagonal entry in the hat matrix $H = X(X'X)^{-1}X'$. These residuals
are felt to have better finite-sample performance. Since $H$ has rank $K$, the number
of regressors, the average value of $h_{ii}$ is $K/N$ and values of $h_{ii}$ in excess of
$2K/N$ are viewed as having high leverage. These results extend to LEF models
with $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$, where $W = \mathrm{Diag}[w_{ii}]$ and $w_{ii} = [g'(\mathbf{x}_i'\beta)]^2/\sigma_i^2$ with
$g(\mathbf{x}_i'\beta)$ and $\sigma_i^2$ the specified conditional mean and variance, respectively. McCullagh
and Nelder (1989) provide a summary.
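
For the linear model the leverage values can be obtained without forming the full $N \times N$ hat matrix. The sketch below uses a reduced QR decomposition, $X = QR$, for which $H = QQ'$ when $X$ has full column rank; this numerical route and the function names are choices made for this example rather than anything prescribed in the text.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X' via reduced QR."""
    X = np.asarray(X, dtype=float)
    Q, _ = np.linalg.qr(X)            # Q is N x K with orthonormal columns
    return np.sum(Q ** 2, axis=1)     # h_ii = squared norm of row i of Q

def high_leverage(X):
    """Indices of observations with h_ii above the 2K/N rule of thumb."""
    n, k = np.asarray(X).shape
    return np.flatnonzero(leverage(X) > 2.0 * k / n)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.standard_normal(30)])
X[0, 1] = 8.0                         # one extreme regressor value
print(high_leverage(X))
```

Dividing each raw residual by $\sqrt{1 - h_{ii}}$ then gives the small-sample corrected residuals described above.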

More generally, Cox and Snell (1968) define a generalized residual to be any scalar
function $r_i = r(y_i, \mathbf{x}_i, \hat\theta)$ that satisfies some relatively weak conditions. One way that
such residuals arise is that many estimators have first-order conditions of the form
$\sum_i \mathbf{g}(\mathbf{x}_i, \hat\theta)\,r(y_i, \mathbf{x}_i, \hat\theta) = \mathbf{0}$, where $y_i$ appears in the scalar $r(\cdot)$ but not in the vector $\mathbf{g}(\cdot)$.
See also White (1994).

For regression models based on a normal latent variable (see Chapters 14 and 16)
Chesher and Irish (1987) propose using $\mathrm{E}[\varepsilon_i^*|y_i]$ as the residual, where $y_i^* = \mu_i + \varepsilon_i^*$
is the unobserved latent variable and $y_i = g(y_i^*)$ is the observed dependent variable.
Particular choices of $g(\cdot)$ correspond to the probit and Tobit models. Gourieroux et al.
(1987) generalize this approach to LEF densities. A natural approach in this context
is to treat residuals as missing data, along the lines of the expectation maximization
algorithm in Section 10.3.

A common use of residuals is in plots against other variables of interest. Plots of
residuals against fitted values can reveal poor model fit; plots of residuals against omitted
variables can suggest further regressors to include in the model; and plots of residuals
against included regressors can suggest need for a different functional form. It can
be helpful to include a nonparametric regression line in such plots (see Chapter 9). If
data take only a few discrete values the plots can be difficult to interpret because of
clustering at just a few values, and it can be helpful to use a so-called jitter feature that
adds some random noise to the data to reduce the clustering.

Some parametric models imply that an appropriately defined residual should be
normally distributed. This can be checked by a normal scores plot that orders residuals
$r_i$ from smallest to largest and plots them against the values predicted if the residuals
were exactly normally distributed. Thus plot ordered $r_i$ against $\bar r + s_r\Phi^{-1}((i -
0.5)/N)$, where $\bar r$ and $s_r$ are the sample mean and standard deviation of $r$ and $\Phi^{-1}(\cdot)$
is the inverse of the standard normal cdf.
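
The coordinates for a normal scores plot follow directly from the formula above. The sketch below uses the inverse normal cdf from the Python standard library; the function name is illustrative, and the use of the $N - 1$ divisor for the sample standard deviation is an assumption, since the text does not specify the divisor.

```python
from statistics import NormalDist
import numpy as np

def normal_scores(resid):
    """Return (ordered residuals, values predicted under exact normality)."""
    r = np.sort(np.asarray(resid, dtype=float))
    n = r.size
    r_bar = r.mean()
    s_r = r.std(ddof=1)                              # assumed N - 1 divisor
    probs = (np.arange(1, n + 1) - 0.5) / n          # (i - 0.5)/N
    z = np.array([NormalDist().inv_cdf(p) for p in probs])
    return r, r_bar + s_r * z

ordered, predicted = normal_scores([0.3, -1.2, 2.1, 0.0, -0.4])
for a, b in zip(ordered, predicted):
    print(f"{a:6.2f} {b:6.2f}")
```

Points lying close to the 45-degree line in a plot of `ordered` against `predicted` are consistent with normality; systematic curvature suggests skewness or heavy tails.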


8.7.3.  Diagnostics  Example 

Table 8.3 uses the same data-generating process as in Section 8.5.5. The dependent
variable $y$ has sample mean 1.92 and standard deviation 1.84. Poisson regression of $y$
on $x_3$ and of $y$ on $x_3$ and $x_3^2$ yields

Model 1: $\mathrm{E}[y|\mathbf{x}] = \exp(\underset{(5.20)}{0.586} + \underset{(7.60)}{0.389}\,x_3)$,

Model 2: $\mathrm{E}[y|\mathbf{x}] = \exp(\underset{(5.14)}{0.493} + \underset{(5.10)}{0.359}\,x_3 + \underset{(1.78)}{0.091}\,x_3^2)$,

where t-statistics are given in parentheses.

In this example all $R^2$ measures increase with addition of $x_3^2$ as regressor, though
by quite different amounts given that in this example all but the last $R^2$ have similar
values. More generally the first three $R^2$ are scaled similarly and $R^2_{RES}$ and $R^2_{COR}$ can
be quite close, but the remaining three measures are scaled quite differently. Only the
last two $R^2$ measures are guaranteed to increase as a regressor is added, unless the
objective function is the sum of squared errors. The measure $R^2_{RG}$ can be constructed
here, as the Poisson log-likelihood is maximized if the fitted mean $\hat{\mu}_i = y_i$ for all $i$,
leading to $Q_{\max} = \sum_i [y_i \ln y_i - y_i - \ln y_i!]$, where $y \ln y = 0$ when $y = 0$.
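The construction of $R^2_{RG}$ for a Poisson model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind Table 8.3: the function names, the data, and the fitted means are made up, and the intercept-only fit is taken to be the sample mean of $y$ (the Poisson MLE in that model).

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood: sum of y_i ln(mu_i) - mu_i - ln(y_i!)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The term y ln(mu) is set to 0 when y = 0, matching the convention y ln y = 0.
        ylogmu = yi * math.log(mi) if yi > 0 else 0.0
        total += ylogmu - mi - math.lgamma(yi + 1)   # lgamma(y + 1) = ln(y!)
    return total

def r2_rg(y, mu_fit):
    """(Q_fit - Q_0) / (Q_max - Q_0) with the log-likelihood as objective function."""
    y_bar = sum(y) / len(y)
    q_fit = poisson_loglik(y, mu_fit)
    q_0 = poisson_loglik(y, [y_bar] * len(y))   # intercept-only model
    q_max = poisson_loglik(y, y)                # fitted mean equal to y_i for all i
    return (q_fit - q_0) / (q_max - q_0)

y = [0, 1, 2, 4, 3]                   # illustrative counts
mu_hat = [0.5, 1.2, 2.1, 3.5, 2.7]    # illustrative fitted means
print(round(r2_rg(y, mu_hat), 3))
```

The measure equals 0 for the intercept-only fit and 1 when the fitted means reproduce the data exactly.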

Additionally, three residuals were calculated for the second model. The sample
mean and standard deviation of residuals were, respectively, 0 and 1.65 for the raw

Table 8.3. Pseudo $R^2$s: Poisson Regression Example^a

Diagnostic                                          Model 1   Model 2   Difference
$s$, where $s^2 = RSS/(N - K)$                      0.1662    0.1661     0.0001
$R^2_{RES} = 1 - RSS/TSS$                           0.1885    0.1962    +0.0077
$R^2_{EXP} = ESS/TSS$                               0.1667    0.2087    +0.0402
$R^2_{COR} = Cor^2[y_i, \hat{y}_i]$                 0.1893    0.1964    +0.0067
$R^2_{WSS} = 1 - WRSS/WTSS$                         0.1562    0.1695    +0.0233
$R^2_{RG} = (Q_{fit} - Q_0)/(Q_{\max} - Q_0)$       0.1552    0.1712    +0.0160
$R^2_Q = 1 - Q_{fit}/Q_0$                           0.0733    0.0808    +0.0075

a $N = 100$. Model 1 is Poisson regression of $y$ on intercept and $x_3$. Model 2 is Poisson regression of $y$
on intercept, $x_3$, and $x_3^2$. RSS is residual sum of squares (SS), ESS is explained SS, TSS is total sum
of squares, WRSS is weighted RSS, WTSS is weighted TSS, $Q_{fit}$ is fitted value of objective function,
$Q_0$ is fitted value in intercept-only model, and $Q_{\max}$ is the maximum possible value of the objective
function given the data and exists only for some objective functions.


residuals, 0.01 and 1.97 for the Pearson residuals, and -0.21 and 1.22 for the deviance
residuals. The zero mean for the raw residual is a property of Poisson regression with
intercept included that is shared by very few other models. The larger standard deviation
of the raw residuals reflects the lack of scaling and the fact that here the standard
deviation of $y$ exceeds 1. The correlations between pairs of these residuals all exceed
0.96. This is likely to happen when $R^2$ is low so that $\hat{y}_i \simeq \bar{y}$.
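The three residuals can be computed as follows. This sketch assumes the standard generalized linear model definitions for the Poisson case (raw $y_i - \hat{\mu}_i$, Pearson $(y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i}$, and deviance $\mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{2[y_i \ln(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]}$); the function name and the numbers are illustrative and are not the data of this example.

```python
import math

def poisson_residuals(y, mu):
    """Return lists of raw, Pearson and deviance residuals for Poisson fitted means mu."""
    raw, pearson, deviance = [], [], []
    for yi, mi in zip(y, mu):
        r = yi - mi
        raw.append(r)
        pearson.append(r / math.sqrt(mi))            # scale by the Poisson standard deviation
        ylog = yi * math.log(yi / mi) if yi > 0 else 0.0   # y ln y = 0 when y = 0
        d = 2.0 * (ylog - r)                         # contribution to the deviance
        deviance.append(math.copysign(math.sqrt(max(d, 0.0)), r))
    return raw, pearson, deviance

raw, pearson, deviance = poisson_residuals([0, 2, 5], [1.0, 2.0, 3.0])
print(raw, pearson, deviance)
```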

8.8.  Practical  Considerations 

m-Tests  and  Hausman  tests  are  most  easily  implemented  by  use  of  auxiliary  regres¬ 
sions.  One  should  be  aware  that  these  auxiliary  regressions  may  be  valid  only  under 
distributional  assumptions  that  are  stronger  than  those  made  to  obtain  the  usual  robust 
standard  errors  of  regression  coefficients.  Some  robust  tests  have  been  presented  in 
Section  8.4. 

With a large enough data set and fixed significance level such as 5% the sample moment
conditions implied by a model will be rejected, except in the unrealistic case that
all aspects of the model (functional form, regressors, and distribution) are correctly
specified. In classical testing situations this is often a desired result. In particular, with
a  large  enough  sample,  regression  coefficients  will  always  be  significantly  different 
from  zero  and  many  studies  seek  such  a  result.  However,  for  specification  tests  the 
desire  is  usually  to  not  reject,  so  that  one  can  say  that  the  model  is  correctly  specified. 
Perhaps  for  this  reason  specification  tests  are  under-utilized. 

As an illustration, consider tests of correct specification of life-cycle models of
consumption. Unless samples are small a dedicated specification tester is likely to
reject the model at 5%. For example, suppose a model specification test statistic
that is $\chi^2(12)$ distributed has a p-value of 0.02 when applied to a sample with
$N = 3{,}000$. It is not clear that the life-cycle model is providing a poor explanation of the
data, even though it would be formally rejected at the 5% significance level. One
possibility is to increase the critical value as sample size increases using BIC (see
Section 8.5.1).
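To see how much a BIC-type rule raises the bar, the sketch below compares the fixed 5% critical value with a threshold that grows with $N$. It assumes the common form BIC $= -2\ln L + q \ln N$, so that $q$ additional restrictions are rejected only if the test statistic exceeds $q \ln N$; applying this to a general specification test statistic is a heuristic, and the function name is hypothetical.

```python
import math

def bic_threshold(num_restrictions, n_obs):
    """Critical value q * ln(N) implied by a BIC penalty of ln(N) per parameter."""
    return num_restrictions * math.log(n_obs)

CHI2_12_CRIT_5PCT = 21.03   # tabulated 5% critical value of the chi-square(12) distribution

for n in (100, 3000, 100000):
    print(n, CHI2_12_CRIT_5PCT, round(bic_threshold(12, n), 2))
```

With $N = 3{,}000$ and 12 restrictions the BIC-type threshold is roughly 96, far above the fixed critical value of about 21, so a statistic with a p-value of 0.02 would not lead to rejection under this rule.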

Another  reason  for  underutilization  of  specification  tests  is  difficulty  in  computation 
and  poor  size  property  of  tests  when  more  convenient  auxiliary  regressions  are  used 
to  implement  an  asymptotically  equivalent  version  of  a  test.  These  drawbacks  can  be 
greatly  reduced  by  use  of  the  bootstrap.  Chapter  11  presents  bootstrap  methods  to 
implement  many  of  the  tests  given  in  this  chapter. 

8.9.  Bibliographic  Notes 

8.2  The  conditional  moment  test,  due  to  Newey  (1985)  and  Tauchen  (1985),  is  a  generalization 
of  the  information  matrix  test  of  White  (1982).  For  ML  estimation,  the  computation  of  the 
m-test  by  auxiliary  regression  generalizes  methods  of  Lancaster  (1984)  and  Chesher  (1984) 
for  the  IM  test.  A  good  overview  of  m-tests  is  given  in  Pagan  and  Vella  (1989).  The  m-test 
provides  a  very  general  framework  for  viewing  testing.  It  can  be  shown  to  nest  all  tests, 
such  as  Wald,  LM,  LR,  and  Hausman  tests.  This  unifying  element  is  emphasized  in  White 
(1994). 

8.3  The  Hausman  test  was  proposed  by  Hausman  (1978),  with  earlier  references  already  given 
in  Section  8.3  and  a  good  survey  provided  by  Ruud  (1984). 

8.4  The econometrics texts by Greene (2003), Davidson and MacKinnon (1993), and Wooldridge
(2002) present many of the standard specification tests.

8.5  Pesaran  and  Pesaran  (1993)  discuss  how  the  Cox  (1961,  1962b)  nonnested  test  can  be 
implemented  when  an  analytical  expression  for  the  expectation  of  the  log-likelihood  is  not 
available.  Alternatively,  the  test  of  Vuong  (1989)  can  be  used. 

8.7  Model  diagnostics  for  nonlinear  models  are  often  obtained  by  extension  of  results  for  the 
linear  regression  model  to  generalized  linear  models  such  as  logit  and  Poisson  models.  A 
detailed  discussion  with  references  to  the  literature  is  given  in  Cameron  and  Trivedi  (1998, 
Chapter  5). 


- Exercises - 

8-1  Suppose $y = x'\beta + u$, where $u \sim \mathcal{N}[0, \sigma^2]$, with parameter vector $\theta = [\beta', \sigma^2]'$ and
density $f(y|\theta) = (1/\sqrt{2\pi}\sigma)\exp[-(y - x'\beta)^2/2\sigma^2]$. We have a sample of $N$ independent
observations.

(a)  Explain why a test of the moment condition $E[x(y - x'\beta)^3] = 0$ is a test of the
assumption of normally distributed errors.

(b)  Give the expressions for $m_i$ and $s_i$ given in (8.5) necessary to implement the
m-test based on the moment condition in part (a).

(c)  Suppose dim[x] = 10, $N = 100$, and the auxiliary regression in (8.5) yields an
uncentered $R^2$ of 0.2. What do you conclude at level 0.05?

(d)  For  this  example  give  the  moment  conditions  tested  by  White’s  information 
matrix  test. 

8-2  Consider the multinomial version of the PCGF test given in (8.23) with $p_j$ replaced
by $\hat{p}_j = N^{-1}\sum_i F_j(x_i, \hat{\theta})$. Show that PCGF can be expressed as CGF in (8.27)
with $V = \mathrm{Diag}[N\hat{p}_j]$. [Conclude that in the multinomial case Andrews’ test statistic
simplifies to Pearson’s statistic.]

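8-3  (Adapted from Amemiya, 1985). For the Hausman test given in Section 8.4.1 let
$V_{11} = V[\hat{\theta}]$, $V_{22} = V[\tilde{\theta}]$, and $V_{12} = Cov[\hat{\theta}, \tilde{\theta}]$.

(a)  Show that the estimator $\bar{\theta} = \hat{\theta} + [V_{11} - V_{12}][V_{11} + V_{22} - 2V_{12}]^{-1}(\tilde{\theta} - \hat{\theta})$ has asymptotic
variance matrix $V[\bar{\theta}] = V_{11} - [V_{11} - V_{12}][V_{11} + V_{22} - 2V_{12}]^{-1}[V_{11} - V_{12}]$.

(b)  Hence show that $V[\bar{\theta}]$ is less than $V[\hat{\theta}]$ in the matrix sense unless $Cov[\hat{\theta}, \tilde{\theta}] = V[\hat{\theta}]$.

(c)  Now suppose that $\hat{\theta}$ is fully efficient. Can $V[\bar{\theta}]$ be less than $V[\hat{\theta}]$? What do
you conclude?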

8-4  Suppose that two models are non-nested and there are $N = 200$ observations.
For model 1, the number of parameters $q = 10$ and $\ln L = -400$. For model 2,
$q = 10$ and $\ln L = -380$.

(a)  Which  model  is  favored  using  AIC? 

(b)  Which  model  is  favored  using  BIC? 

(c)  Which  model  would  be  favored  if  the  models  were  actually  nested  and  we 
used  a  likelihood  ratio  test  at  level  0.05? 

8-5  Use the health expenditure data of Section 16.6. The model is a probit regression
of DMED, an indicator variable for positive health expenditures, against the
17 regressors listed in the second paragraph of Section 16.6. You should obtain
the estimates given in the first column of Table 16.1.

(a)  Test  the  joint  statistical  significance  of  the  self-rated  health  indicators  HLTHG, 
HLTHF,  and  HLTHP  at  level  0.05  using  a  Hausman  test.  [This  may  require 
some  additional  coding,  depending  on  the  package  used.] 

(b)  Is  the  Hausman  test  the  best  test  to  use  here? 

(c)  Does  an  information  matrix  test  at  level  0.05  support  the  restrictions  of  this 
model?  [This  will  require  some  additional  coding.] 

(d)  Discriminate between a model that drops HLTHG, HLTHF, and HLTHP and a
model that drops LC, IDP, and LPI on the basis of $R^2_{RES}$, $R^2_{EXP}$, $R^2_{COR}$, and
$R^2_{RG}$.

Semiparametric Methods


9.1.  Introduction 

In  this  chapter  we  present  methods  for  data  analysis  that  require  less  model  specifica¬ 
tion  than  the  methods  of  the  preceding  chapters. 

We begin with nonparametric estimation. This makes very minimal assumptions
regarding the process that generated the data. One leading example is estimation of
a continuous density using a kernel density estimate. This has the attraction of providing
a smoother estimate than the familiar histogram. A second leading example
is nonparametric regression, such as kernel regression, on a scalar regressor. This
places a flexible curve on an $(x, y)$ scatterplot with no parametric restrictions on the
form of the curve. Nonparametric estimates have numerous uses, including data description,
exploratory analysis of data and of fitted residuals from a regression model,
and summary across simulations of parameter estimates obtained from a Monte Carlo
study.

Econometric  analysis  emphasizes  multivariate  regression  of  a  scalar  y  on  a  vector 
of  regressors  x.  However,  nonparametric  methods,  although  theoretically  possible  with 
an  infinitely  large  sample,  break  down  in  practice  because  the  data  need  to  be  sliced  in 
several  dimensions,  leading  to  too  few  data  points  in  each  slice. 

As a result econometricians have focused on semiparametric methods. These combine
a parametric component, greatly reducing the dimensionality, with a nonparametric
component. One important application is to permit more flexible models of the
conditional mean. For example, the conditional mean $E[y|x]$ may be parameterized to
be of the single-index form $g(x'\beta)$, where the functional form for $g(\cdot)$ is not specified
but is instead nonparametrically estimated, along with the unknown parameters $\beta$. Another
important application relaxes distributional assumptions that if misspecified lead
to inconsistent parameter estimates. For example, we may wish to obtain consistent
estimates of $\beta$ in a linear regression model $y = x'\beta + \varepsilon$ when data on $y$ are truncated
or censored (see Chapter 16), without having to correctly specify the particular
distribution of the error term $\varepsilon$.

The asymptotic theory for nonparametric methods differs from that for more parametric
methods. Estimates are obtained by cutting the data into ever smaller slices as
$N \to \infty$ and estimating local behavior within each slice. Since fewer than $N$ observations
are being used in estimating each slice the convergence rate is slower than that
obtained in the preceding chapters. Nonetheless, in the simplest cases nonparametric
estimates are still asymptotically normally distributed. In some leading cases of
semiparametric regression the estimators of parameters $\beta$ have the usual property of
converging at rate $N^{-1/2}$, so that scaling by $\sqrt{N}$ leads to a limit normal distribution,
whereas the nonparametric component of the model converges at a slower rate $N^{-r}$,
$r < 1/2$.

Because  nonparametric  methods  are  local  averaging  methods,  different  choices  of 
localness  lead  to  different  finite-sample  results.  In  some  restrictive  cases  there  are  rules 
or  methods  to  determine  the  bandwidth  or  window  width  used  in  local  averaging,  just 
as  there  are  rules  for  determining  the  number  of  bins  in  a  histogram  given  the  number 
of  observations.  In  addition,  it  is  common  practice  to  use  the  nonscientific  method  of 
choosing  the  bandwidth  that  gives  a  graph  that  to  the  eye  looks  reasonably  smooth  yet 
is  still  capable  of  picking  up  details  in  the  relationship  of  interest. 

Nonparametric methods form the bulk of this chapter, both because they are of
intrinsic interest and because they are an essential input for semiparametric methods,
presented most notably in the chapters on discrete and censored dependent-variable
models. Kernel methods are emphasized as they are relatively simple to present and
because “It is argued that all smoothing methods are in an asymptotic sense essentially
equivalent to kernel smoothing” (Härdle, 1990, p. xi).

Section  9.2  provides  examples  of  nonparametric  density  estimation  and  nonpara¬ 
metric  regression  applied  to  data.  Kernel  density  estimation  is  presented  in  Section 
9.3.  Local  regression  is  discussed  in  Section  9.4,  to  provide  motivation  for  the  formal 
treatment  of  kernel  regression  given  in  Section  9.5.  Section  9.6  presents  nonparamet¬ 
ric  regression  methods  other  than  kernel  methods.  The  vast  topic  of  semiparametric 
regression  is  then  introduced  in  Section  9.7. 

9.2.  Nonparametric  Example:  Hourly  Wage 

As an example we consider the hourly wage and education for 175 women aged
36 years who worked in 1993. The data are from the Michigan Panel Survey of Income
Dynamics. It is easily established that the distribution of the hourly wage is
right-skewed and so we model ln wage, the natural logarithm of the hourly wage.

We  give  just  one  example  of  nonparametric  density  estimation  and  one  of  nonpara¬ 
metric  regression  and  illustrate  the  important  role  of  bandwidth  selection.  Sections  9.3 
to  9.6  then  provide  the  underlying  theory. 

9.2.1.  Nonparametric  Density  Estimate 

A histogram of the natural logarithm of wage is given in Figure 9.1. To provide detail
the bin width is chosen so that there are 30 bins, each of width about 0.20. This is an
unusually narrow bin width for only 175 observations, but many details are lost with
a larger bin width. The log-wage data seem to be reasonably symmetric, though they
are possibly slightly left-skewed.

Figure 9.1: Histogram for natural logarithm of hourly wage (horizontal axis: log hourly
wage). Data for 175 U.S. women aged 36 years who worked in 1993.

The standard smoothed nonparametric density estimate is the kernel density estimate
defined in (9.3). Here we use the Epanechnikov kernel defined in Table 9.1.

The essential decision in implementation is the choice of bandwidth. For this example
Silverman’s plug-in estimate defined in (9.13) yields bandwidth of $h = 0.545$.
Then the kernel estimate is a weighted average of those observations that have log
wage within 0.545 units of the log wage at the current point of evaluation, with greatest
weight placed on data closest to the current point of evaluation. Figure 9.2 presents
three kernel density estimates, with bandwidths of 0.273, 0.545, and 1.091, respectively
corresponding to one-half the plug-in, the plug-in, and two times the plug-in bandwidth.
Clearly the smallest bandwidth is too small as it leads to too jagged a density estimate.
The largest bandwidth oversmooths the data. The middle bandwidth, the plug-in
value of 0.545, seems the best choice. It gives a reasonably smooth density estimate.

Figure 9.2: Kernel density estimates for log wage for three different bandwidths using the
Epanechnikov kernel (panel title: Density Estimates as Bandwidth Varies). The plug-in
bandwidth is $h = 0.545$. Same data as Figure 9.1.

What  might  we  do  with  this  kernel  density  estimate?  One  possibility  is  to  compare 
the  density  to  the  normal,  by  superimposing  a  normal  density  with  mean  equal  to  the 
sample  mean  and  variance  equal  to  the  sample  variance.  The  graph  is  not  reproduced 
here  but  reveals  that  the  kernel  density  estimate  with  preferred  bandwidth  0.545  is  con¬ 
siderably  more  peaked  than  the  normal.  A  second  possibility  is  to  compare  log-wage 
kernel  density  estimates  for  different  subgroups,  such  as  by  educational  attainment  or 
by  full-time  or  part-time  work  status. 


9.2.2.  Nonparametric  Regression 

We  consider  the  relationship  between  log  wage  and  education.  The  nonparametric 
method  used  here  is  the  Lowess  local  regression  method,  a  local  weighted  average 
estimator  (see  Equation  (9.16)  and  Section  9.6.2). 

A local weighted regression line at each point $x$ is fitted using centered subsets that
include the closest $0.8N$ observations, the program default, where $N$ is the sample
size, and the weights decline as we move away from $x$. For values of $x$ near the end
points, smaller uncentered subsets are used.

Figure 9.3 gives a scatter plot of log wage against education and three Lowess
regression curves for bandwidths of 0.8, 0.4, and 0.1. The first two bandwidths give
similar curves. The relationship appears to be quadratic, but this may be speculative as
the data are relatively sparse at low education levels, with less than 10% of the sample
having less than 10 years of schooling. For the majority of the data a linear relationship
may also work well. For simplicity we have not presented 95% confidence intervals or
bands that might also be provided.

Figure 9.3: Nonparametric regression of log wage on education for three different bandwidths
using Lowess regression. Same sample as Figure 9.1.

9.3.  Kernel Density Estimation

Nonparametric  density  estimates  are  useful  for  comparison  across  different  groups  and 
for  comparison  to  a  benchmark  density  such  as  the  normal.  Compared  to  a  histogram 
they  have  the  advantage  of  providing  a  smoother  density  estimate.  A  key  decision, 
analogous  to  choosing  the  number  of  bins  in  a  histogram,  is  bandwidth  choice.  We 
focus  on  the  standard  nonparametric  density  estimator,  the  kernel  density  estimator.  A 
detailed  presentation  is  given  as  results  also  relevant  for  regression  are  more  simply 
obtained  for  density  estimation. 


9.3.1.  Histogram 

A  histogram  is  an  estimate  of  the  density  formed  by  splitting  the  range  of  x  into 
equally  spaced  intervals  and  calculating  the  fraction  of  the  sample  in  each  interval. 

We give a more formal presentation of the histogram, one that extends naturally to
the smoother kernel density estimator. Consider estimation of the density $f(x_0)$ of a
scalar continuous random variable $x$ evaluated at $x_0$. Since the density is the derivative
of the cdf $F(x_0)$ (i.e., $f(x_0) = dF(x_0)/dx$) we have

$$f(x_0) = \lim_{h \to 0} \frac{F(x_0 + h) - F(x_0 - h)}{2h} = \lim_{h \to 0} \frac{\Pr[x_0 - h < x < x_0 + h]}{2h}.$$

For a sample $\{x_i, i = 1, \ldots, N\}$ of size $N$, this suggests using the estimator

$$\hat{f}_{HIST}(x_0) = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}(x_0 - h < x_i < x_0 + h)}{2h}, \qquad (9.1)$$

where the indicator function

$$\mathbf{1}(A) = \begin{cases} 1 & \text{if event } A \text{ occurs,} \\ 0 & \text{otherwise.} \end{cases}$$

The estimator $\hat{f}_{HIST}(x_0)$ is a histogram estimate centered at $x_0$ with bin width $2h$, since
it equals the fraction of the sample that lies between $x_0 - h$ and $x_0 + h$ divided by the
bin width $2h$. If $\hat{f}_{HIST}$ is evaluated over the range of $x$ at equally spaced values of $x$,
each $2h$ units apart, it yields a histogram.

The estimator $\hat{f}_{HIST}(x_0)$ gives all observations in $x_0 \pm h$ equal weight as is clear
from rewriting (9.1) as

$$\hat{f}_{HIST}(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} \frac{1}{2} \times \mathbf{1}\left(\left|\frac{x_i - x_0}{h}\right| < 1\right). \qquad (9.2)$$


This  leads  to  a  density  estimate  that  is  a  step  function,  even  if  the  underlying  density 
is  continuous.  Smoother  estimates  can  be  obtained  by  using  weighting  functions  other 
than  the  indicator  function  chosen  here. 
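The histogram estimator in (9.1) translates directly into code. The following Python sketch is illustrative only; the function name `hist_density` and the data are not from the text.

```python
def hist_density(x0, data, h):
    """Histogram estimate at x0: share of the sample within h of x0, divided by 2h."""
    count = sum(1 for xi in data if x0 - h < xi < x0 + h)
    return count / (len(data) * 2.0 * h)

data = [1.0, 1.2, 1.9, 2.4, 3.1]        # illustrative sample
print(hist_density(2.0, data, 0.5))     # two of five points lie in (1.5, 2.5)
print(hist_density(2.05, data, 0.5))    # same two points, so the estimate is unchanged
```

Because the estimate changes only when an observation enters or leaves the window, it is a step function of the evaluation point.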
9.3.2.  Kernel Density Estimator

The kernel density estimator, introduced by Rosenblatt (1956), generalizes the histogram
estimate (9.2) by using an alternative weighting function, so

$$\hat{f}(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right). \qquad (9.3)$$

The weighting function $K(\cdot)$ is called a kernel function and satisfies restrictions given
in the next section. The parameter $h$ is a smoothing parameter called the bandwidth,
and two times $h$ is the window width. The density is estimated by evaluating $\hat{f}(x_0)$ at
a wider range of values of $x_0$ than used in forming a histogram; usually evaluation is
at the sample values $x_1, \ldots, x_N$. This also helps provide a density estimate smoother
than a histogram.
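A direct implementation of (9.3) is short. The sketch below uses the Epanechnikov kernel from Table 9.1; the function names and the data are illustrative assumptions, and no attempt is made at computational efficiency.

```python
def epanechnikov(z):
    """Epanechnikov kernel: (3/4)(1 - z^2) for |z| < 1 and zero otherwise."""
    return 0.75 * (1.0 - z * z) if abs(z) < 1 else 0.0

def kernel_density(x0, data, h, kernel=epanechnikov):
    """Kernel density estimate at x0 with bandwidth h."""
    return sum(kernel((xi - x0) / h) for xi in data) / (len(data) * h)

data = [1.0, 1.2, 1.9, 2.4, 3.1]     # illustrative sample
for x0 in data:                      # evaluate at the sample values
    print(x0, round(kernel_density(x0, data, h=0.5), 4))
```

Unlike the histogram estimate, the result varies smoothly with the evaluation point because the weights decline continuously to zero.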


9.3.3.  Kernel  Functions 

The kernel function $K(\cdot)$ is a continuous function, symmetric around zero, that integrates
to unity and satisfies additional boundedness conditions. Following Lee (1996)
we assume that the kernel satisfies the following conditions:

(i)  $K(z)$ is symmetric around 0 and is continuous.

(ii)  $\int K(z)dz = 1$, $\int zK(z)dz = 0$, and $\int |K(z)|dz < \infty$.

(iii)  Either (a) $K(z) = 0$ if $|z| > z_0$ for some $z_0$ or (b) $|z|K(z) \to 0$ as $|z| \to \infty$.

(iv)  $\int z^2K(z)dz = \kappa$, where $\kappa$ is a constant.

In practice kernel functions work better if they satisfy condition (iiia) rather than
just the weaker condition (iiib). Then restricting attention to the interval $[-1, 1]$ rather
than $[-z_0, z_0]$ is simply a normalization for convenience, and usually $K(z)$ is restricted
to $z \in [-1, 1]$.

Some commonly used kernel functions are given in Table 9.1. The uniform kernel
uses the same weights as a histogram of bin width $2h$, except that it produces a running
histogram that is evaluated at a series of points $x_0$ rather than using fixed bins. The
Gaussian kernel satisfies (iiib) rather than (iiia) because it does not restrict $z \in [-1, 1]$.
A $p$th-order kernel is one whose first nonzero moment is the $p$th moment. The first
seven kernels are of second order and satisfy the second condition in (ii). The last
two kernels are fourth-order kernels. Such higher order kernels can increase rates of
convergence if $f(x)$ is more than twice differentiable (see Section 9.3.10), though they
can take negative values. Table 9.1 also gives the parameter $\delta$, defined in (9.11) and
used in Section 9.3.6 to aid bandwidth choice, for some of the kernels.
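Conditions (ii) and (iv) can be checked numerically for particular kernels. The sketch below uses a simple midpoint rule; the helper names (`integrate`, `moment`), the integration range, and the step count are choices made for this illustration.

```python
import math

KERNELS = {
    "uniform": lambda z: 0.5 if abs(z) < 1 else 0.0,
    "epanechnikov": lambda z: 0.75 * (1 - z * z) if abs(z) < 1 else 0.0,
    "quartic": lambda z: (15 / 16) * (1 - z * z) ** 2 if abs(z) < 1 else 0.0,
    "gaussian": lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi),
}

def integrate(g, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule approximation to the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (j + 0.5) * width) for j in range(steps))

def moment(kernel, p):
    """Approximate the p-th moment of the kernel, the integral of z^p K(z)."""
    return integrate(lambda z: z ** p * kernel(z))

for name, k in KERNELS.items():
    # Expect an integral of one, a zero first moment, and a finite second moment.
    print(name, round(moment(k, 0), 4), round(moment(k, 1), 4), round(moment(k, 2), 4))
```

Each of these kernels has a nonzero second moment, which is what makes it a second-order kernel.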

Given $K(\cdot)$ and $h$ the estimator is very simple to implement. If the kernel estimator
is evaluated at $r$ distinct values of $x_0$ then computation of the kernel estimator requires
at most $Nr$ operations, when the kernel has unbounded support. Considerable computational
savings on this are possible; see, for example, Härdle (1990, p. 35).

Table 9.1. Kernel Functions: Commonly Used Examples^a

Kernel                             Kernel Function $K(z)$                                        $\delta$
Uniform (or box or rectangular)    $\frac{1}{2} \times \mathbf{1}(|z| < 1)$                      1.3510
Triangular (or triangle)           $(1 - |z|) \times \mathbf{1}(|z| < 1)$                        -
Epanechnikov (or quadratic)        $\frac{3}{4}(1 - z^2) \times \mathbf{1}(|z| < 1)$             1.7188
Quartic (or biweight)              $\frac{15}{16}(1 - z^2)^2 \times \mathbf{1}(|z| < 1)$         2.0362
Triweight                          $\frac{35}{32}(1 - z^2)^3 \times \mathbf{1}(|z| < 1)$         2.3122
Tricubic                           $\frac{70}{81}(1 - |z|^3)^3 \times \mathbf{1}(|z| < 1)$       -
Gaussian (or normal)               $(2\pi)^{-1/2}\exp(-z^2/2)$                                   0.7764
Fourth-order Gaussian              $\frac{1}{2}(3 - z^2)(2\pi)^{-1/2}\exp(-z^2/2)$               -
Fourth-order quartic               $\frac{15}{32}(3 - 10z^2 + 7z^4) \times \mathbf{1}(|z| < 1)$  -

a The constant $\delta$ is defined in (9.11) and is used to obtain Silverman’s plug-in estimate given in (9.13).


9.3.4.  Kernel  Density  Example 

The  key  choice  of  bandwidth  h  has  already  been  illustrated  in  Figure  9.2. 

Here we illustrate the choice of kernel using generated data, a random sample of
size 100 drawn from the $\mathcal{N}[0, 25^2]$ distribution. For the particular sample drawn the
sample mean is 2.81 and the sample standard deviation is 25.27.

Figure 9.4 shows the effect of using different kernels. For Epanechnikov, Gaussian,
quartic, and uniform kernels, Silverman’s plug-in estimate given in (9.13) yields bandwidths
of, respectively, 0.545, 0.246, 0.246, and 0.214. The resulting kernel density
estimates are very similar, even for the uniform kernel which produces a running
histogram. The variation in density estimate with kernel choice is much less than the
variation with bandwidth choice evident in Figure 9.2.

Figure 9.4: Kernel density estimates for log wage for four different kernels using the corresponding
Silverman’s plug-in estimate for bandwidth (panel title: Density Estimates as Kernel
Varies; horizontal axis: log hourly wage). Same data as Figure 9.1.

9.3.5.  Statistical Inference

We present the distribution of the kernel density estimator $\hat{f}(x)$ for given choice of
$K(\cdot)$ and $h$, assuming the data $x$ are iid. The estimate $\hat{f}(x)$ is biased. This bias goes to
zero asymptotically if the bandwidth $h \to 0$ as $N \to \infty$, so $\hat{f}(x)$ is consistent. However,
the bias term does not necessarily disappear in the asymptotic normal distribution
for $\hat{f}(x)$, complicating statistical inference.


Mean  and  Variance 

The mean and variance of $\hat{f}(x_0)$ are obtained in Section 9.8.1, assuming that the second
derivative of $f(x)$ exists and is bounded and that the kernel satisfies $\int zK(z)dz = 0$,
as assumed in property (ii) of Section 9.3.3.

The kernel density estimator is biased with bias term $b(x_0)$ that depends on the
bandwidth, the curvature of the true density, and the kernel used according to

$$b(x_0) = E[\hat{f}(x_0)] - f(x_0) = \frac{1}{2}h^2 f''(x_0) \int z^2K(z)dz. \qquad (9.4)$$

The kernel estimator is biased of size $O(h^2)$, where we use the order of magnitude
notation that a function $a(h)$ is $O(h^k)$ if $a(h)/h^k$ is finite. The bias disappears asymptotically
if $h \to 0$ as $N \to \infty$.

Assuming $h \to 0$ and $N \to \infty$, the variance of the kernel density estimator is

$$V[\hat{f}(x_0)] = \frac{1}{Nh} f(x_0) \int K(z)^2dz + o\left(\frac{1}{Nh}\right), \qquad (9.5)$$

where a function $a(h)$ is $o(h^k)$ if $a(h)/h^k \to 0$. The variance depends on the sample
size, bandwidth, the true density, and the kernel. The variance disappears if $Nh \to \infty$,
which requires that while $h \to 0$ it must do so at a slower rate than $N \to \infty$.
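To get a feel for the magnitudes in (9.4) and (9.5), the leading bias and variance terms can be evaluated for a known density. The sketch below assumes a standard normal true density and the Epanechnikov kernel, for which $\int z^2K(z)dz = 1/5$ and $\int K(z)^2dz = 3/5$; the function names and the choices $N = 100$, $h = 0.5$, $x_0 = 0$ are illustrative.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_pdf_second_derivative(x):
    """Second derivative of the standard normal density: (x^2 - 1) f(x)."""
    return (x * x - 1.0) * normal_pdf(x)

def approx_bias(x0, h, kernel_second_moment):
    """Leading bias term: (1/2) h^2 f''(x0) times the kernel second moment."""
    return 0.5 * h ** 2 * normal_pdf_second_derivative(x0) * kernel_second_moment

def approx_variance(x0, h, n, kernel_roughness):
    """Leading variance term: f(x0) times the integral of K^2, divided by N h."""
    return normal_pdf(x0) * kernel_roughness / (n * h)

bias = approx_bias(0.0, h=0.5, kernel_second_moment=1 / 5)
variance = approx_variance(0.0, h=0.5, n=100, kernel_roughness=3 / 5)
print(bias, variance, bias ** 2 + variance)   # last value approximates the MSE at x0 = 0
```

The bias is negative at the mode because the density is concave there, so local averaging pulls the estimate down.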


Consistency 

The kernel estimator is pointwise consistent, that is, consistent at a particular point
$x = x_0$, if both the bias and variance disappear. This is the case if $h \to 0$ and
$Nh \to \infty$.

For estimation of $f(x)$ at all values of $x$ the stronger condition of uniform convergence,
that is, $\sup_{x_0} |\hat{f}(x_0) - f(x_0)| \xrightarrow{p} 0$, can be shown to occur if $Nh/\ln N \to \infty$.
This requires $h$ larger than for pointwise convergence.


Asymptotic  Normality 

The preceding results show that asymptotically $\hat{f}(x_0)$ has mean $f(x_0) + b(x_0)$ and
variance $(Nh)^{-1} f(x_0) \int K(z)^2dz$. It follows that if a central limit theorem can be
applied, the kernel density estimator has limit distribution

$$\sqrt{Nh}\,(\hat{f}(x_0) - f(x_0) - b(x_0)) \xrightarrow{d} \mathcal{N}\left[0,\; f(x_0) \int K(z)^2dz\right]. \qquad (9.6)$$

The central limit theorem applied is a nonstandard one and requires condition (iv); see,
for example, Lee (1996, p. 139) or Pagan and Ullah (1999, p. 40).

It is important to note the presence of the bias term $b(x_0)$, defined in (9.4). For
typical choices of bandwidth this term does not disappear, complicating computation
of confidence intervals (presented in Section 9.3.7).


9.3.6.  Bandwidth  Choice 

The choice of bandwidth $h$ is much more important than choice of kernel function
$K(\cdot)$. There is a tension between setting $h$ small to reduce bias and setting $h$ large to
ensure smoothness. A natural metric to use is therefore mean-squared error (MSE),
the sum of bias squared and variance.

From (9.4) the bias is $O(h^2)$ and from (9.5) the variance is $O((Nh)^{-1})$. Intuitively
MSE is minimized by choosing $h$ so that bias squared and variance are of the
same order, so $h^4 = (Nh)^{-1}$, which implies the optimal bandwidth $h = O(N^{-0.2})$ and
$\sqrt{Nh} = O(N^{0.4})$. We now give a more formal treatment that includes a practical plug-in
estimate for $h$.
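The rate calculation can be written out as a short derivation. This is a sketch that suppresses the constants in (9.4) and (9.5); $c_1$ and $c_2$ are placeholder positive constants introduced here for illustration.

```latex
\begin{align*}
\mathrm{MSE}(h) &\approx c_1 h^4 + \frac{c_2}{Nh}, \\
\frac{d\,\mathrm{MSE}(h)}{dh} = 4 c_1 h^3 - \frac{c_2}{N h^2} = 0
  &\;\Longrightarrow\; h^5 = \frac{c_2}{4 c_1 N}
  \;\Longrightarrow\; h = O(N^{-0.2}), \\
Nh = O(N^{0.8}), \qquad \sqrt{Nh} &= O(N^{0.4}), \qquad \mathrm{MSE}(h) = O(N^{-0.8}).
\end{align*}
```

At this choice of $h$ the squared bias and the variance are both of order $N^{-0.8}$, so neither term dominates.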


Mean  Integrated  Squared  Error 

A local measure of the performance of the kernel density estimate at $x_0$ is the MSE

$$MSE[\hat{f}(x_0)] = E[(\hat{f}(x_0) - f(x_0))^2], \qquad (9.7)$$

where the expectation is with respect to the density $f(x)$. Since MSE equals variance
plus squared bias, (9.4) and (9.5) yield the MSE of the kernel density estimate

$$MSE[\hat{f}(x_0)] \simeq \frac{1}{Nh} f(x_0) \int K(z)^2dz + \left\{\frac{1}{2}h^2 f''(x_0) \int z^2K(z)dz\right\}^2. \qquad (9.8)$$

To obtain a global measure of performance at all values of $x_0$ we begin by defining
the integrated squared error (ISE)

$$ISE(h) = \int (\hat{f}(x_0) - f(x_0))^2 dx_0, \qquad (9.9)$$

the continuous analogue of summing squared error over all $x_0$ in the discrete case.
This is written as a function of $h$ to emphasize dependence on the bandwidth. We then
eliminate the dependence of $\hat{f}(x_0)$ on $x$ values other than $x_0$ by taking the expected
value of the ISE with respect to the density $f(x)$. This yields the mean integrated
squared error (MISE),

$$\begin{aligned}
MISE(h) &= E[ISE(h)] \\
&= E\left[\int (\hat{f}(x_0) - f(x_0))^2 dx_0\right] \\
&= \int E[(\hat{f}(x_0) - f(x_0))^2] dx_0 \\
&= \int MSE[\hat{f}(x_0)] dx_0,
\end{aligned}$$

where $MSE[\hat{f}(x)]$ is defined in (9.8). From the preceding algebra MISE equals the
integrated mean-squared error (IMSE).


Optimal  Bandwidth 

The optimal bandwidth minimizes MISE. Differentiating $\mathrm{MISE}(h)$ with respect to $h$ and setting the derivative to zero yields the optimal bandwidth

$$h^* = \delta \left( \int f''(x_0)^2 dx_0 \right)^{-0.2} N^{-0.2}, \tag{9.10}$$

where $\delta$ depends on the kernel function used, with

$$\delta = \left( \frac{\int K(z)^2 dz}{\left( \int z^2 K(z) dz \right)^2} \right)^{0.2}. \tag{9.11}$$

This result is due to Silverman (1986).

Since $h^* = O(N^{-0.2})$, we have $h^* \to 0$ as $N \to \infty$ and $Nh^* = O(N^{0.8}) \to \infty$, as required for consistency. The bias in $\hat{f}(x_0)$ is $O(h^{*2}) = O(N^{-0.4})$, which disappears as $N \to \infty$. For a histogram estimate it can be shown that $h^* = O(N^{-1/3})$ and $\mathrm{MISE}(h^*) = O(N^{-2/3})$, inferior to $\mathrm{MISE}(h^*) = O(N^{-4/5})$ for the kernel density estimate.

The  optimal  bandwidth  depends  on  the  curvature  of  the  density,  with  h*  lower  if 
f(x)  is  highly  variable. 


Optimal  Kernel 

The optimal bandwidth varies with the kernel (see (9.10) and (9.11)). It can be shown that $\mathrm{MISE}(h^*)$ varies little across kernels, provided different optimal $h^*$ are used for different kernels (Figure 9.4 provides an illustration). It can be shown that the optimal kernel is the Epanechnikov, though this advantage is slight.

Bandwidth choice is much more important than kernel choice and from (9.10) this varies with the kernel.


Plug-in Bandwidth Estimate

A plug-in estimate for the bandwidth is a simple formula for $h$ that depends on the sample size $N$ and the sample standard deviation $s$.

A useful starting point is to assume that the data are normally distributed. Then $\int f''(x_0)^2 dx_0 = 3/(8\sqrt{\pi}\sigma^5) = 0.2116/\sigma^5$, in which case (9.10) specializes to

$$h^* = 1.3643\,\delta\,N^{-0.2} s, \tag{9.12}$$

where $s$ is the sample standard deviation of $x$ and $\delta$ is given in Table 9.1 for several kernels. For the Epanechnikov kernel $h^* = 2.345 N^{-0.2} s$, and for the Gaussian kernel $h^* = 1.059 N^{-0.2} s$. The considerably lower bandwidth for the normal kernel arises because, unlike most kernels, the normal kernel gives some weight to $x_i$ even if $|x_i - x_0| > h$. In practice one uses Silverman's plug-in estimate

$$h^* = 1.3643\,\delta\,N^{-0.2} \min(s, \mathrm{iqr}/1.349), \tag{9.13}$$

where iqr is the sample interquartile range. This uses $\mathrm{iqr}/1.349$ as an alternative estimate of $\sigma$ that protects against outliers, which can increase $s$ and lead to too large an $h$.

These plug-in estimates for $h$ work well in practice, especially for symmetric unimodal densities, even if $f(x)$ is not the normal density. Nonetheless, one should also check by using variations such as twice and half the plug-in estimate.

For the example in Figures 9.2 and 9.4 we have $177^{-0.2} = 0.3551$, $s = 0.8282$, and $\mathrm{iqr}/1.349 = 0.6459$, so (9.13) yields $h^* = 0.3173\delta$. For the Epanechnikov kernel, for example, this yields $h^* = 0.545$ since $\delta = 1.7188$ from Table 9.1.
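
The plug-in rule (9.13) can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: the function names `delta` and `silverman_bandwidth` are hypothetical, the kernel constants are computed from (9.11) assuming the usual forms $K(z) = 0.75(1 - z^2)$ on $|z| < 1$ for the Epanechnikov kernel and the standard normal density for the Gaussian kernel, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the quartile convention of other software.

```python
import math
import statistics


def delta(kernel):
    """Kernel constant from (9.11): (int K^2 / (int z^2 K)^2) ** 0.2."""
    if kernel == "epanechnikov":
        # Assumed form K(z) = 0.75 (1 - z^2) on |z| < 1.
        roughness, second_moment = 3.0 / 5.0, 1.0 / 5.0
    elif kernel == "gaussian":
        roughness, second_moment = 1.0 / (2.0 * math.sqrt(math.pi)), 1.0
    else:
        raise ValueError("unknown kernel: " + kernel)
    return (roughness / second_moment ** 2) ** 0.2


def silverman_bandwidth(x, kernel="gaussian"):
    """Plug-in bandwidth (9.13): 1.3643 * delta * N^(-0.2) * min(s, iqr/1.349)."""
    n = len(x)
    s = statistics.stdev(x)                   # sample standard deviation
    q1, _, q3 = statistics.quantiles(x, n=4)  # sample quartiles
    scale = min(s, (q3 - q1) / 1.349)         # outlier-resistant scale
    return 1.3643 * delta(kernel) * n ** (-0.2) * scale
```

As the text suggests, it is sensible to compare density plots at half and twice the value returned by such a rule.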


Cross-Validation 

From (9.9), $\mathrm{ISE}(h) = \int \hat{f}^2(x_0) dx_0 - 2 \int \hat{f}(x_0) f(x_0) dx_0 + \int f^2(x_0) dx_0$. The third term does not depend on $h$. An alternative data-driven approach estimates the first two terms in $\mathrm{ISE}(h)$ by

$$\mathrm{CV}(h) = \frac{1}{N^2 h} \sum_{i=1}^{N} \sum_{j=1}^{N} K^{(2)}\left( \frac{x_i - x_j}{h} \right) - \frac{2}{N} \sum_{i=1}^{N} \hat{f}_{-i}(x_i), \tag{9.14}$$

where $K^{(2)}(u) = \int K(u - t) K(t) dt$ is the convolution of $K$ with itself, and $\hat{f}_{-i}(x_i)$ is the leave-one-out kernel estimator of $f(x_i)$. See Lee (1996, p. 137) or Pagan and Ullah (1999, p. 51) for a derivation. The cross-validation estimate $\hat{h}_{CV}$ is chosen to minimize $\mathrm{CV}(h)$. It can be shown that $\hat{h}_{CV} \xrightarrow{p} h^*$ as $N \to \infty$, but the rate of convergence is very slow.
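
As a rough sketch of how the criterion (9.14) might be evaluated, the code below assumes a Gaussian kernel, for which the convolution $K^{(2)}$ is the $N[0, 2]$ density. The function names are hypothetical, the leave-one-out estimator is taken to be $\hat{f}_{-i}(x_i) = ((N-1)h)^{-1} \sum_{j \neq i} K((x_i - x_j)/h)$, and the search is a plain grid search over candidate bandwidths supplied by the user.

```python
import math


def gauss(u, var=1.0):
    """Normal density with mean 0 and variance var, evaluated at u."""
    return math.exp(-0.5 * u * u / var) / math.sqrt(2.0 * math.pi * var)


def integral_fhat_sq(x, h):
    """First term of (9.14): closed form for the integral of f_hat squared."""
    n = len(x)
    # For the Gaussian kernel, K convolved with itself is the N(0, 2) density.
    total = sum(gauss((xi - xj) / h, var=2.0) for xi in x for xj in x)
    return total / (n * n * h)


def loo_mean(x, h):
    """Average of the leave-one-out density estimates f_{-i}(x_i)."""
    n = len(x)
    total = 0.0
    for i, xi in enumerate(x):
        others = sum(gauss((xi - xj) / h) for j, xj in enumerate(x) if j != i)
        total += others / ((n - 1) * h)
    return total / n


def cv_density(x, h):
    """Cross-validation criterion (9.14)."""
    return integral_fhat_sq(x, h) - 2.0 * loo_mean(x, h)


def cv_bandwidth(x, grid):
    """Bandwidth in the candidate grid that minimizes CV(h)."""
    return min(grid, key=lambda h: cv_density(x, h))
```

Each evaluation of the criterion costs on the order of $N^2$ kernel evaluations, which is one reason a plug-in value is a convenient center for the grid.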

Obtaining $\hat{h}_{CV}$ is computationally burdensome because $\mathrm{ISE}(h)$ needs to be computed for a range of values of $h$. It is often not necessary to cross-validate for kernel density estimation as the plug-in estimate usually provides a good starting point.


9.3.7. Confidence Intervals

Kernel density estimates are usually presented without confidence intervals, but it is possible to construct pointwise confidence intervals for $f(x_0)$, where pointwise means evaluated at a particular value of $x_0$. A simple procedure is to obtain confidence intervals at a small number of evaluation points $x_0$, say 10, that are evenly distributed over the range of $x$ and plot these along with the estimated density curves.

The result (9.6) yields the following 95% confidence interval for $f(x_0)$:

$$f(x_0) \in \hat{f}(x_0) - b(x_0) \pm 1.96 \times \sqrt{\frac{1}{Nh} \hat{f}(x_0) \int K(z)^2 dz}.$$

For most kernels $\int K(z)^2 dz$ is easily obtained by analytical methods.

The situation is complicated by the bias term, which should not be ignored in finite samples, even though asymptotically $b(x_0) \to 0$. This is because with optimal bandwidth $h^* = O(N^{-0.2})$ the bias of the rescaled random variable $\sqrt{Nh}(\hat{f}(x_0) - f(x_0))$ given in (9.6) does not disappear, since $\sqrt{Nh^*}$ times $O(h^{*2})$ is $O(1)$. The bias can be estimated using (9.4) and a kernel estimate of $f''(x_0)$, but in practice the estimate of $f''(x_0)$ is noisy. Instead, the usual method is to reduce the bias in computing the confidence interval, but not $\hat{f}(x_0)$ itself, by undersmoothing, that is, by choosing $h < h^*$ so that $h = o(N^{-0.2})$. Other approaches include using a higher order kernel, such as the fourth-order kernels given in Table 9.1, or bootstrapping (see Section 11.6.5).

One can also compute confidence bands for $f(x)$ over all possible values of $x$. These are wider than the pointwise confidence intervals for each value $x_0$.
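
A pointwise interval of this kind might be computed as in the following sketch. It assumes a Gaussian kernel, for which $\int K(z)^2 dz = 1/(2\sqrt{\pi})$, and it sets the bias term to zero, which is only reasonable if the supplied bandwidth undersmooths as described above. The function names are hypothetical.

```python
import math


def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, x0, h):
    """Kernel density estimate at x0 with a Gaussian kernel."""
    return sum(gauss((xi - x0) / h) for xi in x) / (len(x) * h)


def pointwise_ci(x, x0, h, z=1.96):
    """Interval f_hat(x0) +/- z * sqrt(f_hat(x0) * int K^2 / (N h)).

    The bias term b(x0) is ignored, so h should be smaller than the
    MISE-optimal bandwidth (undersmoothing).
    """
    n = len(x)
    f_hat = kde(x, x0, h)
    roughness = 1.0 / (2.0 * math.sqrt(math.pi))  # int K(z)^2 dz, Gaussian kernel
    se = math.sqrt(f_hat * roughness / (n * h))
    return f_hat - z * se, f_hat + z * se
```

Evaluating `pointwise_ci` at about ten evenly spaced values of `x0` gives the error bars suggested in the text.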


9.3.8.  Estimation  of  Derivatives  of  a  Density 

In some cases estimates of the derivatives of a density need to be made. For example, estimation of the bias term of $\hat{f}(x_0)$ given in (9.4) requires an estimate of $f''(x_0)$.

For simplicity we present estimates of the first derivative. A finite-difference approach uses $\hat{f}'(x_0) = [\hat{f}(x_0 + \Delta) - \hat{f}(x_0 - \Delta)]/2\Delta$. A calculus approach instead takes the first derivative of $\hat{f}(x_0)$ in (9.3), yielding $\hat{f}'(x_0) = -(Nh^2)^{-1} \sum_i K'((x_i - x_0)/h)$.

Intuitively, a larger bandwidth should be used to estimate derivatives, which can be more variable than $\hat{f}(x_0)$. The bias of $\hat{f}^{(s)}(x_0)$ is as before but the variance converges more slowly, leading to optimal bandwidth $h^* = O(N^{-1/(2s+2p+1)})$ if $f(x_0)$ is $p$ times differentiable. For kernel estimation of the first derivative we need $p \geq 3$.
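
The two approaches to the first derivative can be compared numerically. The sketch below assumes a Gaussian kernel, for which $K'(z) = -zK(z)$, so that the calculus formula becomes $(Nh^2)^{-1} \sum_i z_i K(z_i)$ with $z_i = (x_i - x_0)/h$. The function names and the default step `step` (playing the role of $\Delta$) are illustrative choices.

```python
import math


def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, x0, h):
    """Kernel density estimate at x0 with a Gaussian kernel."""
    return sum(gauss((xi - x0) / h) for xi in x) / (len(x) * h)


def kde_deriv_fd(x, x0, h, step=1e-4):
    """Finite-difference estimate [f_hat(x0 + step) - f_hat(x0 - step)] / (2 step)."""
    return (kde(x, x0 + step, h) - kde(x, x0 - step, h)) / (2.0 * step)


def kde_deriv_calculus(x, x0, h):
    """Calculus estimate -(N h^2)^(-1) * sum K'((x_i - x0)/h), Gaussian K."""
    n = len(x)
    total = 0.0
    for xi in x:
        z = (xi - x0) / h
        total += -z * gauss(z)  # K'(z) = -z K(z) for the Gaussian kernel
    return -total / (n * h * h)
```

For a small step the two estimates agree closely, since the finite difference is a numerical derivative of the same smooth function $\hat{f}$.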


9.3.9.  Multivariate  Kernel  Density  Estimate 


The preceding discussion considered kernel density estimation for scalar $x$. For the density of the $k$-dimensional random variable $\mathbf{x}$, the multivariate kernel density estimator is

$$\hat{f}(\mathbf{x}_0) = \frac{1}{Nh^k} \sum_{i=1}^{N} K\left( \frac{\mathbf{x}_i - \mathbf{x}_0}{h} \right),$$

where $K(\cdot)$ is now a $k$-dimensional kernel. Usually $K(\cdot)$ is a product kernel, the product of one-dimensional kernels. Multivariate kernels such as the multivariate normal density or spherical kernels proportionate to $K(\mathbf{z}'\mathbf{z})$ can also be used. The kernel $K(\cdot)$ satisfies properties similar to properties given in the one-dimensional case; see Lee (1996, p. 125).

The analytical results and expressions are similar to those before, except that the variance of $\hat{f}(\mathbf{x}_0)$ declines at rate $O(Nh^k)$, which for $k > 1$ is slower than $O(Nh)$ in the one-dimensional case. Then

$$\sqrt{Nh^k}\,\bigl(\hat{f}(\mathbf{x}_0) - f(\mathbf{x}_0) - b(\mathbf{x}_0)\bigr) \xrightarrow{d} \mathcal{N}\left[0,\; f(\mathbf{x}_0) \int K(\mathbf{z})^2 d\mathbf{z}\right].$$

The optimal bandwidth choice is $h = O(N^{-1/(k+4)})$, which is larger than $O(N^{-0.2})$ in the one-dimensional case, and implies $\sqrt{Nh^k} = O(N^{2/(4+k)})$. The plug-in and cross-validation methods can be extended to the multivariate case. For the product normal kernel Scott's plug-in estimate for the $j$th component of $\mathbf{x}$ is $h_j = N^{-1/(k+4)} s_j$, where $s_j$ is the sample standard deviation of $x_j$.

Problems of sparseness of data are more likely to arise with a multivariate kernel. There is a curse of dimensionality, as fewer observations in the vicinity of $\mathbf{x}_0$ receive substantial weight when $\mathbf{x}$ is of higher dimension. Even when this is not a problem, plotting even a bivariate kernel density estimate requires a three-dimensional plot that can be difficult to read and interpret.

One use of a multivariate kernel density estimate is to permit estimation of a conditional density. Since $f(y|x) = f(x, y)/f(x)$, an obvious estimator is $\hat{f}(y|x) = \hat{f}(x, y)/\hat{f}(x)$, where $\hat{f}(x, y)$ and $\hat{f}(x)$ are bivariate and univariate kernel density estimates.


9.3.10.  Higher  Order  Kernels 

The preceding analysis assumes $f(x)$ is twice differentiable, a necessary assumption to obtain the bias term in (9.4). If $f(x)$ is more than twice differentiable then using higher order kernels (see Section 9.3.3 for fourth-order examples) reduces the order of the bias, leading to larger $h^*$ and faster rates of convergence. A general statement is that if $\mathbf{x}$ is $k$ dimensional and $f(\mathbf{x})$ is $p$ times differentiable and a $p$th-order kernel is used, then the kernel estimate $\hat{f}(\mathbf{x}_0)$ of $f(\mathbf{x}_0)$ has optimal rate of convergence $N^{-p/(2p+k)}$ when $h^* = O(N^{-1/(2p+k)})$.


9.3.11.  Alternative  Nonparametric  Density  Estimates 

The kernel density estimate is the standard nonparametric estimate. Other density estimates are presented, for example, in Pagan and Ullah (1999). These often use approaches such as nearest-neighbors methods that are more commonly used in nonparametric regression and are presented briefly in Section 9.6.


9.4. Nonparametric Local Regression

We consider regression of scalar dependent variable $y$ on a scalar regressor variable $x$. The regression model is

$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, N, \qquad \varepsilon_i \sim \text{iid}\,[0, \sigma_\varepsilon^2]. \tag{9.15}$$

The complication is that the functional form $m(\cdot)$ is not specified, so NLS estimation is not possible.

This section provides a simple general treatment of nonparametric regression using local weighted averages. Specialization to kernel regression is given in Section 9.5 and other commonly used local weighted methods are presented in Section 9.6.

9.4.1.  Local  Weighted  Averages 

Suppose that for a distinct value of the regressor, say $x_0$, there are multiple observations on $y$, say $N_0$ observations. Then an obvious simple estimator for $m(x_0)$ is the sample average of these $N_0$ values of $y$, which we denote $\tilde{m}(x_0)$. It follows that $\tilde{m}(x_0) \sim [m(x_0), N_0^{-1}\sigma_\varepsilon^2]$ since it is the average of $N_0$ observations that by (9.15) are iid with mean $m(x_0)$ and variance $\sigma_\varepsilon^2$.

The estimator $\tilde{m}(x_0)$ is unbiased but not necessarily consistent. Consistency requires $N_0 \to \infty$ as $N \to \infty$, so that $V[\tilde{m}(x_0)] \to 0$. With discrete regressors this estimator may be very noisy in finite samples because $N_0$ may be small. Even worse, for continuous regressors there may be only one observation for which $x_i$ takes the particular value $x_0$, even as $N \to \infty$.

The problem of sparseness in data can be overcome by averaging observed values of $y$ when $x$ is close to $x_0$, in addition to when $x$ exactly equals $x_0$. We begin by noting that the estimator $\tilde{m}(x_0)$ can be expressed as a weighted average of the dependent variable, with $\tilde{m}(x_0) = \sum_i w_{i0} y_i$, where the weights $w_{i0}$ equal $1/N_0$ if $x_i = x_0$ and equal 0 if $x_i \neq x_0$. Thus the weights vary with both the evaluation point $x_0$ and the sample values of the regressors.

More generally we consider the local weighted average estimator

$$\hat{m}(x_0) = \sum_{i=1}^{N} w_{i0,h}\, y_i, \tag{9.16}$$

where the weights

$$w_{i0,h} = w(x_i, x_0, h)$$

sum to one, so $\sum_i w_{i0,h} = 1$. The weights are specified to increase as $x_i$ becomes closer to $x_0$.

The additional parameter $h$ is generic notation for a window width parameter. It is defined so that smaller values of $h$ lead to a smaller window and more weight being placed on those observations with $x_i$ close to $x_0$. In the specific example of kernel regression, $h$ is the bandwidth. Other methods given in Section 9.6 have alternative smoothing parameters that play a similar role to $h$ here. As $h$ becomes smaller $\hat{m}(x_0)$ becomes less biased, as only observations close to $x_0$ are being used, but more variable, as fewer observations are being used.

The OLS predictor for the linear regression model is a weighted average of $y_i$, since some algebra yields

$$\hat{m}_{OLS}(x_0) = \sum_{i=1}^{N} \left\{ \frac{1}{N} + \frac{(x_0 - \bar{x})(x_i - \bar{x})}{\sum_j (x_j - \bar{x})^2} \right\} y_i.$$

The OLS weights, however, can actually increase with increasing distance between $x_0$ and $x_i$ if, for example, $x_i > x_0 > \bar{x}$. Local regression instead uses weights that are decreasing in $|x_i - x_0|$.
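
The claim that the OLS prediction is a weighted average with weights that need not decline in $|x_i - x_0|$ is easy to check numerically. The sketch below is only an illustration under the bivariate regression with intercept; the function name `ols_weights` and the toy data in the usage note are hypothetical.

```python
def ols_weights(x, x0):
    """Weights w_i with sum_i w_i * y_i equal to the OLS prediction at x0.

    w_i = 1/N + (x0 - xbar) * (x_i - xbar) / sum_j (x_j - xbar)^2
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return [1.0 / n + (x0 - xbar) * (xi - xbar) / sxx for xi in x]


def ols_predict(x, y, x0):
    """OLS prediction at x0 written as a weighted average of y."""
    return sum(w * yi for w, yi in zip(ols_weights(x, x0), y))
```

For instance, with regressor values 1, 2, 3, 4, 5 and evaluation point 4, the observation at 5 receives weight 0.4 while the observation at 4 itself receives only 0.3, so the weight rises with distance on that side of the sample mean.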


9.4.2.  K-Nearest  Neighbors  Example 


We consider a simple example, the unweighted average of the $y$ values corresponding to the closest $(k - 1)/2$ observations on $x$ less than $x_0$ and the closest $(k - 1)/2$ observations on $x$ greater than $x_0$.

Order the observations by increasing $x$ values. Then evaluation at $x_0 = x_i$ yields

$$\hat{m}_k(x_i) = \frac{1}{k} \left( y_{i-(k-1)/2} + \cdots + y_{i+(k-1)/2} \right),$$

where for simplicity $k$ is odd, and potential modifications caused by ties and values of $x_0$ close to the end points $x_1$ or $x_N$ are ignored. This estimator can be expressed as a special case of (9.16) with weight

$$w_{i0,k} = \frac{1}{k} \times \mathbf{1}\left( |i - 0| < \frac{k+1}{2} \right), \qquad x_1 < x_2 < \cdots < x_0 < \cdots < x_N,$$

where $|i - 0|$ denotes the distance in rank between $x_i$ and $x_0$ in the ordered sample.

This estimator has many names. We refer to it as a (symmetrized) k-nearest neighbors estimator (k-NN), defined in Section 9.6.1. It is also a standard local running average or running mean or moving average of length $k$ centered at $x_0$ that is used, for example, to plot a time series $y$ against time $x$. The parameter $k$ plays the role of the window width $h$ in Section 9.4.1, with small $k$ corresponding to small $h$.

As an illustration, consider data generated from the model

$$\begin{aligned}
y_i &= 150 + 6.5 x_i - 0.15 x_i^2 + 0.001 x_i^3 + \varepsilon_i, \quad i = 1, \ldots, 100, \\
x_i &= i, \\
\varepsilon_i &\sim \mathcal{N}[0, 25^2].
\end{aligned} \tag{9.17}$$

The mean of $y$ is a cubic in $x$, with $x$ taking values $1, 2, \ldots, 100$, with turning points at $x = 20$ and $x = 80$. To this is added a normally distributed error term with standard deviation 25.

Figure 9.5 plots the symmetrized k-NN estimator with $k = 5$ and 25. Both moving averages suggest a cubic relationship. The second is smoother than the first but is still quite jagged despite one-quarter of the sample being used to form the average. The OLS regression line is also given on the diagram.

[Figure 9.5: k-nearest neighbors regression curve for two different choices of k, as well as OLS regression line. The data are generated from a cubic polynomial model. Plot title: k-Nearest Neighbors Regression as k Varies; horizontal axis: Regressor x.]


The slope of $\hat{m}_k(x)$ is flatter at the end points when $k = 25$ rather than $k = 5$. This illustrates a boundary problem in estimating $m(x)$ at the end points. For example, for the smallest regressor value $x_1$ there are no lower valued observations on $x$ to be included, and the average becomes a one-sided average $\hat{m}_k(x_1) = (y_1 + \cdots + y_{1+(k-1)/2})/[(k + 1)/2]$. Since for these data $m(x)$ is increasing in $x$ in this region, this leads to $\hat{m}_k(x_1)$ being an overestimate and the overstatement is increasing in $k$. Such boundary problems are reduced by instead using methods given in Section 9.6.2.
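
The moving average and its boundary behavior can be reproduced with a short sketch. The code below simulates data from (9.17) and computes the symmetrized k-NN average, truncating the window at the end points so that the average becomes one-sided there. The function names, the seed, and the use of Python's `random` module are illustrative assumptions, so the simulated draws will not match those behind Figure 9.5.

```python
import random


def simulate(seed=0, n=100):
    """Generate (x, y) from the cubic model (9.17) with N[0, 25^2] errors."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    y = [150 + 6.5 * xi - 0.15 * xi ** 2 + 0.001 * xi ** 3 + rng.gauss(0, 25)
         for xi in x]
    return x, y


def knn_smooth(y, k):
    """Symmetrized k-NN (running mean) of y, with y ordered by increasing x.

    k is assumed odd. Near the end points the window is truncated, so at the
    first observation the average uses only (k + 1)/2 values.
    """
    half = (k - 1) // 2
    n = len(y)
    fitted = []
    for i in range(n):
        window = y[max(0, i - half):min(n, i + half + 1)]
        fitted.append(sum(window) / len(window))
    return fitted
```

With a noiseless increasing sequence such as 1, 2, ..., 100, the fitted value at the first point is 2 for $k = 5$ and 7 for $k = 25$, compared with a true value of 1, which illustrates how the overstatement at the boundary grows with $k$.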


9.4.3.  Lowess  Regression  Example 

Using alternative weights to those used to form the symmetrized k-NN estimator can lead to better estimates of $m(x)$.

An example is the Lowess estimator, defined in Section 9.6.2. This provides a smoother estimate of $m(x)$ as it uses kernel weights rather than an indicator function, analogous to a kernel density estimate being smoother than a running histogram. It also has smaller bias (see Section 9.6.2), which is especially beneficial in estimating $m(x)$ at the end points.

Figure 9.6 plots, for data generated by (9.17), the Lowess estimate with $k = 25$. This local regression estimate is quite close to the true cubic conditional mean function, which is also drawn. Comparing Figure 9.6 to Figure 9.5 for symmetrized k-NN with $k = 25$, we see that Lowess regression leads to a much smoother regression function estimate and more precise estimation at the boundaries.


9.4.4.  Statistical  Inference 

When the error term is normally distributed and analysis is conditional on $x_1, \ldots, x_N$, the exact small-sample distribution of $\hat{m}(x_0)$ in (9.16) can be easily obtained.

[Figure 9.6: Nonparametric regression curve using Lowess, as well as a cubic regression curve. Same generated data as Figure 9.5. Plot title: Lowess Nonparametric Regression; horizontal axis: Regressor x.]


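Substituting $y_i = m(x_i) + \varepsilon_i$ into the definition of $\hat{m}(x_0)$ leads directly to

$$\hat{m}(x_0) - \sum_{i=1}^{N} w_{i0,h}\, m(x_i) = \sum_{i=1}^{N} w_{i0,h}\, \varepsilon_i,$$

which implies with fixed regressors, and if $\varepsilon_i$ are iid $\mathcal{N}[0, \sigma_\varepsilon^2]$, that

$$\hat{m}(x_0) \sim \mathcal{N}\left[ \sum_{i=1}^{N} w_{i0,h}\, m(x_i),\; \sigma_\varepsilon^2 \sum_{i=1}^{N} w_{i0,h}^2 \right]. \tag{9.18}$$

Note that in general $\hat{m}(x_0)$ is biased and the distribution is not necessarily centered around $m(x_0)$.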

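With stochastic regressors and nonnormal errors, we condition on $x_1, \ldots, x_N$ and apply a central limit theorem for U-statistics that is appropriate for double summations (see, for example, Pagan and Ullah, 1999, p. 359). Then for $\varepsilon_i$ iid $[0, \sigma_\varepsilon^2]$,

$$c(N) \sum_{i=1}^{N} w_{i0,h}\, \varepsilon_i \xrightarrow{d} \mathcal{N}\left[ 0,\; \sigma_\varepsilon^2 \lim c(N)^2 \sum_{i=1}^{N} w_{i0,h}^2 \right], \tag{9.19}$$

where $c(N)$ is a function of the sample size with $O(c(N)) < N^{1/2}$ that can vary with the local estimator. For example, $c(N) = \sqrt{Nh}$ for kernel regression and $c(N) = N^{0.4}$ for kernel regression with optimal bandwidth. Then

$$c(N) \bigl( \hat{m}(x_0) - m(x_0) - b(x_0) \bigr) \xrightarrow{d} \mathcal{N}\left[ 0,\; \sigma_\varepsilon^2 \lim c(N)^2 \sum_{i=1}^{N} w_{i0,h}^2 \right], \tag{9.20}$$

where the bias is $b(x_0) = \sum_i w_{i0,h}\, m(x_i) - m(x_0)$, so that the left-hand side equals $c(N) \sum_i w_{i0,h}\, \varepsilon_i$. Note that (9.20) yields (9.18) for the asymptotic distribution of $\hat{m}(x_0)$.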

Clearly, the distribution of $\hat{m}(x_0)$, a simple weighted average, can be obtained under alternative distributional assumptions. For example, for heteroskedastic errors the variance in (9.19) and (9.20) is replaced by $\lim c(N)^2 \sum_i \sigma_{\varepsilon,i}^2 w_{i0,h}^2$, which can be consistently estimated by replacing $\sigma_{\varepsilon,i}^2$ by the squared residual $(y_i - \hat{m}(x_i))^2$. Alternatively, one can bootstrap (see Section 11.6.5).


9.4.5.  Bandwidth  Choice 

Throughout this chapter we follow the nonparametric terminology that an estimator $\hat{\theta}$ of $\theta_0$ has convergence rate $N^{-r}$ if $\hat{\theta} = \theta_0 + O_p(N^{-r})$, so that $N^r(\hat{\theta} - \theta_0) = O_p(1)$ and ideally $N^r(\hat{\theta} - \theta_0)$ has a limit normal distribution. Note in particular that an estimator that is commonly called a $\sqrt{N}$-consistent estimator is converging at rate $N^{-1/2}$. Nonparametric estimators typically have a slower rate of convergence than this, with $r < 1/2$, because small bandwidth $h$ is needed to eliminate bias but then fewer than $N$ observations are being used to estimate $m(x_0)$.

As an example, consider the k-NN example of Section 9.4.2. Suppose $k = N^{4/5}$, so that for example $k = 251$ if $N = 1{,}000$. Then the estimator is consistent as the moving average uses $N^{4/5}/N = N^{-1/5}$ of the sample and is therefore collapsing around $x_0$ as $N \to \infty$. Using (9.18), the variance of the moving average estimator is $\sigma_\varepsilon^2 \sum_i w_{i0,k}^2 = \sigma_\varepsilon^2 \times k \times (1/k)^2 = \sigma_\varepsilon^2 \times 1/k = \sigma_\varepsilon^2 N^{-4/5}$, so in (9.19) $c(N) = \sqrt{k} = \sqrt{N^{4/5}} = N^{0.4}$, which is less than $N^{1/2}$. Other values of $k$ also ensure consistency, provided $k < O(N)$.

More generally, a range of values of the bandwidth parameter eliminates asymptotic bias, but smaller bandwidth increases variability. In this literature this trade-off is accounted for by minimizing mean-squared error, the sum of variance and bias squared.

Stone (1980) showed that if $\mathbf{x}$ is $k$ dimensional and $m(\mathbf{x})$ is $p$ times differentiable then the fastest possible rate of convergence for a nonparametric estimator of an $s$th-order derivative of $m(\mathbf{x})$ is $N^{-r}$, where $r = (p - s)/(2p + k)$. This rate decreases as the order of the derivative increases and as the dimension of $\mathbf{x}$ increases. It increases the more differentiable $m(\mathbf{x})$ is assumed to be, approaching $N^{-1/2}$ if $m(\mathbf{x})$ has derivatives of order approaching infinity. For scalar regression estimation of $m(x)$ it is customary to assume existence of $m''(x)$, in which case $r = 2/5$ and the fastest convergence rate is $N^{-0.4}$.
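
Stone's exponent is simple enough to tabulate directly. The helper below is a small illustrative sketch (the name `stone_rate` is hypothetical) that returns $r = (p - s)/(2p + k)$ as an exact fraction, which makes the effect of dimension and of the derivative order easy to see.

```python
from fractions import Fraction


def stone_rate(p, s=0, k=1):
    """Exponent r in the fastest rate N^(-r): r = (p - s) / (2p + k).

    p: number of derivatives m(x) is assumed to have
    s: order of the derivative being estimated (0 for m itself)
    k: dimension of x
    """
    if p <= s:
        raise ValueError("need p > s")
    return Fraction(p - s, 2 * p + k)
```

For example, `stone_rate(2)` gives 2/5 for scalar regression with two derivatives, `stone_rate(2, k=2)` gives 1/3 with two regressors, and `stone_rate(2, s=1)` gives 1/5 for the first derivative.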


9.5.  Kernel  Regression 

Kernel regression is a weighted average estimator using kernel weights. Issues such as bias and choice of bandwidth presented for kernel density estimation are also relevant here. However, there is less guidance for choice of bandwidth than in the density estimation case. Also, while we present kernel regression for pedagogical reasons, kernel local regression estimators are often used in practice (see Section 9.6).


9.5.1.  Kernel  Regression  Estimator 

The goal in kernel regression is to estimate the regression function $m(x)$ in the model $y = m(x) + \varepsilon$ defined in (9.15).

From Section 9.4.1, an obvious estimator of $m(x_0)$ is the average of the sample values $y_i$ of the dependent variable corresponding to the $x_i$'s close to $x_0$. A variation on this is to find the average of the $y_i$'s for all observations with $x_i$ within distance $h$ of $x_0$. This can be formally expressed as

$$\hat{m}(x_0) = \frac{\sum_{i=1}^{N} \mathbf{1}\left( \left| \frac{x_i - x_0}{h} \right| < 1 \right) y_i}{\sum_{i=1}^{N} \mathbf{1}\left( \left| \frac{x_i - x_0}{h} \right| < 1 \right)},$$

where as before $\mathbf{1}(A) = 1$ if event $A$ occurs and equals 0 otherwise. The numerator sums the $y$ values and the denominator gives the number of $y$ values that are summed.

This expression gives equal weights to all observations close to $x_0$, but it may be preferable to give the greatest weight at $x_0$ and decrease the weight as we move away. Thus more generally we consider a kernel weighting function $K(\cdot)$, introduced in Section 9.3.2. This yields the kernel regression estimator

$$\hat{m}(x_0) = \frac{\frac{1}{Nh} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) y_i}{\frac{1}{Nh} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right)}. \tag{9.21}$$

Several common kernel functions (uniform, Gaussian, Epanechnikov, and quartic) have already been given in Table 9.1.

The constant $h$ is called the bandwidth, and $2h$ is called the window width. The bandwidth plays the same role as $k$ in the k-NN example of Section 9.4.2.

The estimator (9.21) was proposed by Nadaraya (1964) and Watson (1964), who gave an alternative derivation. The conditional mean $m(x) = \int y f(y|x) dy = \int y [f(y, x)/f(x)] dy$, which can be estimated by $\hat{m}(x) = \int y [\hat{f}(y, x)/\hat{f}(x)] dy$, where $\hat{f}(y, x)$ and $\hat{f}(x)$ are bivariate and univariate kernel density estimators. It can be shown that this equals the estimator in (9.21). The statistics literature also considers kernel regression in the fixed design or fixed regressors case where $f(x)$ is known and need not be estimated, whereas we consider only the case of stochastic regressors that arises with observational data.

The kernel regression estimator is a special case of the weighted average (9.16), with weights

$$w_{i0,h} = \frac{\frac{1}{Nh} K\left( \frac{x_i - x_0}{h} \right)}{\frac{1}{Nh} \sum_{j=1}^{N} K\left( \frac{x_j - x_0}{h} \right)}, \tag{9.22}$$

which by construction sum over $i$ to one. The general results of Section 9.4 are relevant, but we give a more detailed analysis.
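
A minimal sketch of the estimator (9.21), written as the weighted average (9.16) with the weights (9.22), is given below. The function names are hypothetical, and the Epanechnikov kernel is assumed to take the form $K(z) = 0.75(1 - z^2)$ on $|z| < 1$. The common factor $1/(Nh)$ cancels between numerator and denominator, so it is omitted.

```python
import math


def gauss(u):
    """Standard normal (Gaussian) kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov(u):
    """Epanechnikov kernel, assumed form 0.75 (1 - u^2) on |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kernel_weights(x, x0, h, kernel=gauss):
    """Weights (9.22): kernel values normalized to sum to one."""
    raw = [kernel((xi - x0) / h) for xi in x]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no observations receive weight at x0; increase h")
    return [r / total for r in raw]


def nadaraya_watson(x, y, x0, h, kernel=gauss):
    """Kernel regression estimate (9.21) at the evaluation point x0."""
    return sum(w * yi for w, yi in zip(kernel_weights(x, x0, h, kernel), y))
```

With a compactly supported kernel such as the Epanechnikov, a small bandwidth can leave no observations inside the window, which is why the sketch raises an error rather than dividing by zero.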


9.5.2.  Statistical  Inference 

We present the distribution of the kernel regression estimator $\hat{m}(x)$ for given choice of $K(\cdot)$ and $h$, assuming the data $x$ are iid. We implicitly assume that regressors are continuous. With discrete regressors $\hat{m}(x_0)$ will still collapse on $m(x_0)$, and both $\hat{m}(x_0)$ in the limit and $m(x_0)$ are step functions.


Consistency

Consistency of $\hat{m}(x_0)$ for the conditional mean function $m(x_0)$ requires $h \to 0$, so that substantial weight is given only to $x_i$ very close to $x_0$. At the same time we need many $x_i$ close to $x_0$, so that many observations are used in forming the weighted average. Formally, $\hat{m}(x_0) \xrightarrow{p} m(x_0)$ if $h \to 0$ and $Nh \to \infty$ as $N \to \infty$.


Bias 

The kernel regression estimator is biased of size $O(h^2)$, with bias term

$$b(x_0) = h^2 \left( m'(x_0) \frac{f'(x_0)}{f(x_0)} + \frac{1}{2} m''(x_0) \right) \int z^2 K(z) dz \tag{9.23}$$

(see Section 9.8.2) assuming $m(x)$ is twice differentiable. As for kernel density estimation, the bias varies with the kernel function used. More importantly, the bias depends on the slope and curvature of the regression function $m(x_0)$ and the slope of the density $f(x_0)$ of the regressors, whereas for density estimation the bias depended only on the second derivative of $f(x_0)$. The bias can be particularly large at the end points, as illustrated in Section 9.4.2.

The bias can be reduced by using higher order kernels, defined in Section 9.3.3, and boundary modifications such as specific boundary kernels. Local polynomial regression and modifications such as Lowess (see Section 9.6.2) have the attraction that the term in (9.23) depending on $m'(x_0)$ drops out and that they perform well at the boundaries.


Asymptotic  Normality 


In Section 9.8.2 it is shown that, for $x_i$ iid with density $f(x_i)$, the kernel regression estimator has limit distribution

$$\sqrt{Nh}\,\bigl( \hat{m}(x_0) - m(x_0) - b(x_0) \bigr) \xrightarrow{d} \mathcal{N}\left[ 0,\; \frac{\sigma_\varepsilon^2}{f(x_0)} \int K(z)^2 dz \right]. \tag{9.24}$$

The variance term in (9.24) is larger for small $f(x_0)$, so as expected the variance of $\hat{m}(x_0)$ is larger in regions where $x$ is sparse.


9.5.3.  Bandwidth  Choice 

Incorporating values of $y_i$ for which $x_i \neq x_0$ into the weighted average introduces bias, since $E[y_i | x_i] = m(x_i) \neq m(x_0)$ for $x_i \neq x_0$. However, using these additional points reduces the variance of the estimator, since we are averaging over more data. The optimal bandwidth balances the trade-off between increased bias and decreased variance, using squared error loss. Unlike kernel density estimation, plug-in approaches are impractical and cross-validation is used more extensively.

For simplicity most studies focus on choosing one bandwidth for all values of $x_0$. Some methods with variable bandwidths, notably k-NN and Lowess, are given in Section 9.6.


Mean Integrated Squared Error

The local performance of $\hat{m}(\cdot)$ at $x_0$ is measured by the mean-squared error, given by

$$\mathrm{MSE}[\hat{m}(x_0)] = E[(\hat{m}(x_0) - m(x_0))^2],$$

where the expectation eliminates dependence of $\hat{m}(x_0)$ on $x$. Since MSE equals variance plus squared bias, the MSE can be obtained using (9.23) and (9.24).

Similar to Section 9.3.6, the integrated square error is

$$\mathrm{ISE}(h) = \int (\hat{m}(x_0) - m(x_0))^2 f(x_0) dx_0,$$

where $f(x)$ denotes the density of the regressors $x$, and the mean integrated square error, or equivalently the integrated mean-squared error, is

$$\mathrm{MISE}(h) = \int \mathrm{MSE}[\hat{m}(x_0)] f(x_0) dx_0.$$

Optimal  Bandwidth 

The optimal bandwidth $h^*$ minimizes $\mathrm{MISE}(h)$. This yields $h^* = O(N^{-0.2})$ since the bias is $O(h^2)$ from (9.23); the variance is $O((Nh)^{-1})$ from (9.24) since an $O(1)$ variance is obtained after scaling $\hat{m}(x_0)$ by $\sqrt{Nh}$; and for bias squared and variance to be of the same order $(h^2)^2 = (Nh)^{-1}$ or $h = N^{-0.2}$. The kernel estimate then converges to $m(x_0)$ at rate $(Nh^*)^{-1/2} = N^{-0.4}$ rather than the usual $N^{-0.5}$ for parametric analysis.

Plug-in  Bandwidth  Estimate 

One can obtain an exact expression for $h^*$ that minimizes $\mathrm{MISE}(h)$, using calculus methods similar to those in Section 9.3.6 for the kernel density estimator. Then $h^*$ depends on the bias and variance expressions in (9.23) and (9.24).

A plug-in approach calculates $h^*$ using estimates of these unknowns. However, estimation of $m''(x)$, for example, requires nonparametric methods that in turn require an initial bandwidth choice, but $h^*$ also depends on unknowns such as $m''(x)$. Given these complications one should be wary of plug-in estimates. More common is to use cross-validation, presented in the following.

It can also be shown that $\mathrm{MISE}(h^*)$ is minimized if the Epanechnikov kernel is used (see Härdle, 1990, p. 186, or Härdle and Linton, 1994, p. 2321), though as in the kernel density case $\mathrm{MISE}(h^*)$ is not much larger for other kernels. The key issue is determination of $h^*$, which will vary with kernel and the data.

Cross-Validation 

An  empirical  estimate  of  the  optimal  h  can  be  obtained  by  the  leave-one-out  cross- 
validation  procedure.  This  chooses  h*  that  minimizes 

$$\mathrm{CV}(h) = \sum_{i=1}^{N} (y_i - \hat{m}_{-i}(x_i))^2 \pi(x_i), \tag{9.25}$$
where $\pi(x_i)$ is a weighting function (discussed in the following) and

$$\hat{m}_{-i}(x_i) = \sum_{j \neq i} w_{ji,h}\, y_j \Big/ \sum_{j \neq i} w_{ji,h} \tag{9.26}$$

is a leave-one-out estimate of $m(x_i)$ obtained by the kernel formula (9.21), or more
generally by a weighted procedure (9.16), with the modification that $y_i$ is dropped.

Cross-validation  is  not  as  computationally  intensive  as  it  first  appears.  It  can  be 
shown  that 


$$y_i - \hat{m}_{-i}(x_i) = \frac{y_i - \hat{m}(x_i)}{1 - \big[ w_{ii,h} / \sum_j w_{ji,h} \big]}, \tag{9.27}$$

so that for each value of $h$ cross-validation requires only one computation of the
weighted averages $\hat{m}(x_i)$, $i = 1, \ldots, N$.
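
The identity (9.27) can be checked numerically. The Python sketch below is illustrative only: the Gaussian kernel, the simulated data, the bandwidth grid, and all function names are assumptions rather than part of the text, and $\pi(x_i) = 1$ for every observation. It computes CV($h$) once through (9.27) and once by literally dropping each observation as in (9.26).

```python
import math
import random

def gaussian_kernel(z):
    """Standard normal density, used here as the kernel K(z) (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def nw_fit(x, y, h):
    """Fitted values m_hat(x_i) and own-weight shares w_ii / sum_j w_ji."""
    fitted, own_share = [], []
    for xi in x:
        w = [gaussian_kernel((xj - xi) / h) for xj in x]
        total = sum(w)
        fitted.append(sum(wj * yj for wj, yj in zip(w, y)) / total)
        own_share.append(gaussian_kernel(0.0) / total)
    return fitted, own_share

def cv_shortcut(x, y, h):
    """CV(h) of (9.25) with pi(x_i) = 1, using the identity (9.27)."""
    fitted, own_share = nw_fit(x, y, h)
    return sum(((yi - mi) / (1.0 - si)) ** 2
               for yi, mi, si in zip(y, fitted, own_share))

def cv_brute_force(x, y, h):
    """CV(h) computed by dropping observation i explicitly, as in (9.26)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        num = den = 0.0
        for j, (xj, yj) in enumerate(zip(x, y)):
            if j != i:
                w = gaussian_kernel((xj - xi) / h)
                num += w * yj
                den += w
        total += (yi - num / den) ** 2
    return total

# Simulated data (hypothetical design): y = sin(x) + noise
random.seed(0)
x = [random.uniform(0.0, 10.0) for _ in range(80)]
y = [math.sin(xi) + random.gauss(0.0, 0.3) for xi in x]

bandwidths = [0.3, 0.6, 1.2, 2.4]
cv_values = {h: cv_shortcut(x, y, h) for h in bandwidths}
h_cv = min(cv_values, key=cv_values.get)  # bandwidth on the grid with smallest CV(h)
print(cv_values, h_cv)
```

The shortcut needs one pass of weighted averages per bandwidth, whereas the brute-force version repeats the averaging with each observation removed.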

The weights $\pi(x_i)$ are introduced to potentially downweight the end points, which
otherwise may receive too much importance since local weighted estimates can be
quite highly biased at the end points, as illustrated in Section 9.4.2. For example,
observations with $x_i$ outside the 5th to 95th percentiles may not be used in calculating
CV($h$), in which case $\pi(x_i) = 0$ for these observations and $\pi(x_i) = 1$ otherwise. The
term cross-validation is used as it validates the ability to predict the $i$th observation
using all the other observations in the data set. The $i$th observation is dropped because if
instead it were additionally used in the prediction, then CV($h$) would be trivially
minimized when $\hat{m}(x_i) = y_i$, $i = 1, \ldots, N$. CV($h$) is also called the estimated prediction
error.

Härdle and Marron (1985) showed that minimizing CV($h$) is asymptotically equivalent
to minimizing a modification of ISE($h$) and MISE($h$). The modification includes the
weight function $\pi(x_0)$ in the integrand, as well as the averaged squared error (ASE)
$N^{-1} \sum_i (\hat{m}(x_i) - m(x_i))^2 \pi(x_i)$, which is a discrete sample approximation to ISE($h$).
The measure CV($h$) converges at the slow rate of $O(N^{-0.1})$, however, so CV($h$) can be
quite variable in finite samples.


Generalized  Cross-Validation 

An alternative to leave-one-out cross-validation is to use a measure similar to CV($h$)
but one that more simply uses $\hat{m}(x_i)$ rather than $\hat{m}_{-i}(x_i)$ and then adds a model
complexity penalty that increases as the bandwidth $h$ decreases. This leads to

$$\mathrm{PV}(h) = \sum_{i=1}^{N} (y_i - \hat{m}(x_i))^2 \pi(x_i)\, p(w_{ii,h}),$$

where $p(\cdot)$ is the penalty function and $w_{ii,h}$ is the weight given to the $i$th observation
in $\hat{m}(x_i) = \sum_j w_{ji,h} y_j$.

A popular example is the generalized cross-validation measure that uses the
penalty function $p(w_{ii,h}) = (1 - w_{ii,h})^{-2}$. Other penalties are given in Härdle (1990,
p. 167) and Härdle and Linton (1994, p. 2323).
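
A penalized criterion of this form is straightforward to code. The following Python sketch is a hedged illustration, not a reference implementation: the Gaussian kernel, the deterministic test data, $\pi(x_i) = 1$, and the function names are all assumptions. The penalty is passed in as a function so that other choices can be substituted.

```python
import math

def gaussian_kernel(z):
    """Standard normal density used as K(z) (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def penalized_criterion(x, y, h, penalty):
    """PV(h) = sum_i (y_i - m_hat(x_i))^2 * penalty(w_ii,h), with pi(x_i) = 1."""
    total = 0.0
    for xi, yi in zip(x, y):
        k = [gaussian_kernel((xj - xi) / h) for xj in x]
        k_sum = sum(k)
        m_hat = sum(kj * yj for kj, yj in zip(k, y)) / k_sum
        w_ii = gaussian_kernel(0.0) / k_sum  # normalized weight on own observation
        total += (yi - m_hat) ** 2 * penalty(w_ii)
    return total

def gcv_penalty(w):
    """Penalty (1 - w)^(-2); it grows as h shrinks and w_ii approaches one."""
    return (1.0 - w) ** -2

def no_penalty(w):
    """Unpenalized residual sum of squares, for comparison."""
    return 1.0

# Deterministic illustrative data: smooth signal plus a rapid oscillation
x = [i / 4.0 for i in range(41)]
y = [math.sin(xi) + 0.2 * math.cos(7.0 * xi) for xi in x]

results = {h: (penalized_criterion(x, y, h, no_penalty),
               penalized_criterion(x, y, h, gcv_penalty))
           for h in (0.1, 0.3, 1.0)}
for h, (rss, pv) in results.items():
    print(h, rss, pv)
```

Without the penalty the residual sum of squares favors very small bandwidths; the penalty offsets this by inflating the criterion when each fitted value leans heavily on its own observation.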

Cross-Validation Example

For the local running average example in Section 9.4.2, CV($k$) = 54,811, 56,666,
63,456, 65,605, and 69,939 for $k$ = 3, 5, 7, 9, and 25, respectively. In this case all
observations were used to calculate CV($k$), with $\pi(x_i) = 1$, despite possible end-point
problems. There is no real gain after $k = 5$, though from Figure 9.5 this value
produced too rough an estimate and in practice one would choose a higher value of $k$ to
get a smoother curve.

More generally, cross-validation is by no means perfect and it is common to “eyeball”
fitted nonparametric curves to select $h$ to achieve a desired degree of smoothness.


Trimming 

The denominator of the kernel estimator in (9.21) is $\hat{f}(x_0)$, the kernel estimate of the
density of the regressor at $x_0$. At some evaluation points $\hat{f}(x_i)$ can be very small,
leading to a very large estimate $\hat{m}(x_i)$. Trimming eliminates or greatly downweights
all points with $\hat{f}(x_i) < b$, say, where $b \to 0$ at an appropriate rate as $N \to \infty$. Such
problems are most likely to occur in the tails of the distribution. For nonparametric
estimation one can just focus on estimation of $m(x_i)$ for more central values of $x_i$, and
values in the tails may be downweighted in cross-validation. However, the semiparametric
methods of Section 9.7 can entail computation of $\hat{m}(x_i)$ at all values of $x_i$, in
which case it is not unusual to trim. Ideally, the trimming function should make no
difference asymptotically, though it will make a difference in finite samples.


9.5.4.  Confidence  Intervals 

Kernel regression estimates should generally be presented with pointwise confidence
intervals. A simple procedure is to present pointwise confidence intervals for $m(x_0)$
evaluated at, for example, $x_0$ equal to the first through ninth deciles of $x$.

If the bias $b(x_0)$ in $\hat{m}(x_0)$ is ignored, (9.24) yields the following 95% confidence
interval:

$$m(x_0) \in \hat{m}(x_0) \pm 1.96 \sqrt{\frac{1}{Nh} \frac{\hat{\sigma}_\varepsilon^2}{\hat{f}(x_0)} \int K(z)^2 dz},$$

where $\hat{\sigma}_\varepsilon^2 = \sum_i w_{i0,h} \hat{\varepsilon}_i^2$, $w_{i0,h}$ is defined in (9.22), and $\hat{f}(x_0)$ is the kernel density
estimate at $x_0$. This estimate assumes homoskedastic errors, though it is likely to be
somewhat robust to heteroskedasticity since observations close to $x_0$ are given the
greatest weight. Alternatively, from the discussion after (9.20) a heteroskedastic robust
95% confidence interval is $\hat{m}(x_0) \pm 1.96 \hat{s}_0$, where $\hat{s}_0^2 = \sum_i w_{i0,h}^2 \hat{\varepsilon}_i^2$.

As in the kernel density case, the bias in $\hat{m}(x_0)$ should not be ignored. As already
noted, estimation of the bias is difficult. Instead, the standard procedure is to
undersmooth, with smaller bandwidth $h$ satisfying $h = o(N^{-0.2})$ rather than the optimal
$h^* = O(N^{-0.2})$.
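
The two interval formulas can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a Gaussian kernel (for which $\int K(z)^2 dz = 1/(2\sqrt{\pi})$), simulated data, residuals taken as $y_i - \hat{m}(x_i)$, and a hypothetical undersmoothed bandwidth proportional to $N^{-0.25}$. None of the names or constants come from the text.

```python
import math
import random

# Integral of K(z)^2 dz for the Gaussian kernel
GAUSS_ROUGHNESS = 1.0 / (2.0 * math.sqrt(math.pi))

def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_weights(x, x0, h):
    """Normalized weights w_{i0,h} (summing to one) and the raw kernel sum."""
    k = [gaussian_kernel((xi - x0) / h) for xi in x]
    k_sum = sum(k)
    return [ki / k_sum for ki in k], k_sum

def pointwise_ci(x, y, x0, h, resid):
    """Return m_hat(x0), homoskedastic s.e., and heteroskedasticity-robust s.e."""
    n = len(x)
    w, k_sum = kernel_weights(x, x0, h)
    m0 = sum(wi * yi for wi, yi in zip(w, y))
    f0 = k_sum / (n * h)  # kernel density estimate at x0
    sigma2 = sum(wi * ei ** 2 for wi, ei in zip(w, resid))
    se_homo = math.sqrt(sigma2 * GAUSS_ROUGHNESS / (n * h * f0))
    se_robust = math.sqrt(sum(wi ** 2 * ei ** 2 for wi, ei in zip(w, resid)))
    return m0, se_homo, se_robust

# Simulated data (hypothetical design)
random.seed(3)
n = 200
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [math.sin(xi) + random.gauss(0.0, 0.3) for xi in x]

# Undersmoothed bandwidth: shrinks faster than N^(-0.2); the constant 2.9 is an assumption
h = 2.9 * n ** -0.25

# Residuals from the kernel fit at the sample points
resid = []
for xi, yi in zip(x, y):
    w, _ = kernel_weights(x, xi, h)
    resid.append(yi - sum(wj * yj for wj, yj in zip(w, y)))

intervals = {x0: pointwise_ci(x, y, x0, h, resid) for x0 in (2.0, 5.0, 8.0)}
for x0, (m0, se_h, se_r) in intervals.items():
    print(x0, m0 - 1.96 * se_h, m0 + 1.96 * se_h, m0 - 1.96 * se_r, m0 + 1.96 * se_r)
```

Because the bias is ignored, these intervals are centered on $\hat{m}(x_0)$ and rely on undersmoothing to keep the bias small relative to the standard error.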

Härdle (1990) gives a detailed presentation of confidence intervals, including uniform
confidence bands rather than pointwise intervals, and the bootstrap methods given
in Section 11.6.5.

9.5.5.  Derivative  Estimation 

In  regression  we  are  often  interested  in  how  the  conditional  mean  of  y  changes  with 
changes  in  x,  the  marginal  effect,  rather  than  the  conditional  mean  per  se. 

Kernel estimates can be easily used to form the derivative. The general result is that
the $s$th derivative of the kernel regression estimate, $\hat{m}^{(s)}(x_0)$, is consistent for $m^{(s)}(x_0)$,
the $s$th derivative of the conditional mean $m(x_0)$. Either calculus or finite-difference
approaches can be taken.

As an example, consider estimation of the first derivative in the generated-data
example of the previous section. Let $z_1, \ldots, z_N$ denote the ordered points at which
the kernel regression function is evaluated and $\hat{m}(z_1), \ldots, \hat{m}(z_N)$ denote the estimates
at these points. A finite-difference estimate is $\hat{m}'(z_i) = [\hat{m}(z_i) - \hat{m}(z_{i-1})]/[z_i - z_{i-1}]$.

This is plotted in Figure 9.7, along with the true derivative, which for the dgp given
in (9.17) is the quadratic $m'(z_i) = 6.5 - 0.30 z_i + 0.003 z_i^2$. As expected the derivative
estimate is somewhat noisy, but it picks up the essentials. Derivative estimates should
be based on oversmoothed estimates of the conditional mean. For further details see
Pagan and Ullah (1999, chapter 4). Härdle (1990, p. 160) presents adaptation of
cross-validation to derivative estimation.
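
The finite-difference step itself is a one-line computation. In the Python sketch below the function names are hypothetical, and the differences are applied to exact values of a cubic whose derivative is the quadratic quoted above (the intercept is omitted since it does not affect the derivative). In practice the input would be the kernel estimates $\hat{m}(z_i)$, so this isolates the discretization error only.

```python
def finite_difference_derivative(z, m_values):
    """Backward differences [m(z_i) - m(z_{i-1})] / [z_i - z_{i-1}], i = 2, ..., N."""
    return [(m_values[i] - m_values[i - 1]) / (z[i] - z[i - 1])
            for i in range(1, len(z))]

def cubic_mean(z):
    """Cubic whose derivative is 6.5 - 0.30 z + 0.003 z^2 (intercept omitted)."""
    return 6.5 * z - 0.15 * z ** 2 + 0.001 * z ** 3

def true_derivative(z):
    return 6.5 - 0.30 * z + 0.003 * z ** 2

# Ordered evaluation points; in an application m_values would hold kernel estimates
z = [float(i) for i in range(1, 101)]
m_values = [cubic_mean(zi) for zi in z]

d_hat = finite_difference_derivative(z, m_values)
max_gap = max(abs(d - true_derivative(zi)) for d, zi in zip(d_hat, z[1:]))
print(len(d_hat), max_gap)
```

With estimated rather than exact values, differencing amplifies noise in $\hat{m}(z_i)$, which is one way to see why an oversmoothed estimate of the conditional mean is recommended.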

In addition to the local derivative $m'(x_0)$ we may also be interested in the average
derivative $\mathrm{E}[m'(x)]$. The average derivative estimator given in Section 9.7.4 provides
a $\sqrt{N}$-consistent and asymptotically normal estimate of $\mathrm{E}[m'(x)]$.

9.5.6.  Conditional  Moment  Estimation 

The kernel regression methods for the conditional mean $\mathrm{E}[y|x] = m(x)$ can be
extended to nonparametric estimation of other conditional moments.


[Figure 9.7 (plot not reproduced). Panel title: Nonparametric Derivative Estimation. Horizontal axis: Regressor x.]

Figure 9.7: Nonparametric derivative estimate using previously estimated Lowess
regression curve, as well as using a cubic regression curve. Same generated data as
Figure 9.5.

For raw conditional moments such as $\mathrm{E}[y^k|x]$ we use the weighted average

$$\hat{\mathrm{E}}[y^k | x_0] = \sum_{i=1}^{N} w_{i0,h}\, y_i^k, \tag{9.28}$$

where the weights $w_{i0,h}$ may be the same weights as used for estimation of $m(x_0)$.

Central conditional moments can then be computed by reexpressing them as
weighted sums of raw moments. For example, since $\mathrm{V}[y|x] = \mathrm{E}[y^2|x] - (\mathrm{E}[y|x])^2$, the
conditional variance can be estimated by $\hat{\mathrm{E}}[y^2|x_0] - \hat{m}(x_0)^2$. One expects that higher
order conditional moments will be estimated with more noise than will be the
conditional mean.
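
As a hedged illustration of (9.28), the Python sketch below estimates a conditional variance from the first two raw conditional moments. The Gaussian kernel, the bandwidth, the heteroskedastic simulated design, and the function names are assumptions made for this example only.

```python
import math
import random

def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_weights(x, x0, h):
    """Normalized kernel weights w_{i0,h}, summing to one."""
    k = [gaussian_kernel((xi - x0) / h) for xi in x]
    k_sum = sum(k)
    return [ki / k_sum for ki in k]

def raw_moment(x, y, x0, h, k):
    """Estimate of E[y^k | x0] as the weighted average in (9.28)."""
    w = kernel_weights(x, x0, h)
    return sum(wi * yi ** k for wi, yi in zip(w, y))

def conditional_variance(x, y, x0, h):
    """E_hat[y^2 | x0] - m_hat(x0)^2, using the same weights for both moments."""
    return raw_moment(x, y, x0, h, 2) - raw_moment(x, y, x0, h, 1) ** 2

# Hypothetical heteroskedastic design: error standard deviation 0.2 + 0.3 x
random.seed(1)
x = [random.uniform(0.0, 4.0) for _ in range(300)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 0.2 + 0.3 * xi) for xi in x]

var_low = conditional_variance(x, y, 0.5, 0.4)
var_high = conditional_variance(x, y, 3.5, 0.4)
print(var_low, var_high)
```

Using identical nonnegative weights for both moments keeps the variance estimate nonnegative, since it is then the weighted variance of the $y_i$ around $\hat{m}(x_0)$.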


9.5.7.  Multivariate  Kernel  Regression 


We have focused on kernel regression on a single regressor. For regression of scalar $y$
on a $k$-dimensional vector $x$, that is, $y_i = m(x_i) + \varepsilon_i = m(x_{1i}, \ldots, x_{ki}) + \varepsilon_i$, the kernel
estimator of $m(x_0)$ becomes

$$\hat{m}(x_0) = \frac{\frac{1}{Nh^k} \sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) y_i}{\frac{1}{Nh^k} \sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)},$$

where $K(\cdot)$ is now a multivariate kernel. Often $K(\cdot)$ is the product of $k$
one-dimensional kernels, though multivariate kernels such as the multivariate normal
density can be used.

If a product kernel is used the regressors should be transformed to a common scale
by dividing by the standard deviation. Then the cross-validation measure (9.25) can
be used to determine a common optimal bandwidth $h^*$, though determining which $x_i$
should be downweighted as the result of closeness to the end points is more
complicated when $x$ is multivariate. Alternatively, regressors need not be rescaled, but then
different bandwidths should be used for each regressor.
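
A product-kernel estimator with rescaled regressors might look as follows in Python. This is a sketch under assumptions: Gaussian one-dimensional kernels, rescaling by the sample standard deviation, a common bandwidth, and a small hypothetical two-regressor grid; the function names are not from the text.

```python
import math

def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def column_sd(rows, j):
    """Sample standard deviation of regressor j."""
    vals = [r[j] for r in rows]
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))

def product_kernel_regression(X, y, x0, h):
    """Kernel regression at x0 with a product of Gaussian kernels.

    Each regressor is divided by its standard deviation, so a single
    bandwidth h applies on the common scale.
    """
    k = len(x0)
    sd = [column_sd(X, j) for j in range(k)]
    num = den = 0.0
    for row, yi in zip(X, y):
        w = 1.0
        for j in range(k):
            w *= gaussian_kernel((row[j] - x0[j]) / (sd[j] * h))
        num += w * yi
        den += w
    return num / den

# Hypothetical grid: second regressor measured on a scale ten times larger
X = [(float(a), 10.0 * b) for a in range(-3, 4) for b in range(-3, 4)]
y = [1.0 + 2.0 * x1 - 0.1 * x2 for x1, x2 in X]

m_center = product_kernel_regression(X, y, (0.0, 0.0), 0.5)
print(m_center)
```

Without the rescaling, a common bandwidth would smooth far more heavily in the direction of the regressor with the smaller scale.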

The asymptotic results and expressions are similar to those considered before, as the
estimate is again a local average of the $y_i$. The bias $b(x_0)$ is again $O(h^2)$ as before, but
the variance of $\hat{m}(x_0)$ is $O((Nh^k)^{-1})$, which declines more slowly than in the one-dimensional
case since essentially a smaller fraction of the sample is being used to form $\hat{m}(x_0)$.
Then

$$\sqrt{Nh^k}\,(\hat{m}(x_0) - m(x_0) - b(x_0)) \xrightarrow{d} \mathcal{N}\!\left[0, \frac{\sigma_\varepsilon^2}{f(x_0)} \int K(z)^2 dz\right].$$

The optimal bandwidth choice is $h^* = O(N^{-1/(k+4)})$, which is larger than $O(N^{-0.2})$ in
the one-dimensional case. The corresponding optimal rate of convergence of $\hat{m}(x_0)$ is
$N^{-2/(k+4)}$.

This result and the earlier scalar result assume that $m(x)$ is twice differentiable, a
necessary assumption to obtain the bias term in (9.23). If $m(x)$ is instead $p$ times
differentiable then kernel estimation using a $p$th order kernel (see Section 9.3.3) reduces
the order of the bias, leading to smaller $h^*$ and faster rates of convergence that attain
Stone's bound given in Section 9.4.5; see Härdle (1990, p. 93) for further details. Other
nonparametric estimators given in the next section can also attain Stone's bound.

The convergence rate decreases as the number of regressors increases, approaching
$N^0$ as the number of regressors approaches infinity. This curse of dimensionality
greatly restricts the use of nonparametric methods in regression models with several
regressors. Semiparametric models (see Section 9.7) place additional structure so that
the nonparametric components are of low dimension.
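
The orders $h^* = O(N^{-1/(k+4)})$ and $N^{-2/(k+4)}$ can be tabulated to see how quickly the curse of dimensionality bites. The short Python sketch below ignores all constants of proportionality, and its function names and the sample size of 1,000 are purely illustrative assumptions.

```python
def optimal_bandwidth_exponent(k):
    """Exponent a in h* = O(N^a) with k regressors and twice-differentiable m(x)."""
    return -1.0 / (k + 4)

def convergence_rate_exponent(k):
    """Exponent a in the rate N^a at which m_hat(x0) converges to m(x0)."""
    return -2.0 / (k + 4)

def equivalent_sample_size(n_scalar, k):
    """Sample size with k regressors whose rate N^(-2/(k+4)) matches n_scalar^(-2/5)."""
    return n_scalar ** ((k + 4) / 5.0)

for k in (1, 2, 5, 10):
    print(k,
          convergence_rate_exponent(k),
          1000 ** convergence_rate_exponent(k),
          equivalent_sample_size(1000, k))
```

For instance, matching the order of accuracy obtained from 1,000 observations on one regressor requires on the order of $1000^2$ observations with six regressors, constants aside.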


9.5.8.  Tests  of  Parametric  Models 

An  obvious  test  of  correct  specification  of  a  parametric  model  of  the  conditional  mean 
is  to  compare  the  fitted  mean  with  that  obtained  from  a  nonparametric  model. 

Let $\hat{m}_\theta(x)$ denote a parametric estimator of $\mathrm{E}[y|x]$ and $\hat{m}_h(x)$ denote a
nonparametric estimator such as a kernel estimator. One approach is to compare $\hat{m}_\theta(x)$ with $\hat{m}_h(x)$
at a range of values of $x$. This is complicated by the need to correct for asymptotic
bias in $\hat{m}_h(x)$ (see Härdle and Mammen, 1993). A second approach is to consider
conditional moment tests of the form $N^{-1} \sum_i w_i (y_i - \hat{m}_\theta(x_i))$, where different weights,
based in part on kernel regression, test failure of $\mathrm{E}[y|x] = m_\theta(x)$ in different
directions. For example, Horowitz and Härdle (1994) use $w_i = \hat{m}_h(x_i) - \hat{m}_\theta(x_i)$. Pagan
and Ullah (1999, pp. 141-150) and Yatchew (2003, pp. 119-124) survey some of the
methods used.


9.6.  Alternative  Nonparametric  Regression  Estimators 

Section 9.4 introduced local regression methods that estimate the regression function
$m(x_0)$ by a local weighted average $\hat{m}(x_0) = \sum_i w_{i0,h} y_i$, where the weights $w_{i0,h} =
w(x_i, x_0, h)$ differ with the point of evaluation $x_0$ and the sample value of $x_i$. Section
9.5 presented detailed results when the weights are kernel weights.

Here  we  consider  other  commonly  used  local  estimators  that  correspond  to  other 
weights.  Many  of  the  results  of  Section  9.5  carry  through,  with  similar  optimal  rates 
of  convergence  and  use  of  cross-validation  for  bandwidth  selection,  though  the  exact 
expressions  for  bias  and  variance  differ  from  those  in  (9.23)  and  (9.24).  The  estimators 
given  in  Section  9.6.2  are  especially  popular. 


9.6.1.  Nearest  Neighbors  Estimator 

The $k$-nearest neighbor estimator is the equally weighted average of the $y$ values for
the $k$ observations of $x_i$ closest to $x_0$. Define $N_k(x_0)$ to be the set of $k$ observations of
$x_i$ closest to $x_0$. Then

$$\hat{m}_{k\text{-NN}}(x_0) = \frac{1}{k} \sum_{i=1}^{N} \mathbf{1}(x_i \in N_k(x_0))\, y_i. \tag{9.29}$$

This estimator is a kernel estimator with uniform weights (see Table 9.1) except that
the bandwidth is variable. Here the bandwidth $h_0$ at $x_0$ equals the distance between
$x_0$ and the furthest of the $k$ nearest neighbors, and more formally $h_0 \simeq k/(2N f(x_0))$.
The quantity $k/N$ is called the span. Smoother curves can be obtained by using kernel
weights in (9.29).

The estimator has the attraction of providing a simple rule for variable bandwidth
selection. It is computationally faster to use a symmetrized version that uses the $k/2$
nearest neighbors to the left and a similar number to the right, which is the local
running average method used in Section 9.4.2. Then one can use an updating formula on
observations ordered by increasing $x_i$, as then one observation leaves the data and one
enters as $x_0$ increases.
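
The estimator in (9.29) and its implied variable bandwidth can be sketched in a few lines of Python. The data below are a tiny hypothetical example chosen so the arithmetic can be followed by hand; the function name is an assumption.

```python
def knn_estimate(x, y, x0, k):
    """Equally weighted average of y over the k observations closest to x0, as in (9.29).

    Also returns h0, the distance from x0 to the furthest of the k neighbors.
    """
    order = sorted(range(len(x)), key=lambda i: abs(x[i] - x0))
    neighbors = order[:k]
    m0 = sum(y[i] for i in neighbors) / k
    h0 = max(abs(x[i] - x0) for i in neighbors)
    return m0, h0

# Hypothetical data: regressor values are dense near 3 and sparse near 8
x = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 90.0]

m_center, h_center = knn_estimate(x, y, 3.1, 3)  # neighbors at x = 3, 4, 2
m_sparse, h_sparse = knn_estimate(x, y, 8.0, 3)  # neighbors at x = 9, 5, 4
span = 3 / len(x)
print(m_center, h_center, m_sparse, h_sparse, span)
```

The implied bandwidth widens automatically where the regressor is sparse, which is the variable-bandwidth property noted above.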


9.6.2.  Local  Linear  Regression  and  Lowess 

The kernel regression estimator is a local constant estimator because it assumes that
$m(x)$ equals a constant in the local neighborhood of $x_0$. Instead, one can let $m(x)$ be
linear in the neighborhood of $x_0$, so that $m(x) = a_0 + b_0(x - x_0)$ in the neighborhood
of $x_0$.

To implement this idea, note that the kernel regression estimator $\hat{m}(x_0)$ can be
obtained by minimizing $\sum_i K((x_i - x_0)/h)(y_i - m_0)^2$ with respect to $m_0$. The local
linear regression estimator minimizes

$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) (y_i - a_0 - b_0(x_i - x_0))^2, \tag{9.30}$$

with respect to $a_0$ and $b_0$, where $K(\cdot)$ is a kernel weighting function. Then $\hat{m}(x) =
\hat{a}_0 + \hat{b}_0(x - x_0)$ in the neighborhood of $x_0$. The estimate at exactly $x_0$ is then $\hat{m}(x_0) =
\hat{a}_0$, and $\hat{b}_0$ provides an estimate of the first derivative $m'(x_0)$. More generally, a local
polynomial estimator of degree $p$ minimizes

$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) \left(y_i - a_{0,0} - a_{0,1}(x_i - x_0) - \cdots - a_{0,p} \frac{(x_i - x_0)^p}{p!}\right)^2, \tag{9.31}$$

yielding $\hat{m}^{(s)}(x_0) = \hat{a}_{0,s}$.

Fan and Gijbels (1996) list many properties and attractions of this method.
Estimation entails only weighted least-squares regression at each evaluation point $x_0$. The
estimators can be expressed as a weighted average of the $y_i$, since they are LS estimators.
The local linear estimator has bias term $b(x_0) = h^2 \left(\tfrac{1}{2} m''(x_0)\right) \int z^2 K(z) dz$, which,
unlike the bias for kernel regression given in (9.23), does not depend on $m'(x_0)$. This
is especially beneficial for overcoming the boundary problems illustrated in Section
9.4.2. For estimating an $s$th-order derivative a good choice of $p$ is $p = s + 1$ so that,
for example, one uses a local quadratic estimator to estimate the first derivative.
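
The boundary advantage of the local linear estimator can be seen in a small Python sketch. It solves the two normal equations of the weighted least-squares problem (9.30) in closed form and compares the result with the local constant estimator at the left end point. The Gaussian kernel, the noise-free linear data (used to isolate bias), the bandwidth, and the function names are all assumptions for illustration.

```python
import math

def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def local_constant(x, y, x0, h):
    """Kernel (local constant) regression estimate at x0."""
    w = [gaussian_kernel((xi - x0) / h) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def local_linear(x, y, x0, h):
    """Minimize (9.30): kernel-weighted LS of y on (1, x - x0).

    Returns (a0_hat, b0_hat): the estimates of m(x0) and m'(x0).
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        w = gaussian_kernel((xi - x0) / h)
        d = xi - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    a0 = (s2 * t0 - s1 * t1) / det
    b0 = (s0 * t1 - s1 * t0) / det
    return a0, b0

# Noise-free linear conditional mean on a grid over [0, 10]
x = [i / 10.0 for i in range(101)]
y = [2.0 + 3.0 * xi for xi in x]

a_left, b_left = local_linear(x, y, 0.0, 1.0)   # at the left boundary
nw_left = local_constant(x, y, 0.0, 1.0)
print(a_left, b_left, nw_left)
```

At the boundary all neighbors lie to one side, so the local constant fit averages values of $m(x)$ that are systematically above $m(x_0)$ here, whereas the local linear fit recovers both the level and the slope.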

A standard local regression estimator is the locally weighted scatterplot smoothing
or Lowess estimator of Cleveland (1979). This is a variant of local polynomial
estimation that in (9.31) uses a variable bandwidth $h_{0,k}$ determined by the distance from $x_0$ to
its $k$th nearest neighbor; uses the tricubic kernel $K(z) = (70/81)(1 - |z|^3)^3 \mathbf{1}(|z| < 1)$;
and downweights observations with large residuals $y_i - \hat{m}(x_i)$, which requires passing
through the data $N$ times. For a summary see Fan and Gijbels (1996, p. 24). Lowess
is attractive compared to kernel regression as it uses a variable bandwidth, robustifies
against outliers, and uses a local polynomial estimator to minimize boundary
problems. However, it is computationally intensive.

Another popular variation is the supersmoother of Friedman (1984) (see Härdle,
1990, p. 181). The starting point is symmetrized $k$-NN, using local linear fit rather than
local  constant  fit  for  better  fit  at  the  boundary.  Rather  than  use  a  fixed  span  or  fixed 
k,  however,  the  supersmoother  is  a  variable  span  smoother  where  the  variable  span  is 
determined  by  local  cross-validation  that  entails  nine  passes  over  the  data.  Compared 
to  Lowess  the  supersmoother  does  not  robustify  against  outliers,  but  it  permits  the  span 
to  vary  and  is  fast  to  compute. 


9.6.3.  Smoothing  Spline  Estimator 

The cubic smoothing spline estimator $\hat{m}_\lambda(x)$ minimizes the penalized residual sum
of squares

$$\mathrm{PRSS}(\lambda) = \sum_{i=1}^{N} (y_i - m(x_i))^2 + \lambda \int (m''(x))^2 dx, \tag{9.32}$$

where $\lambda$ is a smoothing parameter. As elsewhere in this chapter squared error loss is
used. The first term alone leads to a very rough fit since then $\hat{m}(x_i) = y_i$. The second
term is introduced to penalize roughness. The cross-validation methods of Section
9.5.3 can be used to determine $\lambda$, with larger values of $\lambda$ leading to a smoother curve.

Härdle (1990, pp. 56-65) shows that $\hat{m}_\lambda(x)$ is a cubic polynomial between
successive $x$-values and that the estimator can be expressed as a local weighted average of
the $y$s and is asymptotically equivalent to a kernel estimator with a particular variable
kernel. In microeconometrics smoothing splines are used less frequently than the other
methods presented here. The approach can be adapted to other roughness penalties and
other loss functions.


9.6.4.  Series  Estimators 

Series estimators approximate a regression function by a weighted sum of $K$ functions
$z_1(x), \ldots, z_K(x)$,

$$\hat{m}_K(x) = \sum_{j=1}^{K} \hat{\beta}_j z_j(x), \tag{9.33}$$

where the coefficients $\hat{\beta}_1, \ldots, \hat{\beta}_K$ are simply obtained by OLS regression of $y$ on
$z_1(x), \ldots, z_K(x)$. The functions $z_1(x), \ldots, z_K(x)$ form a truncated series. Examples
include a $(K-1)$th-order polynomial approximation or power series with $z_j(x) =
x^{j-1}$, $j = 1, \ldots, K$; orthogonal and orthonormal polynomial variants (see Section
12.3.1); truncated Fourier series where the regressor is rescaled so that $x \in [0, 2\pi]$;
the Fourier flexible functional form of Gallant (1981), which is a truncated Fourier
series plus the terms $x$ and $x^2$; and regression splines that approximate the
regression function $m(x)$ by polynomial functions between a given number of knots that are
joined at the knots.

The approach differs from that in Section 9.4 as it is a global approximation
approach to estimation of $m(x)$, rather than a local approach to estimation of $m(x_0)$.
Nonetheless, $\hat{m}_K(x_0) \xrightarrow{p} m(x_0)$ if $K \to \infty$ at an appropriate rate as $N \to \infty$. From
Newey (1997) if $x$ is $k$ dimensional and $m(x)$ is $p$ times differentiable the mean
integrated squared error (see Section 9.5.3) is $\mathrm{MISE} = O(K^{-2p/k} + K/N)$, where the
first term reflects bias and the second term variance. Equating these gives the optimal
$K^* = N^{k/(2p+k)}$, so $K$ grows but at a slower rate than the sample size. The convergence
rate of $\hat{m}_{K^*}(x)$ equals the fastest possible rate of Stone (1980), given in Section 9.4.5.
Intuitively, series estimators may not be robust as outliers may have a global rather
than merely local impact on $\hat{m}(x)$, but this conjecture is not tested in typical examples
given in texts.
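
A power-series version of (9.33) reduces to OLS on powers of $x$. The Python sketch below is a minimal illustration under assumptions: it uses only the standard library, solves the normal equations by Gaussian elimination (adequate for a small $K$ on a regressor scaled to $[0, 1]$, though not numerically robust for large $K$), and is checked on a noise-free cubic. All function names are hypothetical.

```python
def power_basis(x, K):
    """Basis z_j(x) = x^(j-1), j = 1, ..., K."""
    return [x ** j for j in range(K)]

def solve_linear_system(A, b):
    """Solve A s = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * sol[j] for j in range(r + 1, n))
        sol[r] = (M[r][n] - tail) / M[r][r]
    return sol

def series_fit(x, y, K):
    """OLS coefficients from regressing y on z_1(x), ..., z_K(x)."""
    Z = [power_basis(xi, K) for xi in x]
    ZtZ = [[sum(z[a] * z[b] for z in Z) for b in range(K)] for a in range(K)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(K)]
    return solve_linear_system(ZtZ, Zty)

def series_predict(beta, x0):
    """Evaluate m_hat_K(x0) = sum_j beta_j z_j(x0)."""
    return sum(b * z for b, z in zip(beta, power_basis(x0, len(beta))))

# Noise-free cubic on [0, 1], so K = 4 terms should reproduce it
x = [i / 20.0 for i in range(21)]
y = [1.0 + 2.0 * xi - 3.0 * xi ** 2 + 0.5 * xi ** 3 for xi in x]

beta = series_fit(x, y, 4)
print(beta, series_predict(beta, 0.5))
```

Because every coefficient is estimated from the full sample, a single outlying $y_i$ moves the fitted curve everywhere, which is the global-impact concern raised above.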

Andrews  (1991)  and  Newey  (1997)  give  a  very  general  treatment  that  includes 
the  multivariate  case,  estimation  of  functionals  other  than  the  conditional  mean, 
and  extensions  to  semiparametric  models  where  series  methods  are  most  often 
used. 


9.7.  Semiparametric  Regression 

The  preceding  analysis  has  emphasized  regression  models  without  any  structure.  In 
microeconometrics  some  structure  is  usually  placed  on  the  regression  model. 

First, economic theory may place some structure, such as symmetry and homogeneity
restrictions, in a demand function. Such information may be incorporated into
nonparametric regression; see, for example, Matzkin (1994).

Second, and more frequently, econometric models include so many potential regressors
that the curse of dimensionality makes fully nonparametric analysis impractical.
Instead, it is common to estimate a semiparametric model that, loosely speaking,
combines a parametric component with a nonparametric component; see Powell (1994) for
a careful discussion of the term semiparametric.

There  are  many  different  semiparametric  models  and  myriad  methods  are  often 
available  to  consistently  estimate  these  models.  In  this  section  we  present  just  a  few 
leading  examples.  Applications  are  given  elsewhere  in  this  book,  including  the  binary 
outcome  models  and  censored  regression  models  given  in  Chapters  14  and  16. 

9.7.1.  Examples 

Table 9.2 presents several leading examples of semiparametric regression. The first
two examples, detailed in the following, generalize the linear model $x'\beta$ by adding
an unspecified component $\lambda(z)$ or by permitting an unspecified transformation $g(x'\beta)$,
whereas the third combines the first two. The next three models, used more in applied
statistics than econometrics, reduce the dimensionality by assuming additivity
or separability of the regressors but are otherwise nonparametric. We detail the
generalized additive model. Related to these are neural network models; see Kuan and
White (1994). The last example, also detailed in the following, is a flexible model of
the conditional variance. Care needs to be taken to ensure that semiparametric models
are identified. For example, see the discussion of single-index models. In addition to
estimation of $\beta$, interest also lies in the marginal effects such as $\partial \mathrm{E}[y|x,z]/\partial x$.

Table 9.2. Semiparametric Models: Leading Examples

Name                         Model                                                               Parametric   Nonparametric
Partially linear             $\mathrm{E}[y|x,z] = x'\beta + \lambda(z)$                          $\beta$      $\lambda(\cdot)$
Single index                 $\mathrm{E}[y|x] = g(x'\beta)$                                      $\beta$      $g(\cdot)$
Generalized partial linear   $\mathrm{E}[y|x,z] = g(x'\beta + \lambda(z))$                       $\beta$      $g(\cdot), \lambda(\cdot)$
Generalized additive         $\mathrm{E}[y|x] = c + \sum_{j=1}^{k} g_j(x_j)$                     -            $g_j(\cdot)$
Partial additive             $\mathrm{E}[y|x,z] = x'\beta + c + \sum_{j=1}^{k} g_j(z_j)$         $\beta$      $g_j(\cdot)$
Projection pursuit           $\mathrm{E}[y|x] = \sum_j g_j(x_j'\beta_j)$                         $\beta_j$    $g_j(\cdot)$
Heteroskedastic linear       $\mathrm{E}[y|x] = x'\beta$; $\mathrm{V}[y|x] = \sigma^2(x)$        $\beta$      $\sigma^2(\cdot)$


9.7.2.  Efficiency  of  Semiparametric  Estimators 

We  consider  loss  of  efficiency  in  estimating  by  semiparametric  rather  than  parametric 
methods,  ahead  of  presenting  results  for  several  leading  semiparametric  models. 

Our summary follows Robinson (1988b), who considers a semiparametric model
with parametric component denoted $\beta$ and nonparametric component denoted $G$ that
depends on infinitely many nuisance parameters. Examples of $G$ include the shape of
the distribution of a symmetrically distributed iid error and the single-index function
$g(\cdot)$ given in (9.37) in Section 9.7.4. The estimator is $\hat{\beta} = \hat{\beta}(\hat{G})$, where $\hat{G}$ is a
nonparametric estimator of $G$.

Ideally, the estimator $\hat{\beta}$ is adaptive, meaning that there is no efficiency loss in
having to estimate $G$ by nonparametric methods, so that

$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}[0, V_G],$$

where $V_G$ is the covariance matrix for any shape function $G$ in the particular class
being considered. Within the likelihood framework $V_G$ is the Cramer-Rao lower bound.
In the second-moment context $V_G$ is given by the Gauss-Markov theorem or a
generalization such as to GMM. A leading example of an adaptive estimator is estimation
with specified conditional mean function but with unknown functional form for
heteroskedasticity (see Section 9.7.6).

If the estimator $\hat{\beta}$ is not adaptive then the next best optimality property is for the
estimator to attain the semiparametric efficiency bound $V_G^*$, so that

$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}[0, V_G^*],$$

where $V_G^*$ is a generalization of the Cramer-Rao lower bound or its second-moment
analogue that provides the smallest variance matrix possible given the specified
semiparametric model. For an adaptive estimator $V_G^* = V_G$, but usually $V_G^*$ exceeds
$V_G$. Semiparametric efficiency bounds are introduced in Section 9.7.8. They can be
obtained only in some semiparametric settings, and even when they are known no
estimator may exist that attains the bound. An example that attains the bound is the
binary choice model estimator of Klein and Spady (1993) (see Section 14.7.4).

If the semiparametric efficiency bound is not attained or is not known, then the next
best property is that $\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}[0, V_G^{**}]$ for $V_G^{**}$ greater than $V_G^*$, which permits
the usual statistical inference. More generally, $\sqrt{N}(\hat{\beta} - \beta) = O_p(1)$ but is not
necessarily normally distributed. Finally, consistent but less than $\sqrt{N}$-consistent estimators
have the property that $N^r(\hat{\beta} - \beta) = O_p(1)$, where $r < 0.5$. Often asymptotic
normality cannot be established. This often arises when the parametric and nonparametric
parts are treated equally, so that maximization occurs jointly over $\beta$ and $G$. There are
many examples, particularly in discrete and truncated choice models.

Despite their potential inefficiency, semiparametric estimators are attractive because
they can retain consistency in settings where a fully parametric estimator is
inconsistent. Powell (1994, p. 2513) presents a table that summarizes the existence of
consistent and $\sqrt{N}$-consistent asymptotic normal estimators for a range of semiparametric
models.


9.7.3.  Partially  Linear  Model 

The partially linear model specifies the conditional mean to be the usual linear
regression function plus an unspecified nonlinear component, so

$$\mathrm{E}[y|x,z] = x'\beta + \lambda(z), \tag{9.34}$$

where the scalar function $\lambda(\cdot)$ is unspecified.

An example is the estimation of a demand function for electricity, where $z$ reflects
time-of-day or weather indicators such as temperature. A second example is the sample
selection model given in Section 16.5. Ignoring $\lambda(z)$ leads to inconsistent $\hat{\beta}$ owing to
omitted variables bias, unless $\mathrm{Cov}[x, \lambda(z)] = 0$. In applications interest may lie in $\beta$,
$\lambda(z)$, or both. Fully nonparametric estimation of $\mathrm{E}[y|x,z]$ is possible but leads to less
than $\sqrt{N}$-consistent estimation of $\beta$.


Robinson  Difference  Estimator 

Instead, Robinson (1988a) proposed the following method. The regression model
implies

$$y = x'\beta + \lambda(z) + u,$$

where the error $u = y - \mathrm{E}[y|x,z]$. This in turn implies

$$\mathrm{E}[y|z] = \mathrm{E}[x|z]'\beta + \lambda(z)$$

since $\mathrm{E}[u|x,z] = 0$ implies $\mathrm{E}[u|z] = 0$. Subtracting the two equations yields

$$y - \mathrm{E}[y|z] = (x - \mathrm{E}[x|z])'\beta + u. \tag{9.35}$$

The conditional moments in (9.35) are unknown, but they can be replaced by
nonparametric estimates.

Thus Robinson proposed the OLS regression estimation of

$$y_i - \hat{m}_{yi} = (x_i - \hat{m}_{xi})'\beta + v, \tag{9.36}$$

where $\hat{m}_{yi}$ and $\hat{m}_{xi}$ are predictions from nonparametric regression of, respectively, $y_i$
and $x_i$ on $z_i$. Given independence over $i$, the OLS estimator of $\beta$ in (9.36) is
$\sqrt{N}$-consistent and asymptotically normal with

$$\sqrt{N}(\hat{\beta}_{PL} - \beta) \xrightarrow{d} \mathcal{N}\!\left[0, \sigma^2 \left(\operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} (x_i - \mathrm{E}[x_i|z_i])(x_i - \mathrm{E}[x_i|z_i])'\right)^{-1}\right],$$

assuming $u_i$ is iid $[0, \sigma^2]$. Not specifying $\lambda(z)$ generally leads to an efficiency loss,
though there is no loss if $\mathrm{E}[x|z]$ is linear in $z$. To estimate $\mathrm{V}[\hat{\beta}_{PL}]$ simply replace
$(x_i - \mathrm{E}[x_i|z_i])$ by $(x_i - \hat{m}_{xi})$. The asymptotic result generalizes to heteroskedastic
errors, in which case one just uses the usual Eicker-White standard errors from the OLS
regression (9.36). Since $\lambda(z) = \mathrm{E}[y|z] - \mathrm{E}[x|z]'\beta$ it can be consistently estimated by
$\hat{\lambda}(z_i) = \hat{m}_{yi} - \hat{m}_{xi}'\hat{\beta}$.

A variety of nonparametric estimators $\hat{m}_{yi}$ and $\hat{m}_{xi}$ can be used. Robinson (1988a)
used kernel estimates that require convergence at rate no slower than $N^{-1/4}$ so that
oversmoothing or higher order kernels are needed if the dimension of $z$ is large; see
Pagan and Ullah (1999, p. 205). Note also that the kernel estimators may be trimmed
(see Section 9.5.3).
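The steps of the Robinson difference estimator can be sketched in a few lines of code. The following is a minimal illustration, not Robinson's own implementation: it assumes a scalar $z$, a Gaussian kernel with a fixed bandwidth `h`, no trimming, and simulated data. The function names `nw_fit` and `robinson_pl` are hypothetical.

```python
import numpy as np

def nw_fit(z, target, h):
    """Kernel (Nadaraya-Watson) regression of target on scalar z at the sample points."""
    u = (z[:, None] - z[None, :]) / h
    w = np.exp(-0.5 * u**2)                 # Gaussian kernel weights, shape (N, N)
    w /= w.sum(axis=1, keepdims=True)       # each row of weights sums to one
    return w @ target

def robinson_pl(y, x, z, h):
    """OLS of (y - m_y) on (x - m_x), as in (9.36), with Eicker-White variance."""
    my, mx = nw_fit(z, y, h), nw_fit(z, x, h)
    ey, ex = y - my, x - mx
    beta = np.linalg.solve(ex.T @ ex, ex.T @ ey)
    resid = ey - ex @ beta
    bread = np.linalg.inv(ex.T @ ex)
    meat = (ex * resid[:, None] ** 2).T @ ex
    var_beta = bread @ meat @ bread          # heteroskedasticity-robust V[beta_hat]
    lam = my - mx @ beta                     # estimate of lambda(z_i)
    return beta, var_beta, lam

# Simulated partially linear model (all values are illustrative assumptions)
rng = np.random.default_rng(0)
N = 500
z = rng.uniform(-2, 2, N)
x = np.column_stack([z**2 + rng.normal(size=N), rng.normal(size=N)])
y = x @ np.array([1.0, -0.5]) + np.sin(2 * z) + 0.5 * rng.normal(size=N)

beta_hat, V_hat, lam_hat = robinson_pl(y, x, z, h=0.2)
print(beta_hat, np.sqrt(np.diag(V_hat)))
```

In practice the bandwidth would be chosen with more care, and the kernel estimates might be trimmed as noted above.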


Other  Estimators 

Several other methods lead to $\sqrt{N}$-consistent estimates of $\beta$ in the partially linear
model. Speckman (1988) also used kernels. Engle et al. (1986) used a generalization
of the cubic smoothing spline estimator. Andrews (1991) presented regression of $y$ on
$x$ and a series approximation for $\lambda(z)$ given in Section 9.6.4. Yatchew (1997) presents
a simple differencing estimator.


9.7.4.  Single-Index  Models 

A single-index model specifies the conditional mean to be an unknown scalar function
of a linear combination of the regressors, with

$$\mathrm{E}[y|x] = g(x'\beta), \tag{9.37}$$

where the scalar function $g(\cdot)$ is unspecified. The advantages of single-index models
have been presented in Section 5.2.4. Here the function $g(\cdot)$ is obtained from the data,
whereas previous examples specified, for example, $\mathrm{E}[y|x] = \exp(x'\beta)$.


Identification 

Ichimura (1993) presents identification conditions for the single-index model. For
unknown function $g(\cdot)$, the parameter $\beta$ in the single-index model is identified only up to
location and scale. To see this note that for scalar $v$ the function $g^*(a + bv)$ can always be
expressed as $g(v)$, so the function $g^*(a + b x'\beta)$ is equivalent to $g(x'\beta)$. Additionally, $g(\cdot)$ must
be differentiable. In the simplest case all regressors are continuous. If instead some
regressors are discrete, then at least one regressor must be continuous and if $g(\cdot)$ is
monotonic then bounds can be obtained for $\beta$.


Average  Derivative  Estimator 

For continuous regressors, Stoker (1986) observed that if the conditional mean is single
index then the vector of average derivatives of the conditional mean determines $\beta$ up
to scale, since for $m(x_i) = g(x_i'\beta)$

$$\delta = \mathrm{E}\left[\frac{\partial m(x)}{\partial x}\right] = \mathrm{E}[g'(x'\beta)]\beta, \tag{9.38}$$

and $\mathrm{E}[g'(x'\beta)]$ is a scalar. Furthermore, by the generalized information matrix equality
given in Section 5.6.3, for any function $h(x)$, $\mathrm{E}[\partial h(x)/\partial x] = -\mathrm{E}[h(x)s(x)]$, where
$s(x) = \partial \ln f(x)/\partial x = f'(x)/f(x)$ and $f(x)$ is the density of $x$. Thus

$$\delta = -\mathrm{E}[m(x)s(x)] = -\mathrm{E}[\mathrm{E}[y|x]s(x)]. \tag{9.39}$$

It follows that $\delta$, and hence $\beta$ up to scale, can be estimated by the average derivative
(AD) estimator

$$\hat\delta_{AD} = -\frac{1}{N}\sum_{i=1}^{N} y_i \hat{s}(x_i), \tag{9.40}$$

where $\hat{s}(x_i) = \hat{f}'(x_i)/\hat{f}(x_i)$ can be obtained by kernel estimation of the density of $x_i$
and its first derivative. The estimator $\hat\delta$ is $\sqrt{N}$ consistent and its asymptotic normal
distribution was derived by Härdle and Stoker (1989). The function $g(\cdot)$ can be estimated
by nonparametric regression of $y_i$ on $x_i'\hat\delta$. Note that $\hat\delta_{AD}$ provides an estimate
of $\mathrm{E}[m'(x)]$ regardless of whether a single-index model is relevant.

A weakness of $\hat\delta_{AD}$ is that $\hat{s}(x_i)$ can be very large if $\hat{f}(x_i)$ is small. One possibility is
to trim when $\hat{f}(x_i)$ is small. Powell, Stock, and Stoker (1989) instead observed that the
result (9.38) extends to weighted derivatives with $\delta = \mathrm{E}[w(x)m'(x)]$. Especially convenient
is to choose $w(x) = f(x)$, which yields the density weighted average derivative
(DWAD) estimator

$$\hat\delta_{DWAD} = -\frac{1}{N}\sum_{i=1}^{N} y_i \hat{f}'(x_i), \tag{9.41}$$

which no longer divides by $\hat{f}(x_i)$. This yields a $\sqrt{N}$-consistent and asymptotically
normal estimate of $\beta$ up to scale. For example, if the first component of $\beta$ is normalized
to one then $\hat\beta_1 = 1$ and $\hat\beta_j = \hat\delta_j/\hat\delta_1$ for $j > 1$.
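A small numerical sketch may help fix ideas for the DWAD estimator (9.41). It is only an illustration under stated assumptions: a product Gaussian kernel with a common bandwidth `h`, a leave-one-out density derivative estimate, and simulated data with $g(\cdot) = \tanh(\cdot)$. The function name `dwad` is hypothetical.

```python
import numpy as np

def dwad(y, x, h):
    """Density weighted average derivative estimate, -(1/N) sum_i y_i fhat'(x_i)."""
    n, k = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h                         # u_ij = (x_i - x_j)/h
    kern = np.exp(-0.5 * (u**2).sum(axis=2)) / (2 * np.pi) ** (k / 2)
    np.fill_diagonal(kern, 0.0)                                     # leave observation i out
    # Gaussian kernel derivative is K'(u) = -u K(u); sum over j gives fhat'(x_i)
    fprime = -(u * kern[:, :, None]).sum(axis=1) / ((n - 1) * h ** (k + 1))
    return -(y[:, None] * fprime).mean(axis=0)

# Simulated single-index data with beta = (1, 2) (illustrative assumptions)
rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=(N, 2))
beta = np.array([1.0, 2.0])
y = np.tanh(x @ beta) + 0.3 * rng.normal(size=N)

delta_hat = dwad(y, x, h=0.5)
beta_hat = delta_hat / delta_hat[0]     # normalize the first component to one
print(beta_hat)
```

Only the ratio of the components is informative here, which is why the estimate is normalized before being reported.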

These methods require continuous regressors so that the derivatives exist. Horowitz
and Härdle (1996) present an extension to discrete regressors.


Semiparametric Least Squares

An alternative estimator of the single-index model was proposed by Ichimura (1993).
Begin by assuming that $g(\cdot)$ is known, in which case the WLS estimator of $\beta$
minimizes

$$S_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} w_i(x)(y_i - g(x_i'\beta))^2.$$

For unknown $g(\cdot)$ Ichimura proposed replacing $g(x_i'\beta)$ by a nonparametric estimate
$\hat{g}(x_i'\beta)$, leading to the weighted semiparametric least-squares (WSLS) estimator
$\hat\beta_{WSLS}$ that minimizes

$$Q_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} \pi(x_i) w_i(x)(y_i - \hat{g}(x_i'\beta))^2,$$

where $\pi(x_i)$ is a trimming function that drops observations if the kernel regression
estimate of the scalar $x_i'\beta$ is small, and $\hat{g}(x_i'\beta)$ is a leave-one-out kernel estimator
from regression of $y_i$ on $x_i'\beta$. This is a $\sqrt{N}$-consistent and asymptotically normal
estimate of $\beta$ up to scale that is generally more efficient than the DWAD estimator. For
heteroskedastic data the most efficient estimator is the analogue of feasible GLS that
uses estimated weight function $\hat{w}_i(x) = 1/\hat\sigma_i^2$, where $\hat\sigma_i^2$ is the kernel estimate given
in (9.43) of Section 9.7.6 and where $\hat{u}_i = y_i - \hat{g}(x_i'\hat\beta)$ and $\hat\beta$ is obtained from initial
minimization of $Q_N(\beta)$ with $w_i(x) = 1$.

The WSLS estimator is computed by iterative methods. Begin with an initial estimator
$\hat\beta^{(1)}$, such as the DWAD estimator with first component normalized to one. Form
the kernel estimate $\hat{g}(x_i'\hat\beta^{(1)})$ and hence $Q_N(\hat\beta^{(1)})$, perturb $\hat\beta^{(1)}$ to obtain the gradient
$g_N(\hat\beta^{(1)}) = \partial Q_N(\beta)/\partial\beta|_{\hat\beta^{(1)}}$ and hence an update $\hat\beta^{(2)} = \hat\beta^{(1)} + A_N g_N(\hat\beta^{(1)})$, and so
on. This estimator is considerably more difficult to calculate than the DWAD estimator,
especially as $Q_N(\beta)$ can be nonconvex and multimodal.


9.7.5.  Generalized  Additive  Models 

Generalized additive models specify $\mathrm{E}[y|x] = g_1(x_1) + \cdots + g_k(x_k)$, a specialization
of the fully nonparametric model $\mathrm{E}[y|x] = g(x_1, \ldots, x_k)$. This specialization results
in the estimated subfunctions $\hat{g}_j(x_j)$ converging at the rate for a one-dimensional
nonparametric regression rather than the slower rate of a $k$-dimensional nonparametric
regression.

A well-developed methodology exists for estimating such models (see Hastie and
Tibshirani, 1990). This is automated in some statistical packages such as S-Plus. Plots
of the estimated subfunctions $\hat{g}_j(x_j)$ on $x_j$ trace out the marginal effects of $x_j$ on
$\mathrm{E}[y|x]$, so the additive model can provide a useful tool for exploratory data analysis.
The model sees little use in microeconometrics in part because many applications
such as censoring, truncation, and discrete outcomes lead naturally to single-index and
partially linear models.


9.7.6. Heteroskedastic Linear Model


The heteroskedastic linear model specifies

$$\mathrm{E}[y|x] = x'\beta,$$

$$\mathrm{V}[y|x] = \sigma^2(x),$$

where the variance function $\sigma^2(\cdot)$ is unspecified.

The assumption that errors are heteroskedastic is the standard cross-section data
assumption in modern microeconometrics. One can obtain consistent but inefficient
estimates of $\beta$ by doing OLS and using the Eicker-White heteroskedasticity-consistent
estimate of the variance matrix of the OLS estimator. Cragg (1983) and Amemiya (1983)
proposed an IV estimator that is more efficient than OLS but still not fully efficient.
Feasible GLS provides a fully efficient second-moment estimator but is not attractive
as it requires specification of a functional form for $\sigma^2(x)$ such as $\sigma^2(x) = \exp(x'\gamma)$.

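Robinson (1987) proposed a variant of FGLS using a nonparametric estimator of
$\sigma_i^2 = \sigma^2(x_i)$. Then

$$\hat\beta_{HLM} = \left(\sum_{i=1}^{N}\hat\sigma_i^{-2}x_i x_i'\right)^{-1}\left(\sum_{i=1}^{N}\hat\sigma_i^{-2}x_i y_i\right), \tag{9.42}$$

where Robinson (1987) used a $k$-NN estimator of $\sigma_i^2$ with uniform weight, so

$$\hat\sigma_i^2 = \frac{1}{k}\sum_{j=1}^{N}\mathbf{1}(x_j \in N_k(x_i))\,\hat{u}_j^2, \tag{9.43}$$

where $\hat{u}_i = y_i - x_i'\hat\beta_{OLS}$ is the residual from first-stage OLS regression of $y_i$ on $x_i$ and
$N_k(x_i)$ is the set of $k$ observations of $x_j$ closest to $x_i$ in weighted Euclidean norm. Then

$$\sqrt{N}(\hat\beta_{HLM} - \beta) \xrightarrow{d} \mathcal{N}\left[0,\ \left(\operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sigma^{-2}(x_i)x_i x_i'\right)^{-1}\right],$$

assuming $u_i$ is iid $[0, \sigma^2(x_i)]$. This estimator is adaptive as it attains the Gauss-
Markov bound so is as efficient as the GLS estimator when $\sigma_i^2$ is known. The
variance matrix is consistently estimated by $(N^{-1}\sum_i \hat\sigma_i^{-2}x_i x_i')^{-1}$.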

In principle other nonparametric estimators of $\sigma^2(x_i)$ might be used, but Carroll
(1982) and others originally proposed use of a kernel estimator of $\sigma_i^2$ and found that
proof of efficiency was possible only under very restrictive assumptions on $x_i$. The
Robinson method extends to models with nonlinear mean function.
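The two-step procedure in (9.42) and (9.43) can be sketched as follows. This is an illustrative sketch rather than Robinson's implementation: it assumes the "weighted Euclidean norm" simply scales each regressor by its sample standard deviation, that the neighbor set includes observation $i$ itself, and that the data are simulated. The function name `knn_fgls` is hypothetical.

```python
import numpy as np

def knn_fgls(y, x, k):
    """FGLS with k-NN variance estimates (uniform weights), following (9.42)-(9.43)."""
    b_ols = np.linalg.lstsq(x, y, rcond=None)[0]
    u2 = (y - x @ b_ols) ** 2                      # squared first-stage OLS residuals
    scale = x.std(axis=0)
    scale[scale == 0] = 1.0                        # the constant column adds no distance
    xs = x / scale
    d2 = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2)
    nbrs = np.argsort(d2, axis=1)[:, :k]           # indices of the k nearest observations
    sig2 = u2[nbrs].mean(axis=1)                   # (9.43)
    w = 1.0 / sig2
    xtwx = (x * w[:, None]).T @ x
    beta = np.linalg.solve(xtwx, (x * w[:, None]).T @ y)   # (9.42)
    var_beta = np.linalg.inv(xtwx)                 # estimate of V[beta_hat]
    return beta, var_beta, sig2

# Simulated heteroskedastic linear model (illustrative assumptions)
rng = np.random.default_rng(0)
N = 500
x1 = rng.uniform(0, 3, N)
x = np.column_stack([np.ones(N), x1])
y = x @ np.array([1.0, 2.0]) + (0.2 + x1) * rng.normal(size=N)

beta_fgls, V_fgls, sig2_hat = knn_fgls(y, x, k=30)
print(beta_fgls, np.sqrt(np.diag(V_fgls)))
```

The choice $k = 30$ is arbitrary here; the number of neighbors plays the role of a bandwidth and would need to grow with $N$ at a suitable rate.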


9.7.7.  Seminonparametric  MLE 

Suppose $y_i$ is iid with specified density $f(y_i|x_i, \beta)$. In general, misspecification of the
density leads to inconsistent parameter estimates. Gallant and Nychka (1987) proposed
approximating the unknown true density by a power-series expansion around the density
$f(y|x, \beta)$. To ensure a positive density they actually use a squared power-series
expansion around $f(y|x, \beta)$, yielding

$$h_p(y|x, \beta, \alpha) = \frac{(p(y|\alpha))^2 f(y|x, \beta)}{\int (p(z|\alpha))^2 f(z|x, \beta)\,dz}, \tag{9.44}$$

where $p(y|\alpha)$ is a $p$th order polynomial in $y$, $\alpha$ is the vector of coefficients of the
polynomial, and division by the denominator ensures that probabilities integrate or sum to
one. The estimator of $\beta$ and $\alpha$ maximizes the log-likelihood $\sum_{i=1}^{N}\ln h_p(y_i|x_i, \beta, \alpha)$.
The approach generalizes immediately to multivariate $y_i$. The estimator is called the
seminonparametric maximum likelihood estimator because it is a nonparametric
estimator that can be estimated in the same way as a maximum likelihood estimator.
Gallant and Nychka (1987) showed that under fairly general conditions the estimator
yields consistent estimates of the density if the order $p$ of the polynomial increases
with sample size $N$ at an appropriate rate.

This result provides a strong basis for using (9.44) to obtain a class of flexible
distributions for any particular data. The method is particularly simple if the polynomial
series $p(y|\alpha)$ is the orthogonal or orthonormal polynomial series (see Section 12.3.1)
for the baseline density $f(y|x, \beta)$, as then the normalizing factor in the denominator
can be simply constructed. The order of the polynomial can be chosen using information
criteria, with measures that penalize model complexity more than AIC used in
practice. Regular ML statistical inference is possible if one ignores the data-dependent
selection of the polynomial order and assumes that the resulting density $h_p(y|x, \beta, \alpha)$
is correctly specified. An example of this approach for count data regression is given
in Cameron and Johansson (1997).
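As a simple illustration of (9.44), consider a second-order polynomial around a standard normal baseline with no regressors. The sketch below is an assumption-laden example, not the Gallant and Nychka implementation: the constant term of the polynomial is normalized to one, and the denominator is obtained in closed form from the normal moments $\mathrm{E}[z] = 0$, $\mathrm{E}[z^2] = 1$, $\mathrm{E}[z^3] = 0$, $\mathrm{E}[z^4] = 3$. The function name `snp_density` is hypothetical.

```python
import numpy as np

def snp_density(y, a1, a2):
    """Squared-polynomial density around a N(0,1) baseline, p(y|a) = 1 + a1*y + a2*y^2."""
    base = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)
    poly = 1.0 + a1 * y + a2 * y**2
    # E[p(z)^2] for z ~ N(0,1): 1 + a1^2 + 2*a2 + 3*a2^2 (always positive)
    norm = 1.0 + a1**2 + 2.0 * a2 + 3.0 * a2**2
    return poly**2 * base / norm

def snp_loglik(y, a1, a2):
    """Log-likelihood sum_i ln h_p(y_i | a) for this no-regressor special case."""
    return np.log(snp_density(y, a1, a2)).sum()

# Check numerically that the density is nonnegative and integrates to one
grid = np.linspace(-10, 10, 20001)
dens = snp_density(grid, 0.5, 0.2)
total = dens.sum() * (grid[1] - grid[0])
print(total)
```

With regressors the baseline would be $f(y|x, \beta)$, and $\alpha$ and $\beta$ would be chosen jointly to maximize the log-likelihood by a numerical optimizer.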


9.7.8.  Semiparametric  Efficiency  Bounds 

Semiparametric  efficiency  bounds  extend  efficiency  bounds  such  as  Cramer-Rao  or 
the  Gauss-Markov  theorem  to  cases  where  the  dgp  has  a  nonparametric  component. 
The  best  semiparametric  methods  achieve  this  efficiency  bound. 

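We use $\beta$ to denote parameters we wish to estimate, which may include variance
components such as $\sigma^2$, and $\eta$ to denote nuisance parameters. For simplicity we consider
ML estimation with a nonparametric component.

We begin with the fully parametric case. The MLE $(\hat\beta, \hat\eta)$ maximizes $\mathcal{L}(\beta, \eta) =
\ln L(\beta, \eta)$. Let $\theta = (\beta, \eta)$ and let $\mathcal{I}_{\theta\theta}$ be the information matrix defined in (5.43).
Then $\sqrt{N}(\hat\theta - \theta) \xrightarrow{d} \mathcal{N}[0, \mathcal{I}_{\theta\theta}^{-1}]$. For $\sqrt{N}(\hat\beta - \beta)$, partitioned inversion of $\mathcal{I}_{\theta\theta}$ leads
to

$$V^* = (\mathcal{I}_{\beta\beta} - \mathcal{I}_{\beta\eta}\mathcal{I}_{\eta\eta}^{-1}\mathcal{I}_{\eta\beta})^{-1} \tag{9.45}$$

as the efficiency bound for estimation of $\beta$ when $\eta$ is unknown. There is an efficiency
loss when $\eta$ is unknown, unless the information matrix is block diagonal so that $\mathcal{I}_{\beta\eta} =
0$ and the variance reduces to $\mathcal{I}_{\beta\beta}^{-1}$.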

Now consider extension to the nonparametric case. Suppose we have a parametric
submodel, say $\mathcal{L}_0(\beta)$, that involves $\beta$ alone. Consider the family of all possible
parametric models $\mathcal{L}(\beta, \eta)$ that nest $\mathcal{L}_0(\beta)$ for some value of $\eta$. The semiparametric
efficiency bound is the largest value of $V^*$ given in (9.45) over all possible parametric
models $\mathcal{L}(\beta, \eta)$, but this is difficult to obtain.

Simplification is possible by considering

$$\tilde{s}_\beta = s_\beta - \mathrm{E}[s_\beta|s_\eta],$$

where $s_\theta$ denotes the score $\partial\mathcal{L}/\partial\theta$, and $\tilde{s}_\beta$ is the score for $\beta$ after concentrating out
$\eta$. For finite-dimensional $\eta$ it can be shown that $\mathrm{E}[N^{-1}\tilde{s}_\beta\tilde{s}_\beta'] = (V^*)^{-1}$. Here $\eta$ is instead
infinite dimensional. Assume iid data and let $s_{\theta i}$ denote the $i$th component in the sum
that leads to the score $s_\theta$. Begun et al. (1983) define the tangent set to be the set of all
linear combinations of $s_{\eta i}$. When this tangent set is linear and closed the largest value
of $V^*$ in (9.45) equals

$$\Omega = (\operatorname{plim} N^{-1}\tilde{s}_\beta\tilde{s}_\beta')^{-1} = (\mathrm{E}[\tilde{s}_{\beta i}\tilde{s}_{\beta i}'])^{-1}.$$

The matrix $\Omega$ is then the semiparametric efficiency bound.

In applications one first obtains $s_\eta = \sum_i s_{\eta i}$. Then obtain $\mathrm{E}[s_{\beta i}|s_{\eta i}]$, which may
entail assumptions such as symmetry of errors that place restrictions on the class of
semiparametric models being considered. This yields $\tilde{s}_{\beta i}$, and hence $\Omega$. For more details
and applications see Newey (1990b), Pagan and Ullah (1999), and Severini and
Tripathi (2001).


9.8.  Derivations  of  Mean  and  Variance  of  Kernel  Estimators 

Nonparametric  estimation  entails  a  balance  between  smoothness  (variance)  and  bias 
(mean).  Here  we  derive  the  mean  and  variance  of  kernel  density  and  kernel  regression 
estimators.  The  derivations  follow  those  of  M.  J.  Lee  (1996). 


9.8.1. Mean and Variance of Kernel Density Estimator

The kernel density estimator is $\hat{f}(x_0) = (Nh)^{-1}\sum_{i=1}^{N} K((x_i - x_0)/h)$.
Since $x_i$ are iid each term in the summation has the same expected value and

$$\mathrm{E}[\hat{f}(x_0)] = \mathrm{E}\left[\frac{1}{h}K\left(\frac{x - x_0}{h}\right)\right] = \int \frac{1}{h}K\left(\frac{x - x_0}{h}\right)f(x)\,dx.$$

By change of variable to $z = (x - x_0)/h$ so that $x = x_0 + hz$ and $dx/dz = h$ we
obtain

$$\mathrm{E}[\hat{f}(x_0)] = \int K(z)f(x_0 + hz)\,dz.$$

A second-order Taylor series expansion of $f(x_0 + hz)$ around $f(x_0)$ yields

$$\mathrm{E}[\hat{f}(x_0)] = \int K(z)\left\{f(x_0) + f'(x_0)hz + \tfrac{1}{2}f''(x_0)(hz)^2\right\}dz$$

$$= f(x_0)\int K(z)\,dz + hf'(x_0)\int zK(z)\,dz + \tfrac{1}{2}h^2 f''(x_0)\int z^2K(z)\,dz.$$

Since the kernel $K(z)$ integrates to unity this simplifies to

$$\mathrm{E}[\hat{f}(x_0)] - f(x_0) = hf'(x_0)\int zK(z)\,dz + \tfrac{1}{2}h^2 f''(x_0)\int z^2K(z)\,dz.$$

If additionally the kernel satisfies $\int zK(z)\,dz = 0$, assumed in condition (ii) in Section
9.3.3, and second derivatives of $f$ are bounded, then the first term on the right-hand
side disappears, yielding $\mathrm{E}[\hat{f}(x_0)] - f(x_0) = b(x_0)$, where $b(x_0)$ is defined in (9.4).

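To obtain the variance of $\hat{f}(x_0)$, begin by noting that if $y_i$ are iid then $\mathrm{V}[\bar{y}] =
N^{-1}\mathrm{V}[y] = N^{-1}\mathrm{E}[y^2] - N^{-1}(\mathrm{E}[y])^2$. Thus

$$\mathrm{V}[\hat{f}(x_0)] = \frac{1}{N}\mathrm{E}\left[\left(\frac{1}{h}K\left(\frac{x - x_0}{h}\right)\right)^2\right] - \frac{1}{N}\left(\mathrm{E}\left[\frac{1}{h}K\left(\frac{x - x_0}{h}\right)\right]\right)^2.$$

Now by change of variables and first-order Taylor series expansion

$$\mathrm{E}\left[\left(\frac{1}{h}K\left(\frac{x - x_0}{h}\right)\right)^2\right] = \int \frac{1}{h}K(z)^2\{f(x_0) + f'(x_0)hz\}\,dz$$

$$= \frac{1}{h}f(x_0)\int K(z)^2\,dz + f'(x_0)\int zK(z)^2\,dz.$$

It follows that

$$\mathrm{V}[\hat{f}(x_0)] = \frac{1}{Nh}f(x_0)\int K(z)^2\,dz + \frac{1}{N}f'(x_0)\int zK(z)^2\,dz - \frac{1}{N}\left[f(x_0) + \frac{h^2}{2}f''(x_0)\int z^2K(z)\,dz\right]^2.$$

For $h \to 0$ and $N \to \infty$ this is dominated by the first term, leading to Equation (9.5).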
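The quality of these approximations can be checked numerically. The sketch below assumes a Gaussian kernel and standard normal data, both chosen only for illustration. It computes $\mathrm{E}[\hat{f}(x_0)] = \int K(z)f(x_0 + hz)\,dz$ by a Riemann sum, compares the resulting bias with the second-order approximation $\tfrac{1}{2}h^2 f''(x_0)\int z^2K(z)\,dz$, and evaluates the leading variance term.

```python
import numpy as np

def phi(x):
    """Standard normal pdf, used here both as kernel K and as true density f."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

x0, h, N = 0.0, 0.2, 100
z = np.linspace(-8, 8, 16001)
dz = z[1] - z[0]

# Mean of the kernel density estimate and its bias at x0
mean_fhat = np.sum(phi(z) * phi(x0 + h * z)) * dz
exact_bias = mean_fhat - phi(x0)

# Second-order approximation, using f''(x) = (x^2 - 1) phi(x) for the standard normal
f2 = (x0**2 - 1) * phi(x0)
approx_bias = 0.5 * h**2 * f2 * np.sum(z**2 * phi(z)) * dz

# Leading variance term (1/(N h)) f(x0) * integral of K(z)^2
lead_var = phi(x0) * np.sum(phi(z) ** 2) * dz / (N * h)

print(exact_bias, approx_bias, lead_var)
```

For this small bandwidth the two bias figures agree closely, and both are negative because the normal density is concave at $x_0 = 0$.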


9.8.2.  Distribution  of  Kernel  Regression  Estimator 


We obtain the distribution for regressors $x_i$ that are iid with density $f(x)$. From Section
9.5.1 the kernel estimator is a weighted average $\hat{m}(x_0) = \sum_i w_{i0,h}y_i$, where the kernel
weights $w_{i0,h}$ are given in (9.22). Since the weights sum to unity we have $\hat{m}(x_0) -
m(x_0) = \sum_i w_{i0,h}(y_i - m(x_0))$. Substituting (9.15) for $y_i$, and normalizing by $\sqrt{Nh}$
as in the kernel density estimator case we have

$$\sqrt{Nh}(\hat{m}(x_0) - m(x_0)) = \sqrt{Nh}\sum_{i=1}^{N} w_{i0,h}(m(x_i) - m(x_0) + \varepsilon_i). \tag{9.46}$$

One approach to obtaining the limit distribution of (9.46) is to take a second-order
Taylor series expansion of $m(x_i)$ around $x_0$. This approach is not always taken because
the weights $w_{i0,h}$ are complicated by the normalization that they sum to one (see
(9.22)).

Instead, we take the approach of Lee (1996, pp. 148-151) following Bierens (1987,
pp. 106-108). Note that the denominator of the weight function is the kernel estimate
of the density at $x_0$, since $\hat{f}(x_0) = (Nh)^{-1}\sum_i K((x_i - x_0)/h)$. Then (9.46) yields

$$\sqrt{Nh}(\hat{m}(x_0) - m(x_0)) = \frac{\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0) + \varepsilon_i)}{\hat{f}(x_0)}. \tag{9.47}$$

We apply the Transformation Theorem (Theorem A.12) to (9.47), using $\hat{f}(x_0) \xrightarrow{p}
f(x_0)$ for the denominator, while several steps are needed to obtain a limit normal
distribution for the numerator:

$$\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0) + \varepsilon_i) \tag{9.48}$$

$$= \frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0)) + \frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)\varepsilon_i.$$

Consider the first sum in (9.48); if a law of large numbers can be applied it converges
in probability to its mean

$$\mathrm{E}\left[\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0))\right] \tag{9.49}$$

$$= \frac{\sqrt{N}}{\sqrt{h}}\int K\left(\frac{x - x_0}{h}\right)(m(x) - m(x_0))f(x)\,dx$$

$$= \sqrt{Nh}\int K(z)(m(x_0 + hz) - m(x_0))f(x_0 + hz)\,dz$$

$$= \sqrt{Nh}\int K(z)\left(hzm'(x_0) + \tfrac{1}{2}h^2z^2m''(x_0)\right)(f(x_0) + hzf'(x_0))\,dz$$

$$= \sqrt{Nh}\left\{\int K(z)h^2z^2m'(x_0)f'(x_0)\,dz + \int K(z)\tfrac{1}{2}h^2z^2m''(x_0)f(x_0)\,dz\right\}$$

$$= \sqrt{Nh}\,h^2\left(m'(x_0)f'(x_0) + \tfrac{1}{2}m''(x_0)f(x_0)\right)\int z^2K(z)\,dz$$

$$= \sqrt{Nh}\,f(x_0)b(x_0),$$

where $b(x_0)$ is defined in (9.23). The first equality uses $x_i$ iid; the second equality is
change of variables to $z = (x - x_0)/h$; the third equality applies a second-order Taylor
series expansion to $m(x_0 + hz)$ and a first-order Taylor series expansion to $f(x_0 + hz)$;
the fourth equality follows because upon expanding the product to four terms, the two
terms given dominate the others (see, e.g., Lee, 1996, p. 150).

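Now consider the second sum in (9.48); the terms in the sum clearly have mean
zero, and the variance of each term, dropping subscript $i$, is

$$\mathrm{V}\left[K\left(\frac{x - x_0}{h}\right)\varepsilon\right] = \mathrm{E}\left[K^2\left(\frac{x - x_0}{h}\right)\varepsilon^2\right] \tag{9.50}$$

$$= \int K^2\left(\frac{x - x_0}{h}\right)\mathrm{V}[\varepsilon|x]f(x)\,dx$$

$$= h\int K^2(z)\mathrm{V}[\varepsilon|x_0 + hz]f(x_0 + hz)\,dz$$

$$= h\mathrm{V}[\varepsilon|x_0]f(x_0)\int K^2(z)\,dz,$$

by change of variables to $z = (x - x_0)/h$ with $dx = h\,dz$ in the third-line term, and
letting $h \to 0$ to get the last line. It follows upon applying a central limit theorem that

$$\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right)\varepsilon_i \xrightarrow{d} \mathcal{N}\left[0,\ \mathrm{V}[\varepsilon|x_0]f(x_0)\int K^2(z)\,dz\right]. \tag{9.51}$$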




Combining (9.49) and (9.51), we have that $\sqrt{Nh}(\hat{m}(x_0) - m(x_0))$ defined in (9.47)
converges to $1/f(x_0)$ times $\mathcal{N}[\sqrt{Nh}f(x_0)b(x_0),\ \mathrm{V}[\varepsilon|x_0]f(x_0)\int K^2(z)\,dz]$. Division
of the mean by $f(x_0)$ and the variance by $f(x_0)^2$ leads to the limit distribution given
in (9.24).


9.9.  Practical  Considerations 

All-purpose  regression  packages  increasingly  offer  adequate  methods  for  univariate 
nonparametric  density  estimation  and  regression.  The  programming  language  XPlore 
emphasizes  nonparametric  and  graphical  methods;  details  on  many  of  the  methods  are 
provided  at  its  Web  site. 

Nonparametric univariate density estimation is straightforward, using a kernel density
estimate based on a kernel such as the Gaussian or Epanechnikov. Easily computed
plug-in estimates for the bandwidth provide a useful starting point that one may then,
say, halve or double to see if there is an improvement.
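The halve-or-double strategy is easy to try. The sketch below is illustrative only: it assumes a Gaussian kernel, simulated standard normal data, and the familiar normal-reference rule of thumb $h = 1.06\,s\,N^{-1/5}$ as the plug-in starting value, which need not coincide with the plug-in formula preferred in a given package. The function name `kde` is hypothetical.

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian kernel density estimate evaluated at each point of grid."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.normal(size=200)

# Rule-of-thumb starting bandwidth, then half and double that value
h0 = 1.06 * data.std(ddof=1) * len(data) ** (-0.2)
grid = np.linspace(-8, 8, 1601)
fits = {h: kde(grid, data, h) for h in (h0 / 2, h0, 2 * h0)}

for h, f in fits.items():
    print(f"h = {h:.3f}: peak height = {f.max():.3f}")
```

Plotting the three fitted densities on the same axes is then a quick way to judge whether the starting bandwidth oversmooths or undersmooths.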

Nonparametric univariate regression is also straightforward, aside from bandwidth
selection. If relatively unbiased estimates of the regression function at the end points
are desired, then local linear regression or Lowess estimates are better than kernel
regression. Plug-in estimates for the bandwidth are more difficult to obtain and
cross-validation is instead used (see Section 9.5.3) along with eyeballing the scatterplot with
a fitted line. The degree of desired smoothness can vary with application. For nonparametric
multivariate regression such eyeballing may be impossible.

Semiparametric regression is more complicated. It can entail subtleties such as trimming
and undersmoothing the nonparametric component since typically estimation
of the parametric component involves averaging the nonparametric component. For
such purposes one generally uses specialized code written in languages such as Gauss,
Matlab, S-Plus, or XPlore. For the nonparametric estimation component considerable
computational savings can be obtained through use of fast computing algorithms such
as binning and updating; see, for example, Fan and Gijbels (1996) and Härdle and
Linton (1994).

All methods require at some stage specification of a bandwidth or window width.
Different choices lead to different estimates in finite samples, and the differences can
be quite large as illustrated in many of the figures in this chapter. By contrast, within
a fully parametric framework different researchers estimating the same model by ML
will all obtain the same parameter estimates. This indeterminacy is a drawback of
nonparametric methods, though the hope is that in semiparametric methods at least the
spillover effects to the parametric component of the model may be small.


9.10.  Bibliographic  Notes 

Nonparametric estimation is well presented in many statistics texts, including Fan and Gijbels
(1996). Ruppert, Wand, and Carroll (2003) present application of many semiparametric methods.
The econometrics books by Härdle (1990), M. J. Lee (1996), Horowitz (1998b), Pagan and
Ullah (1999), and Yatchew (2003) cover both nonparametric and semiparametric estimation.
Pagan and Ullah (1999) is particularly comprehensive. Yatchew (2003) is oriented to the applied
econometrician. He emphasizes the partial linear and single-index models and practical
aspects of their implementation such as computation of confidence intervals.

9.3 Key early references for kernel density estimation are Rosenblatt (1956) and Parzen (1962).
Silverman (1986) is a classic book on nonparametric density estimation.

9.4  A  quite  general  statement  of  optimal  rates  of  convergence  for  nonparametric  estimators  is 
given  in  Stone  (1980). 

9.5 Kernel regression estimation was proposed by Nadaraya (1964) and Watson (1964). A
very helpful and relatively simple survey of kernel and nearest-neighbors regression is by
Altman (1992). There are many other surveys in the statistics literature. Härdle (1990,
chapter 5) has a lengthy discussion of bandwidth choice and confidence intervals.

9.6  Many  approaches  to  nonparametric  local  regression  are  contained  in  Stone  (1977).  For 
series  estimators  see  Andrews  (1991)  and  Newey  (1997). 

9.6  For  semiparametric  efficiency  bounds  see  the  survey  by  Newey  (1990b)  and  the  more  recent 
paper  by  Severini  and  Tripathi  (2001).  An  early  econometrics  application  was  given  by 
Chamberlain  (1987). 

9.7 The econometrics literature focuses on semiparametric regression. Survey papers include
those by Powell (1994), Robinson (1988b), and, at a more introductory level, Yatchew
(1998). Additional references are given elsewhere in this book, notably in Sections 14.7,
15.11, 16.9, 20.5, and 23.8. The applied study by Bellemare, Melenberg, and Van Soest
(2002) illustrates several semiparametric methods.


- Exercises - 

9-1 Suppose we obtain a kernel density estimate using the uniform kernel (see
Table 9.1) with $h = 1$ and a sample of size $N = 100$. Suppose in fact the data
$x \sim \mathcal{N}[0, 1]$.

(a) Calculate the bias of the kernel density estimate at $x_0 = 1$ using (9.4).

(b) Is the bias large relative to the true value $\phi(1)$, where $\phi(\cdot)$ is the standard
normal pdf?

(c) Calculate the variance of the kernel density estimate at $x_0 = 1$ using (9.5).

(d) Which is making a bigger contribution to MSE at $x_0 = 1$, variance or bias
squared?

(e) Using results in Section 9.3.7, give a 95% confidence interval for the density
at $x_0 = 1$ based on the kernel density estimate $\hat{f}(1)$.

(f) For this example, what is the optimal bandwidth $h^*$ from (9.10)?

9-2 Suppose we obtain a kernel regression estimate using a uniform kernel (see
Table 9.1) with $h = 1$ and a sample of size $N = 100$. Suppose in fact the data
$x \sim \mathcal{N}[0, 1]$ and the conditional mean function is $m(x) = x^2$.

(a) Calculate the bias of the kernel regression estimate at $x_0 = 1$ using (9.23).

(b) Is the bias large relative to the true value $m(1) = 1$?

(c) Calculate the variance of the kernel regression estimate at $x_0 = 1$ using
(9.24).

(d) Which is making a bigger contribution to MSE at $x_0 = 1$, variance or bias
squared?

(e) Using results in Section 9.5.4, give a 95% confidence interval for $\mathrm{E}[y|x_0 = 1]$
based on the kernel regression estimate $\hat{m}(1)$.

9-3  This  question  assumes  access  to  a  nonparametric  density  estimation  program. 
Use  the  Section  4.6.4  data  on  health  expenditure.  Use  a  kernel  density  estimate 
with  Gaussian  kernel  (if  available). 

(a)  Obtain  the  kernel  density  estimate  for  health  expenditure,  choosing  a  suitable 
bandwidth  by  eyeballing  and  trial  and  error.  State  the  bandwidth  chosen. 

(b)  Obtain  the  kernel  density  estimate  for  natural  logarithm  of  health  expenditure, 
choosing  a  suitable  bandwidth  by  eyeballing  and  trial  and  error.  State  the 
bandwidth  chosen. 

(c)  Compare  your  answer  in  part  (b)  to  an  appropriate  histogram. 

(d)  If  possible  superimpose  a  fitted  normal  density  on  the  same  graph  as  the 
kernel  density  estimate  from  part  (b).  Do  health  expenditures  appear  to  be 
log-normally  distributed? 

9-4 This question assumes access to a kernel regression program or other
nonparametric smoother. Use the complete sample of the Section 4.6.4 data
on natural logarithm of health expenditure (y) and natural logarithm of total
expenditure (x).

(a) Obtain the kernel regression estimate for health expenditure, choosing
a good bandwidth by eyeballing and trial and error. State the bandwidth
chosen.

(b)  Given  part  (a),  does  health  appear  to  be  a  normal  good? 

(c)  Given  part  (a),  does  health  appear  to  be  a  superior  good? 

(d) Compare your nonparametric estimates with predictions from linear and
quadratic regression.


Numerical Optimization


10.1.  Introduction 

Theoretical results on consistency and the asymptotic distribution of an estimator defined
as the solution to an optimization problem were presented in Chapters 5 and 6.
The more practical issue of how to numerically obtain the optimum, that is, how to
calculate the parameter estimates, when there is no explicit formula for the estimator,
comprises the subject of this chapter.

For the applied researcher estimation of standard nonlinear models, such as logit,
probit, Tobit, proportional hazards, and Poisson, is seemingly no different from estimation
of an OLS model. A statistical package obtains the estimates and reports
coefficients, standard errors, $t$-statistics, and $p$-values. Computational problems generally
only arise for the same reasons that OLS may fail, such as multicollinearity or
incorrect data input.

Estimation of less standard nonlinear models, including minor variants of a standard
model, may require writing a program. This may be possible within a standard statistical
package. If not, then a programming language is used. Especially in the latter case
a knowledge of optimization methods becomes necessary.

General considerations for optimization are presented in Section 10.2. Various iterative
methods, including the Newton-Raphson and Gauss-Newton gradient methods,
are described in Section 10.3. Practical issues, including some common pitfalls, are
presented in Section 10.4. These issues become especially relevant when the optimization
method fails to produce parameter estimates.


10.2.  General  Considerations 

Microeconometric analysis is often based on an estimator $\hat\theta$ that maximizes a stochastic
objective function $Q_N(\theta)$, where usually $\hat\theta$ solves the first-order conditions
$\partial Q_N(\theta)/\partial\theta = 0$. A minimization problem can be recast as a maximization by multiplying
the objective function by minus one. In nonlinear applications there will
generally be no explicit solution to the first-order conditions, a nonlinear system of
$q$ equations in the $q$ unknowns $\theta$.

A grid search procedure is usually impractical and iterative methods, usually gradient
methods, are employed.


10.2.1.  Grid  Search 

In grid search methods, the procedure is to select many values of $\theta$ along a grid,
compute $Q_N(\theta)$ for each of these values, and choose as the estimator $\hat\theta$ the value that
provides the largest (locally or globally depending on the application) value of $Q_N(\theta)$.

If a fine enough grid can be chosen this method will always work. It is generally
impractical, however, to choose a fine enough grid without further restrictions. For
example, if 10 parameters need to be estimated and the grid evaluates each parameter
at just 10 points, a very sparse grid, there are $10^{10}$ or 10 billion evaluations.

Grid  search  methods  are  nonetheless  useful  in  applications  where  the  grid  search 
need  only  be  performed  among  a  subset  of  the  parameters.  They  also  permit  viewing 
the  response  surface  to  verify  that  in  using  iterative  methods  one  need  not  be  concerned 
about  multiple  maxima.  For  example,  many  time-series  packages  do  this  for  the  scalar 
AR(1)  coefficient  in  a  regression  model  with  AR(1)  error.  A  second  example  is  doing  a 
grid  search  for  the  scalar  inclusive  parameter  in  a  nested  logit  model  (see  Section  15.6). 
Of  course,  grid  search  methods  may  have  to  be  used  if  nothing  else  works. 
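
As a rough illustration of a grid search over a single parameter, the following Python sketch evaluates an objective on an evenly spaced grid and keeps the maximizer. The function name `grid_search`, the grid bounds, and the objective (an intercept-only exponential NLS objective with sample mean 2) are assumptions chosen for illustration, not anything prescribed in the text.

```python
import math

def grid_search(objective, lower, upper, num_points):
    """Return (best_theta, best_value) maximizing objective over an evenly spaced grid."""
    step = (upper - lower) / (num_points - 1)
    best_theta, best_value = None, -math.inf
    for k in range(num_points):
        theta = lower + k * step
        value = objective(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value

def q(beta, ybar=2.0):
    # Hypothetical objective: -(ybar - exp(beta))^2 / 2, maximized at beta = ln(ybar)
    return -0.5 * (ybar - math.exp(beta)) ** 2

beta_hat, q_hat = grid_search(q, -2.0, 2.0, 4001)   # grid spacing of 0.001
print(beta_hat, math.log(2.0))
```

With one parameter, 4,001 evaluations locate the maximizer to within the grid spacing. The same spacing in 10 dimensions would be far beyond reach, which is the point made above.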


10.2.2.  Iterative  Methods 

Virtually all microeconometric applications instead use iterative methods. These
update the current estimate of $\theta$ using a particular rule. Given an $s$th-round estimate $\hat{\theta}_s$
the iterative method provides a rule that yields a new estimate $\hat{\theta}_{s+1}$, where $\hat{\theta}_s$ denotes
the $s$th-round estimate rather than the $s$th component of $\hat{\theta}$. Ideally, the new estimate is
a move toward the maximum, so that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$, but in general this cannot
be guaranteed. Also, gradient methods may find a local maximum but not necessarily
the global maximum.


10.2.3.  Gradient  Methods 


Most iterative methods are gradient methods that change $\hat{\theta}_s$ in a direction determined
by the gradient. The update formula is a matrix weighted average of the gradient

$$\hat{\theta}_{s+1} = \hat{\theta}_s + A_s g_s, \quad s = 1, \ldots, S, \tag{10.1}$$

where $A_s$ is a $q \times q$ matrix that depends on $\hat{\theta}_s$, and

$$g_s = \left. \frac{\partial Q_N(\theta)}{\partial \theta} \right|_{\hat{\theta}_s} \tag{10.2}$$

is the $q \times 1$ gradient vector evaluated at $\hat{\theta}_s$. Different gradient methods use different
matrices $A_s$, detailed in Section 10.3. A leading example is the Newton-Raphson
method, which sets $A_s = -H_s^{-1}$, where $H_s$ is the Hessian matrix defined later in (10.6).




Note that in this chapter $A$ and $g$ denote quantities that differ from those in other chapters.
Here $A$ is not the matrix that appears in the limit distribution of an estimator and
$g$ is not the conditional mean of $y$ in the nonlinear regression model.

Ideally, the matrix $A_s$ is positive definite for a maximum (or negative definite for
a minimum), as then it is likely that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$. This follows from the first-order
Taylor series expansion $Q_N(\hat{\theta}_{s+1}) = Q_N(\hat{\theta}_s) + g_s'(\hat{\theta}_{s+1} - \hat{\theta}_s) + R$, where $R$ is
a remainder. Substituting in the update formula (10.1) yields

$$Q_N(\hat{\theta}_{s+1}) - Q_N(\hat{\theta}_s) = g_s' A_s g_s + R,$$

which is greater than zero if $A_s$ is positive definite and the remainder $R$ is sufficiently
small, since for a positive definite square matrix $A$ the quadratic form $x'Ax > 0$ for
all column vectors $x \neq 0$. Too small a value of $A_s$ leads to an iterative procedure that
is too slow; however, too large a value of $A_s$ may lead to overshooting, even if $A_s$ is
positive definite, as the remainder term cannot be ignored for large changes.

A common modification to gradient methods is to add a step-size adjustment to
prevent possible overshooting or undershooting, so

$$\hat{\theta}_{s+1} = \hat{\theta}_s + \lambda_s A_s g_s, \tag{10.3}$$

where the step size $\lambda_s$ is a scalar chosen to maximize $Q_N(\hat{\theta}_{s+1})$. At the $s$th round
first calculate $A_s g_s$, which may involve considerable computation. Then calculate
$Q_N(\theta)$, where $\theta = \hat{\theta}_s + \lambda A_s g_s$, for a range of values of $\lambda$ (called a line search),
and choose $\lambda_s$ as that $\lambda$ that maximizes $Q_N(\theta)$. Considerable computational savings
are possible because the gradient and $A_s$ are not recomputed along the line search.
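
The step-size adjustment can be sketched as follows for a scalar parameter. The objective, the choice $A_s = e^{-2\beta}$, and the grid of candidate step sizes are assumptions made for illustration. The point of the sketch is only that the direction $A_s g_s$ is computed once and then reused for every candidate $\lambda$.

```python
import math

YBAR = 2.0  # hypothetical sample mean

def objective(beta):
    # Q(beta) up to an additive constant: ybar*exp(beta) - exp(2*beta)/2
    return YBAR * math.exp(beta) - 0.5 * math.exp(2.0 * beta)

def gradient(beta):
    return (YBAR - math.exp(beta)) * math.exp(beta)

def line_search_step(beta, lambdas):
    """One gradient-method round with a line search over candidate step sizes."""
    direction = math.exp(-2.0 * beta) * gradient(beta)   # A_s g_s, computed once
    best_lam = max(lambdas, key=lambda lam: objective(beta + lam * direction))
    return beta + best_lam * direction, best_lam

lambdas = [k / 20 for k in range(1, 41)]   # 0.05, 0.10, ..., 2.00
beta_new, lam = line_search_step(0.0, lambdas)
print(beta_new, lam)
```

Starting from 0, the unadjusted step ($\lambda = 1$) would move to 1.0, past the maximizer $\ln 2$; the line search instead selects a shorter step that lands closer to it.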

A second modification is sometimes made when the matrix $A_s$ is defined as the
inverse of a matrix $B_s$, say, so that $A_s = B_s^{-1}$. Then if $B_s$ is close to singular a matrix
of constants, say $C$, is added or subtracted to permit inversion, so $A_s = (B_s + C)^{-1}$.
Similar adjustments can be made if $A_s$ is not positive definite. Further discussion of
computation of $A_s$ is given in Section 10.3.

Gradient  methods  are  most  likely  to  converge  to  the  local  maximum  nearest  the 
starting  values.  If  the  objective  function  has  multiple  local  optima  then  a  range  of 
starting  values  should  be  used  to  increase  the  chance  of  finding  the  global  maximum. 


10.2.4.  Gradient  Method  Example 

Consider calculation of the NLS estimator in the exponential regression model when
the only regressor is the intercept. Then $E[y] = e^{\beta}$ and a little algebra yields the gradient
$g = N^{-1}\sum_i (y_i - e^{\beta})e^{\beta} = (\bar{y} - e^{\beta})e^{\beta}$. Suppose in (10.1) we use $A_s = e^{-2\hat{\beta}_s}$,
which corresponds to the method of scoring variant of the Newton-Raphson algorithm
presented later in Section 10.3.2. The iterative method simplifies to
$\hat{\beta}_{s+1} = \hat{\beta}_s + (\bar{y} - e^{\hat{\beta}_s})/e^{\hat{\beta}_s}$.

As an example of the performance of this algorithm, suppose $\bar{y} = 2$ and the starting
value is $\hat{\beta}_1 = 0$. This leads to the iterations listed in Table 10.1. There is very rapid
convergence to the NLS estimate, which for this simple example can be analytically
obtained as $\hat{\beta} = \ln \bar{y} = \ln 2 = 0.693147$. The objective function increases throughout,
a consequence of use of the NR algorithm with globally concave objective function.
Note that overshooting occurs in the first iteration, from $\hat{\beta}_1 = 0.0$ to $\hat{\beta}_2 = 1.0$, greater
than $\hat{\beta} = 0.693$.

Table 10.1. Gradient Method Results

Round   Estimate          Gradient     Objective Function
s       $\hat{\beta}_s$   $g_s$        $Q_N(\hat{\beta}_s) = -\frac{1}{2N}\sum_i (y_i - e^{\hat{\beta}_s})^2$
1       0.000000           1.000000    $1.500000 - \sum_i y_i^2/2N$
2       1.000000          -1.952492    $1.742036 - \sum_i y_i^2/2N$
3       0.735758          -0.181711    $1.996210 - \sum_i y_i^2/2N$
4       0.694042          -0.003585    $1.999998 - \sum_i y_i^2/2N$
5       0.693147          -0.000002    $2.000000 - \sum_i y_i^2/2N$

Quick  convergence  usually  occurs  when  the  NR  algorithm  is  used  and  the  objective 
function  is  globally  concave.  The  challenge  in  practice  is  that  nonstandard  nonlinear 
models  often  have  objective  functions  that  are  not  globally  concave. 
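
The iterations of Table 10.1 can be reproduced with a few lines of Python. This is a minimal sketch: the function name `scoring_iterations` is hypothetical, and the objective is reported net of the constant $\sum_i y_i^2/2N$, which cannot be computed from $\bar{y}$ alone.

```python
import math

def scoring_iterations(ybar, beta_start, rounds):
    """Return rows (s, beta_s, g_s, Q_s) for the intercept-only exponential model."""
    rows = []
    beta = beta_start
    for s in range(1, rounds + 1):
        mu = math.exp(beta)
        grad = (ybar - mu) * mu              # gradient g_s
        q_val = ybar * mu - 0.5 * mu * mu    # objective plus the constant sum(y_i^2)/(2N)
        rows.append((s, beta, grad, q_val))
        beta = beta + (ybar - mu) / mu       # update with A_s = exp(-2 * beta_s)
    return rows

rows = scoring_iterations(ybar=2.0, beta_start=0.0, rounds=5)
for s, beta, grad, q_val in rows:
    print(f"{s}  {beta:.6f}  {grad: .6f}  {q_val:.6f}")
```

The printed values should agree with the table up to rounding in the last digit.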


10.2.5.  Method  of  Moments  and  GMM  Estimators 


For m-estimators $Q_N(\theta) = N^{-1}\sum_i q_i(\theta)$ and the gradient
$g(\theta) = N^{-1}\sum_i \partial q_i(\theta)/\partial\theta$.

For GMM estimators $Q_N(\theta)$ is a quadratic form (see Section 6.3.2) and the gradient
takes the more complicated form

$$g(\theta) = \left[ N^{-1}\sum_i \partial h_i(\theta)'/\partial\theta \right] \times W_N \times \left[ N^{-1}\sum_i h_i(\theta) \right].$$

Some gradient methods can then no longer be used as they work only for averages.
Methods given in Section 10.3 that can still be used include Newton-Raphson, steepest
ascent, DFP, BFGS, and simulated annealing.

Method of moments and estimating equations estimators are defined as solving a
system of equations, but they can be converted to a numerical optimization problem
similar to GMM. The estimator $\hat{\theta}$ that solves the $q$ equations $N^{-1}\sum_i h_i(\theta) = 0$ can
be obtained by minimizing $Q_N(\theta) = [N^{-1}\sum_i h_i(\theta)]'[N^{-1}\sum_i h_i(\theta)]$.
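
As a small worked illustration of this conversion (the single moment condition below is chosen here for illustration and is not taken from the text), consider the intercept-only exponential mean with $h_i(\beta) = y_i - e^{\beta}$, so that $q = 1$:

$$N^{-1}\sum_i h_i(\beta) = \bar{y} - e^{\beta}, \qquad Q_N(\beta) = (\bar{y} - e^{\beta})^2 \geq 0 .$$

The objective attains its minimum value of zero exactly where the moment equation holds, that is, at $\hat{\beta} = \ln \bar{y}$. With $\bar{y} = 2$ this gives $\hat{\beta} = \ln 2 \approx 0.693$.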


10.2.6.  Convergence  Criteria 

Iterations continue until there is virtually no change. Programs ideally stop when all
of the following occur: (1) a small relative change occurs in the objective function
$Q_N(\hat{\theta}_s)$; (2) a small change of the gradient vector $g_s$ occurs relative to the Hessian;
and (3) a small relative change occurs in the parameter estimates $\hat{\theta}_s$. Statistical packages
typically choose default threshold values for these three changes, called convergence
criteria. These values can often be changed by the user. A conservative value
is $10^{-6}$.
In addition there is usually a maximum number of iterations that will be
attempted. If this maximum is reached estimates are typically reported. The estimates
should not be used, however, unless convergence has been achieved.

If  convergence  is  achieved  then  a  local  maximum  has  been  obtained.  However,  there 
is  no  guarantee  that  the  global  maximum  is  obtained,  unless  the  objective  function  is 
globally  concave. 
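
A stopping rule that combines the three criteria might be sketched as below. The function name `has_converged`, the specific scaled-gradient measure $|g'H^{-1}g|$, and the use of a single tolerance for all three checks are assumptions; packages differ in the exact definitions they use.

```python
def has_converged(q_old, q_new, theta_old, theta_new, grad, hess_inv, tol=1e-6):
    """Check three stopping rules; the exact measures used here are assumptions."""
    # (1) small relative change in the objective function
    rel_q = abs(q_new - q_old) / max(abs(q_old), 1e-12)
    # (2) small gradient relative to the Hessian, measured by |g' H^{-1} g|
    n = len(grad)
    scaled_grad = abs(sum(grad[i] * hess_inv[i][j] * grad[j]
                          for i in range(n) for j in range(n)))
    # (3) small relative change in every parameter estimate
    rel_theta = max(abs(new - old) / max(abs(old), 1e-12)
                    for old, new in zip(theta_old, theta_new))
    return rel_q < tol and scaled_grad < tol and rel_theta < tol

# Late round of an iteration: all three changes are tiny
done = has_converged(2.0, 2.0000000001, [0.6931], [0.69310000001], [1e-6], [[-0.25]])
# Early round: the objective and the estimate are still moving
early = has_converged(1.5, 1.74, [0.0], [1.0], [1.0], [[-0.25]])
print(done, early)
```

Requiring all three conditions at once guards against stopping on a flat stretch of the objective where the estimates are still changing.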


10.2.7.  Starting  Values 

The number of iterations is considerably reduced if the initial starting values $\hat{\theta}_1$ are
close to $\hat{\theta}$. Consistent parameter estimates are obviously good estimates to use as starting
values. A poor choice of starting values can lead to failure of iterative methods. In
particular, for some estimators and gradient methods it may not be possible to compute
$g_1$ or $A_1$ if the starting value is $\hat{\theta}_1 = 0$.

If  the  objective  function  is  not  globally  concave  it  is  good  practice  to  use  a  range  of 
starting  values  to  increase  the  chance  of  obtaining  a  global  maximum. 
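
The practice of trying several starting values can be sketched as follows. The objective, which has a local maximum near -0.96 and a higher maximum near 1.04, the fixed-step gradient ascent used as the local optimizer, and the four starting values are all assumptions made for illustration.

```python
def q(theta):
    # Hypothetical objective that is not globally concave
    return -(theta ** 2 - 1.0) ** 2 + 0.3 * theta

def grad(theta):
    return -4.0 * theta * (theta ** 2 - 1.0) + 0.3

def local_ascent(theta, step=0.05, iterations=2000):
    """Simple gradient ascent with a fixed step size; finds a nearby local maximum."""
    for _ in range(iterations):
        theta += step * grad(theta)
    return theta

starts = [-1.5, -0.5, 0.5, 1.5]
local_maxima = [local_ascent(t) for t in starts]
best_theta = max(local_maxima, key=q)   # keep the run with the highest objective
print(local_maxima, best_theta)
```

Runs started at negative values settle on the inferior local maximum, so a single unlucky starting value would have missed the higher one.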

10.2.8.  Numerical  and  Analytical  Derivatives 

Any  gradient  method  by  definition  uses  derivatives  of  the  objective  function.  Either 
numerical  derivatives  or  analytical  derivatives  may  be  used. 

Numerical derivatives are computed using

$$\frac{\Delta Q_N(\hat{\theta}_s)}{\Delta \theta_j} = \frac{Q_N(\hat{\theta}_s + h e_j) - Q_N(\hat{\theta}_s - h e_j)}{2h}, \quad j = 1, \ldots, q, \tag{10.4}$$

where $h$ is small and $e_j = (0 \ldots 0 \; 1 \; 0 \ldots 0)'$ is a vector with unity in the $j$th row and
zeros elsewhere.

In theory $h$ should be very small, as formally $\partial Q_N(\theta)/\partial\theta_j$ equals the limit of
$\Delta Q_N(\theta)/\Delta\theta_j$ as $h \to 0$. In practice too small a value of $h$ leads to inaccuracy owing
to rounding error. For this reason calculations using numerical derivatives should
always be done in double precision or quadruple precision rather than single precision.
Although a program may use a default value such as $h = 10^{-6}$, other values will be
better for any particular problem. For example, a smaller value of $h$ is appropriate if the
dependent variable $y$ in NLS regression is measured in thousands of dollars rather than
dollars (with regressors not rescaled), since then $\hat{\theta}$ will be one-thousandth the size.

A drawback of using numerical derivatives is that these derivatives have to be computed
many times: for each of the $q$ parameters, for each of the $N$ observations, and
for each of the $S$ iterations. This requires $2qNS$ evaluations of the objective function,
where each evaluation itself may be computationally burdensome.
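
A two-sided numerical derivative can be coded in a few lines. The sketch below is illustrative only: the helper name `numerical_gradient`, the default $h = 10^{-6}$, and the test objective are assumptions. It compares the numerical derivative with the known analytical one at a single point.

```python
import math

def numerical_gradient(objective, theta, h=1e-6):
    """Central-difference approximation to the gradient, one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += h
        down = list(theta)
        down[j] -= h
        grad.append((objective(up) - objective(down)) / (2.0 * h))
    return grad

def q(theta, ybar=2.0):
    # Hypothetical objective: ybar*exp(beta) - exp(2*beta)/2
    beta = theta[0]
    return ybar * math.exp(beta) - 0.5 * math.exp(2.0 * beta)

approx = numerical_gradient(q, [1.0])[0]
exact = (2.0 - math.e) * math.e        # analytical derivative at beta = 1
print(approx, exact)
```

Each call costs two evaluations of the objective per parameter, which is the source of the factor 2 in the evaluation count above.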

An  alternative  is  to  use  analytical  derivatives.  These  will  be  more  accurate  than 
numerical  derivatives  and  may  be  much  quicker  to  compute,  especially  if  the  analytical 
derivatives  are  simpler  than  the  objective  function  itself.  Moreover,  only  qNS  function 
evaluations  are  needed. 

For methods that additionally require calculation of second derivatives to form $A_s$
there is even greater benefit to providing analytical derivatives. Even if just analytical
first derivatives are given, the second derivative may then be more quickly and
accurately obtained as the numerical first derivative of the analytical first derivative.
Statistical packages often provide the user with the option of providing analytical first
and second derivatives.

Numerical  derivatives  have  the  advantage  of  requiring  no  coding  beyond  providing 
the  objective  function.  This  saves  coding  time  and  eliminates  one  possible  source  of 
user  error,  though  some  packages  have  the  ability  to  take  analytical  derivatives. 

If computational time is a factor or if there is concern about accuracy of calculations,
however, it is worthwhile going to the trouble of providing analytical derivatives.
It is still good practice then to check that the analytical derivatives have been correctly
coded by obtaining parameter estimates using numerical derivatives, with starting values
the estimates obtained using analytical derivatives.

10.2.9.  Nongradient  Methods 

Gradient methods presume the objective function is sufficiently smooth to ensure existence
of the gradient. For some examples, notably least absolute deviations (LAD),
quantile regression, and maximum score estimation, there is no gradient and alternative
iterative methods are used.

For example, for LAD the objective function $Q_N(\beta) = N^{-1}\sum_i |y_i - x_i'\beta|$ has no
derivative and linear programming methods are used in place of gradient methods.
Such examples are sufficiently rare in microeconometrics that we focus almost exclusively
on gradient methods.

For objective functions that are difficult to maximize, particularly because of multiple
local optima, use can be made of nongradient methods such as simulated annealing
(presented in Section 10.3.8) and genetic algorithms (see Dorsey and Mayer, 1995).


10.3.  Specific  Methods 

The leading method for maximizing a globally concave objective function is the
Newton-Raphson iterative method. The other methods, such as steepest descent and DFP, are
usually learnt and employed when the Newton-Raphson method fails. Another common
method is the Gauss-Newton method for the NLS estimator. This method is
not as universal as the Newton-Raphson method, as it is applicable only to least-squares
problems, and it can be obtained as a minor adaptation of the Newton-Raphson
method. These various methods are designed to obtain a local optimum given some
starting values for the parameters.

This section also presents the expectation maximization method, which is particularly
useful in missing data problems, and the method of simulated annealing, which is an example of
a nongradient method and is more likely to yield a global rather than local maximum.

10.3.1.  Newton-Raphson  Method 

The Newton-Raphson (NR) method is a popular gradient method that works especially
well if the objective function is globally concave in $\theta$. In this method

$$\hat{\theta}_{s+1} = \hat{\theta}_s - H_s^{-1} g_s, \tag{10.5}$$

where $g_s$ is defined in (10.2) and

$$H_s = \left. \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'} \right|_{\hat{\theta}_s} \tag{10.6}$$

is the $q \times q$ Hessian matrix evaluated at $\hat{\theta}_s$. These formulas apply to both maximization
and minimization of $Q_N(\theta)$ since premultiplying $Q_N(\theta)$ by minus one changes
the sign of both $H_s^{-1}$ and $g_s$.

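To motivate the NR method, begin with the $s$th-round estimate $\hat{\theta}_s$ for $\theta$. Then by
second-order Taylor series expansion around $\hat{\theta}_s$

$$Q_N(\theta) = Q_N(\hat{\theta}_s) + \left. \frac{\partial Q_N(\theta)}{\partial\theta'} \right|_{\hat{\theta}_s} (\theta - \hat{\theta}_s) + \frac{1}{2}(\theta - \hat{\theta}_s)' \left. \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'} \right|_{\hat{\theta}_s} (\theta - \hat{\theta}_s) + R.$$

Ignoring the remainder term $R$ and using more compact notation, we approximate
$Q_N(\theta)$ by

$$Q_N^*(\theta) = Q_N(\hat{\theta}_s) + g_s'(\theta - \hat{\theta}_s) + \frac{1}{2}(\theta - \hat{\theta}_s)' H_s (\theta - \hat{\theta}_s),$$

where $g_s$ and $H_s$ are defined in (10.2) and (10.6). To maximize the approximation
$Q_N^*(\theta)$ with respect to $\theta$ we set the derivative to zero. Then $g_s + H_s(\theta - \hat{\theta}_s) = 0$,
and solving for $\theta$ yields $\hat{\theta}_{s+1} = \hat{\theta}_s - H_s^{-1} g_s$, which is (10.5). The NR update therefore
maximizes a second-order Taylor series approximation to $Q_N(\theta)$ evaluated at $\hat{\theta}_s$.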

To see whether NR iterations will necessarily increase $Q_N(\theta)$, substitute the
$(s+1)$th-round estimate back into the Taylor series approximation to obtain

$$Q_N(\hat{\theta}_{s+1}) = Q_N(\hat{\theta}_s) - \frac{1}{2}(\hat{\theta}_{s+1} - \hat{\theta}_s)' H_s (\hat{\theta}_{s+1} - \hat{\theta}_s) + R.$$

Ignoring the remainder term, we see that this increases (or decreases) if $H_s$ is negative
(or positive) definite. At a local maximum the Hessian is negative semi-definite, but
away from the maximum this may not be the case even for well-defined problems. If
the NR method strays into such territory it may not necessarily move toward the maximum.
Furthermore the Hessian may then be singular, in which case $H_s^{-1}$ in (10.5) cannot
be computed. Clearly, the NR method works best for maximization (or minimization)
problems if the objective function is globally concave (or convex), as then $H_s$ is always
negative (or positive) definite. In such cases convergence often occurs within
10 iterations.

An additional attraction of the NR method arises if the starting value $\hat{\theta}_1$ is root-$N$
consistent, that is, if $\sqrt{N}(\hat{\theta}_1 - \theta_0)$ has a proper limiting distribution. Then the second-round
estimator $\hat{\theta}_2$ can be shown to have the same asymptotic distribution as the estimator
obtained by iterating to convergence. There is therefore no theoretical gain to
further iteration. An example is feasible GLS, where initial OLS leads to consistent
regression parameter estimates, and these in turn are used to obtain consistent variance
parameter estimates, which are then used to obtain efficient GLS. A second example
is use of easily obtained consistent estimates as starting values before maximizing a
complicated likelihood function. Although there is no need to iterate further, in practice
most researchers still prefer to iterate to convergence unless this is computationally too
time consuming. One advantage of iterating to convergence is that different researchers
should obtain the same parameter estimates, whereas different initial root-$N$ consistent
estimates lead to second-round parameter estimates that will differ even though they
are asymptotically equivalent.
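
A minimal sketch of the NR update (10.5) for a scalar parameter follows. The objective, an intercept-only logit log-likelihood per observation, $Q(\beta) = \bar{y}\beta - \ln(1 + e^{\beta})$, is an assumption chosen because it is globally concave; the function name and tolerance are likewise illustrative.

```python
import math

def newton_raphson(ybar, beta=0.0, tol=1e-10, max_iter=50):
    """Maximize Q(beta) = ybar*beta - ln(1 + exp(beta)) by Newton-Raphson."""
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + math.exp(-beta))
        g = ybar - p                 # gradient
        h = -p * (1.0 - p)           # Hessian, negative for every beta
        beta_new = beta - g / h      # update (10.5) in the scalar case
        if abs(beta_new - beta) < tol:
            return beta_new, iteration
        beta = beta_new
    raise RuntimeError("no convergence within max_iter iterations")

beta_hat, n_iter = newton_raphson(ybar=0.25)
print(beta_hat, n_iter, math.log(0.25 / 0.75))
```

Because the Hessian is negative everywhere, each step is well defined, and the iterations settle on the analytical solution $\ln(\bar{y}/(1-\bar{y}))$ in a handful of rounds.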


10.3.2.  Method  of  Scoring 


A common modification of the NR method is the method of scoring (MS). In this
method the Hessian matrix is replaced by its expected value

$$H_{MS,s} = \left. E\left[ \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'} \right] \right|_{\hat{\theta}_s}. \tag{10.7}$$

This substitution is especially advantageous when applied to the MLE (i.e., $Q_N(\theta) =
N^{-1}\mathcal{L}_N(\theta)$), because the expected value should be negative definite, since by the information
matrix equality (see Section 5.6.3), $H_{MS,s} = -N^{-1}E[\partial\mathcal{L}_N/\partial\theta \times \partial\mathcal{L}_N/\partial\theta']$, where
the expectation is positive definite since it is a covariance matrix. Obtaining the expectation in (10.7)
is possible only for m-estimators and even then may be analytically difficult.

The  method  of  scoring  algorithm  for  the  MLE  of  generalized  linear  models,  such 
as  the  Poisson,  probit,  and  logit,  can  be  shown  to  be  implementable  using  iteratively 
reweighted  least  squares  (see  McCullagh  and  Nelder,  1989).  This  was  advantageous  to 
early  adopters  of  these  models  who  only  had  access  to  an  OLS  program. 

The method of scoring can also be applied to m-estimators other than the MLE,
though then $H_{MS,s}$ may not be negative definite.


10.3.3.  BHHH  Method 


The BHHH method of Berndt, Hall, Hall, and Hausman (1974) uses (10.1) with
weighting matrix $A_s = -H_{BHHH,s}^{-1}$, where the matrix

$$H_{BHHH,s} = -\sum_{i=1}^{N} \left. \frac{\partial q_i(\theta)}{\partial\theta} \frac{\partial q_i(\theta)}{\partial\theta'} \right|_{\hat{\theta}_s} \tag{10.8}$$

and $Q_N(\theta) = \sum_i q_i(\theta)$. Compared to NR, this has the advantage of requiring evaluation
of first derivatives only, offering considerable computational savings.

To justify this method, begin with the method of scoring for the MLE, in which case
$Q_N(\theta) = \sum_i \ln f_i(\theta)$, where $\ln f_i(\theta)$ is the log-density. The information matrix equality
can be expressed as

$$E\left[ \frac{\partial^2 \mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'} \right] = -E\left[ \sum_{i=1}^{N} \frac{\partial \ln f_i(\theta)}{\partial\theta} \sum_{j=1}^{N} \frac{\partial \ln f_j(\theta)}{\partial\theta'} \right],$$

and independence over $i$ implies

$$E\left[ \frac{\partial^2 \mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'} \right] = -\sum_{i=1}^{N} E\left[ \frac{\partial \ln f_i(\theta)}{\partial\theta} \frac{\partial \ln f_i(\theta)}{\partial\theta'} \right].$$

Dropping the expectation leads to (10.8).
The BHHH method can also be applied to estimators other than the MLE, in which
case it is viewed as simply another choice of matrix $A_s$ in (10.1) rather than as an
estimate of the Hessian matrix $H_s$.

The  BHHH  method  is  used  for  many  cross-section  m-estimators  as  it  can  work  well 
and  requires  only  first  derivatives. 
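
A scalar sketch of BHHH iterations follows. The model (an intercept-only Poisson log-likelihood), the data, and the function name `bhhh` are assumptions made for illustration. Only the observation-level scores, that is, first derivatives, are needed.

```python
import math

def bhhh(y, beta=0.0, tol=1e-10, max_iter=200):
    """BHHH iterations for Q_N(beta) = sum_i (y_i*beta - exp(beta)),
    the Poisson log-likelihood with terms not involving beta dropped."""
    for _ in range(max_iter):
        mu = math.exp(beta)
        scores = [yi - mu for yi in y]        # dq_i/dbeta for each observation
        g = sum(scores)                       # gradient of Q_N
        outer = sum(s * s for s in scores)    # minus H_BHHH, the sum of squared scores
        step = g / outer                      # A_s g_s with A_s = -1/H_BHHH
        beta += step
        if abs(step) < tol:
            return beta
    raise RuntimeError("no convergence within max_iter iterations")

y = [0, 1, 2, 3, 4]          # hypothetical count data with sample mean 2
beta_hat = bhhh(y)
print(beta_hat, math.log(2.0))
```

No second derivative appears anywhere in the loop; the sum of squared scores stands in for minus the Hessian.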


10.3.4.  Method  of  Steepest  Ascent 

The method of steepest ascent sets $A_s = I_q$, the simplest choice of weighting matrix.
A line search is then done (see (10.3)) to scale $I_q$ by a constant $\lambda_s$.

The line search can be done manually. In practice it is common to use the optimal
$\lambda$ for the line search, which can be shown to be $\lambda_s = -g_s'g_s/g_s'H_sg_s$, where $H_s$ is the
Hessian matrix. This optimal $\lambda_s$ requires computation of the Hessian, in which case
one might instead use NR. The advantage of steepest ascent rather than NR is that $H_s$
is not inverted, so near-singularity of $H_s$ is less of a problem, though the quadratic form
$g_s'H_sg_s$ still needs to be negative (as it is when $H_s$ is negative definite) to ensure
$\lambda_s > 0$, so that $\lambda_s I_q$ is positive definite as required for a maximum.
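
Steepest ascent with the step size $\lambda_s = -g_s'g_s/g_s'H_sg_s$ can be sketched for a two-parameter quadratic objective. The objective $Q(\theta) = b'\theta - \frac{1}{2}\theta'M\theta$, the matrix $M$, the vector $b$, and the function name are assumptions chosen so that the answer is known in closed form.

```python
def steepest_ascent(M, b, theta, iterations=200):
    """Maximize Q(theta) = b'theta - 0.5*theta'M theta for a 2x2 positive definite M."""
    for _ in range(iterations):
        # gradient g = b - M theta; the Hessian is H = -M
        g = [b[i] - sum(M[i][j] * theta[j] for j in range(2)) for i in range(2)]
        gg = sum(gi * gi for gi in g)
        if gg < 1e-24:                       # gradient is numerically zero
            break
        gHg = -sum(g[i] * M[i][j] * g[j] for i in range(2) for j in range(2))
        lam = -gg / gHg                      # step size; positive since g'Hg < 0
        theta = [theta[i] + lam * g[i] for i in range(2)]
    return theta

theta_hat = steepest_ascent([[2.0, 0.0], [0.0, 10.0]], [2.0, -20.0], [0.0, 0.0])
print(theta_hat)   # the maximizer solves M theta = b, here (1, -2)
```

The Hessian enters only through the scalar $g_s'H_sg_s$, so no matrix inversion is required.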


10.3.5.  DFP  and  BFGS  Methods 

The DFP algorithm due to Davidon, Fletcher, and Powell is a gradient method with
weighting matrix $A_s$ that is positive definite and requires computation of only first
derivatives, unlike NR, which requires computation of the Hessian. Here the method
is presented without derivation.

The weighting matrix $A_s$ is computed by the recursion

$$A_s = A_{s-1} + \frac{\delta_{s-1}\delta_{s-1}'}{\delta_{s-1}'\gamma_{s-1}} + \frac{A_{s-1}\gamma_{s-1}\gamma_{s-1}'A_{s-1}}{\gamma_{s-1}'A_{s-1}\gamma_{s-1}}, \tag{10.9}$$

where $\delta_{s-1} = A_{s-1}g_{s-1}$ and $\gamma_{s-1} = g_s - g_{s-1}$. By inspection of the right-hand side
of (10.9), $A_s$ will be positive definite provided the initial $A_0$ is positive definite (e.g.,
$A_0 = I_q$).

The procedure converges quite well in many statistical applications. Eventually $A_s$
goes to the theoretically preferred $-H_s^{-1}$. In principle this method can also provide
an approximate estimate of the inverse of the Hessian for use in computation of standard
errors, without needing either second derivatives or matrix inversion. In practice,
however, this estimate can be a poor one.

A refinement of the DFP algorithm is the BFGS algorithm of Broyden, Fletcher,
Goldfarb, and Shanno with

$$A_s = A_{s-1} + \frac{\delta_{s-1}\delta_{s-1}'}{\delta_{s-1}'\gamma_{s-1}} + \frac{A_{s-1}\gamma_{s-1}\gamma_{s-1}'A_{s-1}}{\gamma_{s-1}'A_{s-1}\gamma_{s-1}} - (\gamma_{s-1}'A_{s-1}\gamma_{s-1})\,\eta_{s-1}\eta_{s-1}', \tag{10.10}$$

where $\eta_{s-1} = (\delta_{s-1}/\delta_{s-1}'\gamma_{s-1}) - (A_{s-1}\gamma_{s-1}/\gamma_{s-1}'A_{s-1}\gamma_{s-1})$.


10.3.6.  Gauss-Newton  Method


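The Gauss-Newton (GN) method is an iterative method for the NLS estimator that
can be implemented by iterative OLS.

Specifically, for NLS with conditional mean function $g(x_i, \beta)$, the GN method sets
the parameter change vector $(\hat{\beta}_{s+1} - \hat{\beta}_s)$ equal to the OLS coefficient estimates from
the artificial regression

$$y_i - g(x_i, \hat{\beta}_s) = \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} \beta + v_i. \tag{10.11}$$

Equivalently, $\hat{\beta}_{s+1}$ equals the OLS coefficient estimates from the artificial regression

$$y_i - g(x_i, \hat{\beta}_s) + \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} \hat{\beta}_s = \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} \beta + v_i. \tag{10.12}$$

To derive this method, let $\hat{\beta}_s$ be a starting value, approximate $g(x_i, \beta)$ by a first-order
Taylor series expansion

$$g(x_i, \beta) = g(x_i, \hat{\beta}_s) + \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} (\beta - \hat{\beta}_s),$$

and substitute this in the least-squares objective function $Q_N(\beta)$ to obtain the
approximation

$$Q_N^*(\beta) = \sum_i \left( y_i - g(x_i, \hat{\beta}_s) - \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} (\beta - \hat{\beta}_s) \right)^2.$$

But this is the sum of squared residuals for OLS regression of $y_i - g(x_i, \hat{\beta}_s)$ on
$\partial g_i/\partial\beta'|_{\hat{\beta}_s}$ with parameter vector $(\beta - \hat{\beta}_s)$, leading to (10.11). More formally,

$$\hat{\beta}_{s+1} = \hat{\beta}_s + \left[ \sum_i \left. \frac{\partial g_i}{\partial\beta} \right|_{\hat{\beta}_s} \left. \frac{\partial g_i}{\partial\beta'} \right|_{\hat{\beta}_s} \right]^{-1} \sum_i \left. \frac{\partial g_i}{\partial\beta} \right|_{\hat{\beta}_s} (y_i - g(x_i, \hat{\beta}_s)). \tag{10.13}$$

This is the gradient method (10.1) with vector $g_s = \sum_i \partial g_i/\partial\beta|_{\hat{\beta}_s} (y_i - g(x_i, \hat{\beta}_s))$
weighted by matrix $A_s = [\sum_i \partial g_i/\partial\beta \times \partial g_i/\partial\beta'|_{\hat{\beta}_s}]^{-1}$.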

The  iterative  method  (10.13)  equals  the  method  of  scoring  variant  of  the  Newton- 
Raphson  algorithm  for  NLS  estimation  since,  from  Section  5.8,  the  second  sum  on  the 
right-hand  side  is  the  gradient  vector  and  the  first  sum  is  minus  the  expected  value 
of  the  Hessian  (see  also  Section  10.3.9).  The  Gauss-Newton  algorithm  is  therefore  a 
special  case  of  the  Newton-Raphson,  and  NR  is  emphasized  more  here  as  it  can  be 
applied  to  a  much  wider  range  of  problems  than  can  GN. 
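
The artificial regression (10.11) is easy to code when there is a single parameter, since OLS without an intercept then reduces to a ratio of two sums. The sketch below assumes the model $E[y|x] = \exp(\beta x)$ with scalar $x$ and uses noise-free hypothetical data so that the NLS estimate is known to be 0.5; the function name is illustrative.

```python
import math

def gauss_newton(x, y, beta=0.0, tol=1e-12, max_iter=50):
    """Gauss-Newton for the scalar-parameter model E[y|x] = exp(beta * x)."""
    for _ in range(max_iter):
        fitted = [math.exp(beta * xi) for xi in x]
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        deriv = [xi * fi for xi, fi in zip(x, fitted)]   # dg_i/dbeta at the current beta
        # OLS slope from regressing the residual on the derivative (no intercept)
        step = sum(d * r for d, r in zip(deriv, resid)) / sum(d * d for d in deriv)
        beta += step
        if abs(step) < tol:
            return beta
    raise RuntimeError("no convergence within max_iter iterations")

x = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(0.5 * xi) for xi in x]   # noise-free data, so the NLS estimate is 0.5
beta_hat = gauss_newton(x, y)
print(beta_hat)
```

Each round is one OLS regression, which is why the method was convenient when only a linear regression routine was available.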


10.3.7.  Expectation  Maximization 

There are a number of data and model formulations considered in this book that can be
thought of as involving incomplete or missing data. For example, outcome variables of
interest (e.g., expenditure or the length of a spell in some state) may be right-censored.
That is, for some cases we may observe the actual expenditure or spell length, whereas
in other cases we may only know that the outcome exceeded some specific value, say
$c^*$. A second example involves a multiple regression in which the data matrix looks as
follows:

$$\begin{bmatrix} y_1 & X_1 \\ ? & X_2 \end{bmatrix},$$

where ? stands for missing data. Here we envisage a situation in which we wish to
estimate a linear regression model $y = X\beta + u$, where $y' = [y_1' \;\; ?]$, $X' = [X_1' \;\; X_2']$,
but a subset of variables $y$ is missing. A third example involves estimating the parameters
$(\theta_1, \theta_2, \ldots, \theta_C, \pi_1, \ldots, \pi_C)$ of a $C$-component mixture distribution, also called a
latent class model, $h(y|X) = \sum_{j=1}^{C} \pi_j f_j(y_j|X_j, \theta_j)$, where $f_j(y_j|X_j, \theta_j)$ are well-defined
pdfs. Here $\pi_j$ ($j = 1, \ldots, C$) are unknown sampling fractions corresponding
to the $C$ latent densities from which the observations are sampled. It is convenient to
think of this problem also as a missing data problem in the sense that if the sampling
fractions were known constants then estimation would be simpler.

The expectation maximization (EM) framework provides a unifying framework
for developing algorithms for problems that can be interpreted as involving missing
data. Although particular solutions to this type of estimation problem have long
been found in the literature, Dempster, Laird, and Rubin (1977) provided a definitive
treatment.

Let $y$ denote the vector dependent variable of interest, determined by the underlying
latent variable vector $y^*$. Let $f^*(y^*|X, \theta)$ denote the joint density of the latent
variables, conditional on regressors $X$, and let $f(y|X, \theta)$ denote the joint density of
the observed variables. Let there be a many-to-one mapping from the sample space
of $y^*$ to that of $y$; that is, the value of the latent variable $y^*$ uniquely determines
$y$, but the value of $y$ does not uniquely determine $y^*$. It follows that $f(y|X, \theta) =
f^*(y^*|X, \theta)/f(y^*|y, X, \theta)$, since from Bayes rule the conditional density $f(y^*|y) =
f(y, y^*)/f(y) = f^*(y^*)/f(y)$, where the final equality uses $f(y^*, y) = f^*(y^*)$ as $y^*$
uniquely determines $y$. Rearranging gives $f(y) = f^*(y^*)/f(y^*|y)$.

The MLE maximizes

$$Q_N(\theta) = \frac{1}{N}\mathcal{L}_N(\theta) = \frac{1}{N}\ln f^*(y^*|X, \theta) - \frac{1}{N}\ln f(y^*|y, X, \theta). \tag{10.14}$$

Because $y^*$ is unobserved the first term in the log-likelihood is ignored. The second
term is replaced by its expected value, which will not involve $y^*$, where at the $s$th
round this expectation is evaluated at $\theta = \hat{\theta}_s$.

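The expectation (E) part of the EM algorithm calculates

$$Q_N(\theta|\hat{\theta}_s) = -E\left[ \left. \frac{1}{N}\ln f(y^*|y, X, \theta) \right| y, X, \hat{\theta}_s \right], \tag{10.15}$$

where expectation is with respect to the density $f(y^*|y, X, \hat{\theta}_s)$. The maximization (M)
part of the EM algorithm maximizes $Q_N(\theta|\hat{\theta}_s)$ to obtain $\hat{\theta}_{s+1}$.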

The full EM algorithm is iterative. The likelihood is maximized, given the expected
value of the latent variable; the expected value is evaluated afresh given the current
value of $\theta$. The iterative process continues until convergence is achieved. The EM
algorithm has the advantage of always leading to an increase or constancy in $Q_N(\theta)$;
see Amemiya (1985, p. 376). The EM algorithm is applied to a latent class model in
Section 18.5.3 and to missing data in Section 27.5.

There is a very extensive literature on situations where the EM algorithm can be
usefully applied, even though it can be applied to only a subset of optimization problems.
The EM algorithm is easy to program in many cases and its use was further encouraged
by considerations of limited computing power and storage that are no longer
paramount. Despite these attractions, for censored data models and latent class models
direct estimation using Newton-Raphson type iterative procedures is often found to be
faster and more efficient computationally.
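
To give a feel for the alternating E and M steps, the sketch below applies EM to a two-component normal mixture with unit variances and no regressors. This is the standard textbook form of EM for a mixture, assumed here for illustration rather than taken from the text; the data, starting values, and function names are hypothetical. The log-likelihood is recorded at every round so that its non-decreasing path can be inspected.

```python
import math

def normal_pdf(y, mean):
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(y, pi1, mu1, mu2):
    return sum(math.log(pi1 * normal_pdf(v, mu1) + (1.0 - pi1) * normal_pdf(v, mu2))
               for v in y)

def em_mixture(y, pi1=0.5, mu1=1.0, mu2=3.0, iterations=50):
    """EM for a two-component normal mixture with unit variances (an assumption)."""
    history = [log_likelihood(y, pi1, mu1, mu2)]
    for _ in range(iterations):
        # E step: posterior probability that each observation comes from component 1
        w = [pi1 * normal_pdf(v, mu1) /
             (pi1 * normal_pdf(v, mu1) + (1.0 - pi1) * normal_pdf(v, mu2)) for v in y]
        # M step: update the sampling fraction and the two component means
        total = sum(w)
        pi1 = total / len(y)
        mu1 = sum(wi * v for wi, v in zip(w, y)) / total
        mu2 = sum((1.0 - wi) * v for wi, v in zip(w, y)) / (len(y) - total)
        history.append(log_likelihood(y, pi1, mu1, mu2))
    return pi1, mu1, mu2, history

y = [-0.5, 0.1, 0.4, -0.2, 0.3, 3.8, 4.2, 4.5, 3.6]
pi1, mu1, mu2, history = em_mixture(y)
print(pi1, mu1, mu2)
```

Each round needs only weighted averages, with no derivatives, which is one reason the algorithm is easy to program.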


10.3.8.  Simulated  Annealing 

Simulated  annealing  (SA)  is  a  nongradient  iterative  method  reviewed  by  Goffe, 
Ferrier,  and  Rogers  (1994).  It  differs  from  gradient  methods  in  permitting  movements 
that  decrease  rather  than  increase  the  objective  function  to  be  maximized,  so  that  one 
is  not  locked  in  to  moving  steadily  toward  one  particular  local  maximum. 

Given  a  value  Gs  at  the  .vth  round  we  perturb  the  / tli  component  of  6S  to  obtain  a 
new  trial  value  of 

0*  =  0S  +  [0  •  •  •  0  (Xjrj)  0  •  ■  ■  0]' ,  (10.16) 

where  /.  /  is  a  prespecified  step  length  and  r}  is  a  draw  from  a  uniform  distribution  on 
(— 1,  1).  The  new  trial  value  is  used,  that  is,  the  method  sets  0i+i  =  0*.  if  it  increases 
the  objective  function,  or  if  it  does  not  increase  the  value  of  the  objective  function  but 
does  pass  the  Metropolis  criterion  that 

exp  {(Qn(6*s)  -  QN(es))/Ts )  >  u,  (10.17) 

where  u  is  a  drawing  from  a  uniform  (0,  1)  distribution  and  Ts  is  a  scaling  parameter 
called  the  temperature.  Thus  not  only  uphill  moves  are  accepted,  but  downhill  moves 
are  also  accepted  with  a  probability  that  decreases  with  the  difference  between  Q  v  (0*) 
and  Qn(Qs)  and  that  increases  with  the  temperature.  The  terms  simulated  annealing 
and  temperature  come  from  analogy  with  minimizing  thermal  energy  by  slowly  cool¬ 
ing  (annealing)  a  molten  metal. 

The  user  needs  to  set  the  step-size  parameter  Xj .  Goffe  et  al.  (1994)  suggest  period¬ 
ically  adjusting  Xj  so  that  50%  of  all  moves  over  a  number  of  iterations  are  accepted. 
The  temperature  also  needs  to  be  chosen  and  reduced  during  the  course  of  iterations. 
Then  the  algorithm  initially  is  searching  over  a  wide  range  of  parameter  values  before 
steadily  locking  in  on  a  particular  region. 
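
The perturbation (10.16) and the Metropolis criterion (10.17) can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of Goffe et al. (1994): the function name `simulated_annealing`, the geometric cooling schedule `temp0 * cool**s`, the fixed step lengths, and the two-peaked test objective are all assumptions made for the example.

```python
import math
import random

def simulated_annealing(q, theta0, step, temp0=1.0, cool=0.95, n_sweeps=200, seed=0):
    """Maximize q(theta) by coordinate-wise simulated annealing.

    step[j] plays the role of lambda_j in (10.16). The geometric cooling
    schedule temp0 * cool**s is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    q_cur = q(theta)
    best, q_best = list(theta), q_cur
    for s in range(n_sweeps):
        temp = temp0 * cool ** s                      # temperature T_s
        for j in range(len(theta)):
            trial = list(theta)
            trial[j] += step[j] * rng.uniform(-1.0, 1.0)   # perturbation (10.16)
            q_trial = q(trial)
            # Uphill moves are always accepted; downhill moves must pass (10.17).
            if q_trial >= q_cur or math.exp((q_trial - q_cur) / temp) > rng.random():
                theta, q_cur = trial, q_trial
                if q_cur > q_best:                    # remember the best point visited
                    best, q_best = list(theta), q_cur
    return best, q_best

def q_bimodal(theta):
    """Hypothetical objective: global maximum near 2, local maximum near -2."""
    t = theta[0]
    return math.exp(-(t - 2.0) ** 2) + 0.5 * math.exp(-(t + 2.0) ** 2)

# Start at the local maximum, where a gradient method would stay put.
best, q_best = simulated_annealing(q_bimodal, theta0=[-2.0], step=[1.0])
print(best, q_best)
```

Recording the best point visited is a convenience added here; whether the search leaves the local maximum in a given run depends on the random draws, the step length, and the cooling schedule.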

Fast simulated annealing (FSA), proposed by Szu and Hartley (1987), is a faster
method. It replaces the uniform $(-1, 1)$ random number $r_j$ by a Cauchy random variable
$r_j$ scaled by the temperature and permits a fixed step length $v_j$. The method also
uses a simpler adjustment of the temperature over iterations with $T_s$ equal to the initial
temperature divided by the number of FSA iterations, where one iteration is a full
cycle over the $q$ components of $\boldsymbol{\theta}$.

Cameron  and  Johansson  (1997)  discuss  and  use  simulated  annealing,  following  the 
methods of Horowitz (1992). This begins with FSA but on grounds of computational
savings switches to gradient methods (BFGS) when relatively little change in $Q_N(\cdot)$
occurs over a number of iterations or after many (250) FSA iterations. In a simulation
they find that NR with a number of different starting values offers a considerable
improvement over NR with just one set of starting values, but even better is FSA with a
number  of  different  starting  values. 


10.3.9.  Example:  Exponential  Regression 
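Consider the nonlinear regression model with exponential conditional mean

$\mathrm{E}[y_i | \mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$,  (10.18)

where $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are $K \times 1$ vectors. The NLS estimator $\hat{\boldsymbol{\beta}}$ minimizes

$Q_N(\boldsymbol{\beta}) = \sum_i (y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}))^2$,  (10.19)

where for notational simplicity scaling by $2/N$ is ignored. The first-order conditions
are nonlinear in $\boldsymbol{\beta}$ and there is no explicit solution for $\hat{\boldsymbol{\beta}}$. Instead, gradient methods
need to be used.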

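For this example the gradient and Hessian are, respectively,

$\mathbf{g} = -2 \sum_i (y_i - e^{\mathbf{x}_i'\boldsymbol{\beta}}) e^{\mathbf{x}_i'\boldsymbol{\beta}} \mathbf{x}_i$  (10.20)

and

$\mathbf{H} = 2 \sum_i \{ e^{\mathbf{x}_i'\boldsymbol{\beta}} e^{\mathbf{x}_i'\boldsymbol{\beta}} \mathbf{x}_i \mathbf{x}_i' - (y_i - e^{\mathbf{x}_i'\boldsymbol{\beta}}) e^{\mathbf{x}_i'\boldsymbol{\beta}} \mathbf{x}_i \mathbf{x}_i' \}$.  (10.21)

The NR iterative method (10.5) uses $\mathbf{g}_s$ and $\mathbf{H}_s$ equal to (10.20) and (10.21) evaluated
at $\hat{\boldsymbol{\beta}}_s$.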

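A simpler method-of-scoring variation of NR notes that (10.18) implies

$\mathrm{E}[\mathbf{H}] = 2 \sum_i e^{\mathbf{x}_i'\boldsymbol{\beta}} e^{\mathbf{x}_i'\boldsymbol{\beta}} \mathbf{x}_i \mathbf{x}_i'$.  (10.22)

Using $\mathrm{E}[\mathbf{H}_s]$ in place of $\mathbf{H}_s$ yields

$\hat{\boldsymbol{\beta}}_{s+1} - \hat{\boldsymbol{\beta}}_s = \left[ \sum_i e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s} e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s} \mathbf{x}_i \mathbf{x}_i' \right]^{-1} \sum_i e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s} \mathbf{x}_i (y_i - e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s})$.

It follows that $\hat{\boldsymbol{\beta}}_{s+1} - \hat{\boldsymbol{\beta}}_s$ can be computed from OLS regression of $(y_i - e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s})$ on
$e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s} \mathbf{x}_i$. This is also the Gauss-Newton regression (10.11), since $\partial g(\mathbf{x}_i, \boldsymbol{\beta})/\partial \boldsymbol{\beta} =
\exp(\mathbf{x}_i'\boldsymbol{\beta}) \mathbf{x}_i$ for the exponential conditional mean (10.18). Specialization to
$\exp(\mathbf{x}_i'\boldsymbol{\beta}) = \exp(\beta)$ gives the iterative procedure presented in Section 10.2.4.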
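
The OLS-based update just described is short to code. The sketch below assumes NumPy and simulated data with arbitrarily chosen parameter values; the function name `nls_exponential` is hypothetical. It regresses the current residual on $e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}_s}\mathbf{x}_i$ and adds the resulting coefficients to $\hat{\boldsymbol{\beta}}_s$ until the change is negligible. No step-size adjustment is included, so convergence from poor starting values is not guaranteed.

```python
import numpy as np

def nls_exponential(y, X, beta0, tol=1e-10, max_iter=100):
    """NLS for E[y|x] = exp(x'beta) using the method-of-scoring (Gauss-Newton) update."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        mu = np.exp(X @ beta)                 # fitted conditional mean
        Z = mu[:, None] * X                   # row i is exp(x_i'beta) * x_i'
        resid = y - mu
        # OLS of the residual on Z gives beta_{s+1} - beta_s.
        delta, *_ = np.linalg.lstsq(Z, resid, rcond=None)
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:
            break
    return beta

# Simulated data (assumed design: intercept plus one standard normal regressor).
rng = np.random.default_rng(0)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, 0.3])
y = np.exp(X @ beta_true) + rng.normal(scale=0.5, size=N)

beta_hat = nls_exponential(y, X, beta0=np.zeros(2))
mu_hat = np.exp(X @ beta_hat)
grad = -2.0 * X.T @ (mu_hat * (y - mu_hat))   # gradient (10.20) at the solution
print(beta_hat, grad)
```

Printing the gradient (10.20) at the final estimate is a simple check that the first-order conditions are approximately satisfied.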


10.4.  Practical  Considerations 

Some practical issues have already been presented in Section 10.2, notably convergence
criteria, modifications such as step-size adjustment, and the use of numerical
rather than analytical derivatives. In this section a brief overview of statistical packages
is given, followed by a discussion of common pitfalls that can arise in computation of
a nonlinear estimator.


10.4.1.  Statistical  Packages 

All  standard  microeconometric  packages  such  as  Limdep,  Stata,  PCTSP,  and  SAS  have 
built-in  procedures  to  estimate  basic  nonlinear  models  such  as  logit  and  probit.  These 
packages  are  simple  to  use,  requiring  no  knowledge  of  iterative  methods  or  even  of  the 
model  being  used.  For  example,  the  command  for  logit  regression  might  be  “logit  y 
x”  rather  than  the  command  “ols  y  x”  for  OLS.  Nonlinear  least  squares  requires  some 
code to convey to the package the particular functional form for $g(\mathbf{x}, \boldsymbol{\beta})$ one wishes
to  specify.  Estimation  should  be  quick  and  accurate  as  the  program  should  exploit  the 
structure  of  the  particular  model.  For  example,  if  the  objective  function  is  globally 
concave  then  the  method  of  scoring  might  be  used. 

If a statistical package does not contain a particular model then one needs to write
one's own code. This situation can arise with even minor variation of standard models,
such as imposing restrictions on parameters or using parameterizations that are
not of single-index form. The code may be written using one's own favorite statistical
package or using other more specialized programming languages. Possibilities include
(1) built-in optimization procedures within the statistical package that require
specification of the objective function and possibly its derivatives; (2) matrix commands
within the statistical package to compute $\mathbf{A}_s$ and $\mathbf{g}_s$ and iterate; (3) a matrix
programming language such as Gauss, Matlab, OX, SAS/IML, or S-Plus, and possibly add-on
optimization routines; (4) a programming language such as Fortran or C++; and (5) an
optimization package such as those in GAMS, GQOPT, or NAGLIB.

The  first  and  second  methods  are  attractive  because  they  do  not  force  the  user  to 
learn a new program. The first method is particularly simple for m-estimation as it can
require merely specification of the subfunction $q_i(\boldsymbol{\theta})$ for the $i$th observation rather than
specification of $Q_N(\boldsymbol{\theta})$. In practice, however, the optimization procedures for
user-defined functions in the standard packages are more likely to encounter numerical
problems  than  if  more  specialized  programs  are  used.  Moreover,  for  some  packages 
the  second  method  can  require  learning  arcane  forms  of  matrix  commands. 

For  nonlinear  problems,  the  third  method  is  the  best,  although  this  might  require  the 
user to learn a matrix programming language from scratch. One then is set up to handle
virtually any econometric problem encountered, and the optimization routines that
come  with  matrix  programming  languages  are  usually  adequate.  Also,  many  authors 
make  available  the  code  used  in  specific  papers. 

The fourth and fifth methods generally require a higher level of programming
sophistication than the third method. The fourth method can lead to much faster
computation and the fifth method can solve the most numerically challenging optimization
problems. 

Other  practical  issues  include  cost  of  software;  the  software  used  by  colleagues;  and 
whether  the  software  has  clear  error  messages  and  useful  debugging  features,  such  as  a 
trace program that tracks line-by-line program execution. The value of using software
similar to that used by other colleagues should not be underestimated.


Table 10.2. Computational Difficulties: A Partial Checklist

Problem                         Check
Data read incorrectly           Print full descriptive statistics.
Imprecise calculation           Use analytical derivatives or numerical with different step size h.
Multicollinearity               Check condition number of X'X. Try subset of regressors.
Singular matrix in iterations   Try method not requiring matrix inversion such as DFP.
Poor starting values            Try a range of different starting values.
Model not identified            Difficult to check. Obvious are dummy variable traps.
Strange parameter values        Constant included/excluded? Iterations actually converged?
Different standard errors       Which method was used to calculate variance matrix?

10.4.2.  Computational  Difficulties 

Computational  difficulties  are,  in  practice,  situations  where  it  is  not  possible  to  obtain 
an  estimate  of  the  parameters.  For  example,  an  error  message  may  indicate  that  the 
estimator cannot be calculated because the Hessian is singular. There are many possible
reasons for this, as detailed in the following and summarized in Table 10.2. These
reasons may also provide explanation for another common situation of parameter estimates
that are obtained but are seemingly in error.

First,  the  data  may  not  have  been  read  in  correctly.  This  is  a  remarkably  common 
oversight.  With  large  data  sets  it  is  not  practical  to  print  out  all  the  data.  However,  at  a 
minimum one should always obtain descriptive statistics and check for anomalies such
as incorrect range for a variable, unusually large or small sample mean, and unusually
large or small standard deviation (including a value of zero, which indicates no
variation).  See  Section  3.5.4  for  further  details. 

Second, there may be calculation errors. To minimize these all calculations should
be done in double precision or even quadruple precision rather than single precision.
It is helpful to rescale the data so that the regressors have similar means and variances.
For example, it may be better to use annual income in thousands of dollars rather than
in dollars. If numerical derivatives are used it may be necessary to alter the change
value $h$ in (10.4). Care needs to be paid to how functions are evaluated. For example,
the function $\ln \Gamma(y)$, where $\Gamma(\cdot)$ is the gamma function, is best evaluated using the
log-gamma function. If instead one evaluates the gamma function followed by the log
function considerable numerical error arises even for moderate sized $y$.
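
The log-gamma point is easy to see in practice. The short Python sketch below uses only the standard library; the helper name `log_gamma_naive` is hypothetical. It contrasts `math.lgamma` with taking the log of `math.gamma`. In double precision the gamma function itself overflows once $y$ is somewhat above 170, so the two-step calculation fails outright, whereas the log-gamma function continues to work.

```python
import math

def log_gamma_naive(y):
    """ln Gamma(y) computed as log(gamma(y)); fails once Gamma(y) overflows."""
    return math.log(math.gamma(y))

for y in (10.0, 100.0, 200.0):
    direct = math.lgamma(y)                  # log-gamma evaluated directly
    try:
        naive = log_gamma_naive(y)
    except OverflowError:
        naive = float("nan")                 # Gamma(y) exceeds the largest double
    print(y, direct, naive)
```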

Third, multicollinearity may be a problem. In single-index models (see Section 5.2.4)
the usual checks for multicollinearity will carry over. The correlation matrix
for the regressors can be printed, though this only considers pairwise correlation. Better
is to use the condition number of $\mathbf{X}'\mathbf{X}$, that is, the square root of the ratio of the
largest to smallest eigenvalue of $\mathbf{X}'\mathbf{X}$. If this exceeds 100 then problems may arise. For
more highly nonlinear models than single-index ones it is possible to have problems
even if the condition number is not large. If one suspects multicollinearity is causing
numerical problems then see whether it is possible to estimate the model with a subset
of the variables that are less likely to be collinear.
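
The condition number as defined above takes a few lines to compute. The sketch below assumes NumPy; the function name `condition_number` and the simulated regressors are illustrative only. One design has two nearly collinear columns, the other has unrelated columns.

```python
import numpy as np

def condition_number(X):
    """Square root of the ratio of the largest to smallest eigenvalue of X'X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)    # X'X is symmetric; eigenvalues ascend
    return float(np.sqrt(eigvals[-1] / eigvals[0]))

rng = np.random.default_rng(0)
N = 500
x1 = rng.normal(size=N)
x2 = x1 + 0.001 * rng.normal(size=N)         # nearly collinear with x1
X_bad = np.column_stack([np.ones(N), x1, x2])
X_ok = np.column_stack([np.ones(N), x1, rng.normal(size=N)])

cond_bad = condition_number(X_bad)
cond_ok = condition_number(X_ok)
print(cond_bad, cond_ok)                     # compare each with the threshold of 100
```

For a full-rank regressor matrix this quantity coincides with the ratio of the largest to smallest singular value of $\mathbf{X}$, which is what `numpy.linalg.cond(X)` returns by default.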

Fourth, a noninvertible Hessian during iterations does not necessarily imply singularity
at the true maximum. It is worthwhile trying a range of iterative methods such
as  steepest  ascent  with  line  search  and  DFP,  not  just  Newton-Raphson.  This  problem 
may  also  result  from  multicollinearity. 

Fifth,  try  different  starting  values.  The  iterative  gradient  methods  are  designed  to 
obtain  a  local  maximum  rather  than  the  global  maximum.  One  way  to  guard  against 
this is to begin iterations at a wide range of starting values. A second way is to perform
a grid search. Both of these approaches theoretically require evaluations at many
different points if the dimension of $\boldsymbol{\theta}$ is large, but it may be sufficient to do a detailed
analysis  for  a  stripped-down  version  of  the  model  that  includes  just  the  few  regressors 
thought  to  be  most  statistically  significant. 
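
A small sketch of the multiple-starting-values strategy follows. It is an illustration under stated assumptions rather than a recommended routine: the scalar objective `q` with two peaks, its derivative `dq`, the fixed-step gradient ascent, and the grid of starting values are all hypothetical choices.

```python
import math

def q(theta):
    """Objective with a global maximum near 2 and a lower local maximum near -2."""
    return math.exp(-(theta - 2.0) ** 2) + 0.5 * math.exp(-(theta + 2.0) ** 2)

def dq(theta):
    """Analytical derivative of q."""
    return (-2.0 * (theta - 2.0) * math.exp(-(theta - 2.0) ** 2)
            - (theta + 2.0) * math.exp(-(theta + 2.0) ** 2))

def gradient_ascent(theta, step=0.2, tol=1e-10, max_iter=10_000):
    """Fixed-step gradient ascent; converges only to a nearby local maximum."""
    for _ in range(max_iter):
        move = step * dq(theta)
        theta += move
        if abs(move) < tol:
            break
    return theta

starts = [-4.0 + 0.5 * k for k in range(17)]          # grid of starts from -4 to 4
local_maxima = [gradient_ascent(s) for s in starts]
best = max(local_maxima, key=q)                       # keep the highest local maximum
print(sorted({round(m, 4) for m in local_maxima}), best)
```

Starting values on different sides of the trough lead to different local maxima, and only the comparison of objective values across starts identifies the higher one.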

Lastly,  the  model  may  not  be  identified.  Indeed  a  standard  necessary  condition  for 
model identification is that the Hessian be invertible. As with linear models, simple
checks include avoiding dummy variable traps and, if a subset of data is being
used  in  initial  analysis,  determining  that  all  variables  in  the  subset  of  the  data  have 
some  variation.  For  example,  if  data  are  ordered  by  gender  or  by  age  or  by  region 
then  problems  can  arise  if  these  appear  as  indicator  variables  and  the  chosen  subset 
is  of  individuals  of  a  particular  gender,  age,  or  region.  For  nonlinear  models  it  can 
be  difficult  to  theoretically  determine  that  the  model  is  not  identified.  Often  one  first 
eliminates  all  other  potential  causes  before  returning  to  a  careful  analysis  of  model 
identification. 

Even after parameter estimates are successfully obtained computational problems
can still arise, as it may not be possible to obtain estimates of the variance matrix
$\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$. This situation can arise when the iterative method used, such as DFP, does
not use the inverse Hessian matrix $\mathbf{A}^{-1}$ as the weighting matrix in the iterations. First check
that the iterative method has indeed converged rather than, for example, stopping at
a default maximum number of iterations. If convergence has occurred, try alternative
estimates of $\mathbf{A}$, using the expected Hessian or using more accurate numerical
computations by, for example, using analytical rather than numerical derivatives. If such
solutions still fail it is possible that the model is not identified, with this nonidentification
being finessed at the parameter estimation stage by using an iterative method that
did not compute the Hessian.

Other  perceived  computational  problems  are  parameter  and  variance  estimates  that 
do not accord with prior beliefs. For parameter estimates obvious checks include ensuring
correct treatment of an intercept term (inclusion or exclusion, depending on the
context), that convergence has been achieved, and that a global maximum is obtained
(by trying a range of starting values). If standard errors of parameter estimates differ
across statistical packages that give the same parameter estimates, the most likely
cause  is  that  a  different  method  has  been  used  to  construct  the  variance  matrix  estimate 
(see  Section  5.5.2). 

A good computational strategy is to start with a small subset of the data and regressors,
say one regressor and 100 observations. This simplifies detailed tracing of the
program either manually, such as by printing out key output along the way, or using
a built-in trace facility if the program has one. If the program passes this test then
computational problems with the full model and data are less likely to be due to incorrect
data input or coding errors and are more likely due to genuine computational
difficulties  such  as  multicollinearity  or  poor  starting  values. 

A  good  way  to  test  program  validity  is  to  construct  a  simulated  data  set  where  the 
true  parameters  are  known.  For  a  large  sample  size,  say  N  =  10,000,  the  estimated 
parameter  values  should  be  close  to  the  true  values. 
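
As an illustration of this check, the sketch below simulates logit data with known parameters and confirms that a hand-coded Newton-Raphson routine recovers them. It assumes NumPy; the function name `logit_mle`, the design, and the parameter values are arbitrary choices for the example.

```python
import numpy as np

def logit_mle(y, X, max_iter=50, tol=1e-10):
    """Logit MLE by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                          # score vector
        hess = -(X * (p * (1.0 - p))[:, None]).T @ X  # Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)
        beta = beta - step                            # NR update
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(12345)
N = 10_000
beta_true = np.array([-0.5, 1.0])                     # known "true" parameters
X = np.column_stack([np.ones(N), rng.normal(size=N)])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=N) < p_true).astype(float)      # simulated binary outcomes

beta_hat = logit_mle(y, X)
print("true:", beta_true, "estimated:", beta_hat)
```

If the estimates were far from the true values at this sample size, a coding error in the gradient, the Hessian, or the data generation would be the first thing to suspect.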

Finally,  note  that  obtaining  reasonable  computational  results  from  estimation  of  a 
nonlinear model does not guarantee correct results. For example, many early published
applications of multinomial probit models reported apparently sensible results,
yet  the  models  estimated  have  subsequently  been  determined  to  be  not  identified  (see 
Section  15.8.1). 


10.5.  Bibliographic  Notes 

Numerical  problems  can  arise  even  in  linear  models,  and  it  is  instructive  to  read  Davidson  and 
MacKinnon (1993, Section 1.5) and Greene (2003, appendix E). Standard references for statistical
computation are Kennedy and Gentle (1980) and especially Press et al. (1993) and related
co-authored  books  by  Press.  For  evaluation  of  functions  the  standard  reference  is  Abramowitz 
and  Stegun  (1971).  Quandt  (1983)  presents  many  computational  issues,  including  optimization. 

5.3  Summaries  of  iterative  methods  are  given  in  Amemiya  (1985,  Section  4.4),  Davidson  and 
MacKinnon  (1993,  Section  6.7),  Maddala  (1977,  Section  9.8),  and  especially  Greene  (2003, 
appendix  E.6).  Harvey  (1990)  gives  many  applications  of  the  GN  algorithm,  which,  owing 
to  its  simplicity,  is  the  usual  iterative  method  for  NLS  estimation.  For  the  EM  algorithm  see 
especially  Amemiya  (1985,  pp.  375-378).  For  SA  see  Goffe  et  al.  (1994). 


- Exercises - 

10-1 Consider calculation of the MLE in the logit regression model when the only regressor
is the intercept. Then $\mathrm{E}[y] = 1/(1 + e^{-\beta})$ and the gradient of the scaled
log-likelihood function $g(\beta) = (\bar{y} - 1/(1 + e^{-\beta}))$. Suppose a sample yields $\bar{y} =
0.8$ and the starting value is $\beta = 0.0$.

(a) Calculate $\beta$ for the first six iterations of the Newton-Raphson algorithm.

(b) Calculate the first six iterations of a gradient algorithm that sets $A_s = 1$ in
(10.1), so $\hat{\beta}_{s+1} = \hat{\beta}_s + \hat{g}_s$.

(c)  Compare  the  performance  of  the  methods  in  parts  (a)  and  (b). 

10-2 Consider the nonlinear regression model $y = \alpha x_1 + \gamma/(x_2 - \delta) + u$, where $x_1$
and $x_2$ are exogenous regressors independent of the iid error $u \sim \mathcal{N}[0, \sigma^2]$.

(a) Derive the equation for the Gauss-Newton algorithm for estimating $(\alpha, \gamma, \delta)$.

(b) Derive the equation for the Newton-Raphson algorithm for estimating
$(\alpha, \gamma, \delta)$.

(c)  Explain  the  importance  of  not  arbitrarily  choosing  the  starting  values  of  the 
algorithm. 

10-3 Suppose that the pdf of $y$ has a $C$-component mixture form, $f(y|\boldsymbol{\pi}) =
\sum_{j=1}^{C} \pi_j f_j(y)$, where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_C)$, $\pi_j > 0$, $\sum_{j=1}^{C} \pi_j = 1$. The $\pi_j$ are
unknown mixing proportions whereas the parameters of the densities $f_j(y)$ are
presumed known.

(a) Given a random sample on $y_i$, $i = 1, \ldots, N$, write the general log-likelihood
function and obtain the first-order conditions for $\hat{\boldsymbol{\pi}}_{ML}$. Verify that there is no
explicit solution for $\hat{\boldsymbol{\pi}}_{ML}$.

(b) Let $\mathbf{z}_i$ be a $C \times 1$ vector of latent categorical variables, $i = 1, \ldots, N$, such
that $z_{ji} = 1$ if $y_i$ comes from the $j$th component of the mixture and $z_{ji} = 0$
otherwise. Write down the likelihood function in terms of the observed and
latent variables as if the latent variable were observed.

(c) Devise an EM algorithm for estimating $\boldsymbol{\pi}$. [Hint: If $z_{ji}$ were observable the
MLE of $\pi_j = N^{-1} \sum_i z_{ji}$. The E step requires calculation of $\mathrm{E}[z_{ji}|y_i]$; the M
step requires replacing $z_{ji}$ by $\mathrm{E}[z_{ji}|y_i]$ and then solving for $\boldsymbol{\pi}$.]

10-4 Let $(y_{1i}, y_{2i})$, $i = 1, \ldots, N$, have a bivariate normal distribution with mean
$(\mu_1, \mu_2)$ and covariance parameters $(\sigma_{11}, \sigma_{12}, \sigma_{22})$ and correlation coefficient
$\rho$. Suppose that all $N$ observations on $y_1$ are available but there are $m < N$
missing observations on $y_2$. Using the fact that the marginal distribution of $y_j$
is $\mathcal{N}[\mu_j, \sigma_{jj}]$, and that conditionally $y_2|y_1 \sim \mathcal{N}[\mu_{2.1}, \sigma_{22.1}]$, where $\mu_{2.1} = \mu_2 +
(\sigma_{12}/\sigma_{11})(y_1 - \mu_1)$, $\sigma_{22.1} = (1 - \rho^2)\sigma_{22}$, devise an EM algorithm for imputing the
missing observations on $y_2$.


PART THREE

Simulation-Based Methods


Part 1 emphasized that microeconometric models are frequently nonlinear models
estimated using large and heterogeneous data sets drawn from surveys that are complex
and subject to a variety of sampling biases. A realistic depiction of the economic
phenomena in such settings often requires the use of models for which estimation and
subsequent statistical inference are difficult. Advances in computing hardware and
software now make it feasible to tackle such tasks. Part 3 presents modern,
computer-intensive, simulation-based methods of estimation and inference that mitigate some of
these difficulties. The background required to cover this material varies somewhat with
the chapter, but the essential base is least squares and maximum likelihood estimation.

Chapter 11 presents bootstrap methods for statistical inference. These methods have
the attraction of providing a simple way to obtain standard errors when the formulae
from asymptotic theory are complex, as is the case, for example, for some two-step
estimators. Furthermore, if implemented appropriately, a bootstrap can lead to a more
refined asymptotic theory that may then lead to better statistical inference in small
samples.

Chapter 12 presents simulation-based estimation methods. These methods permit
estimation in situations where standard computational methods may not permit calculation
of an estimator, because of the presence of an integral over a probability distribution
that leads to no closed-form solution.

Chapter 13 surveys Bayesian methods that provide an approach to estimation and
inference that is quite different from the classical approach used in other chapters
of this book. Despite this different approach, in practice in large sample settings the
Bayesian approach produces similar results to those from classical methods. Further,
it often does so in a computationally more efficient manner.


CHAPTER 11


Bootstrap  Methods 


11.1.  Introduction 

Exact  finite-sample  results  are  unavailable  for  most  microeconometrics  estimators 
and  related  test  statistics.  The  statistical  inference  methods  presented  in  preceding 
chapters  rely  on  asymptotic  theory  that  usually  leads  to  limit  normal  and  chi-square 
distributions. 

An alternative approximation is provided by the bootstrap, due to Efron (1979,
1982). This approximates the distribution of a statistic by a Monte Carlo simulation,
with sampling done from the empirical distribution or the fitted distribution of the observed
data. The additional computation required is usually feasible given advances
in  computing  power.  Like  conventional  methods,  however,  bootstrap  methods  rely  on 
asymptotic  theory  and  are  only  exact  in  infinitely  large  samples. 

The  wide  range  of  bootstrap  methods  can  be  classified  into  two  broad  approaches. 
First, the simplest bootstrap methods can permit statistical inference when conventional
methods such as standard error computation are difficult to implement. Second,
more complicated bootstraps can have the additional advantage of providing asymptotic
refinements that can lead to a better approximation in finite samples. Applied
researchers  are  most  often  interested  in  the  first  aspect  of  the  bootstrap.  Theoreticians 
emphasize  the  second,  especially  in  settings  where  the  usual  asymptotic  methods  work 
poorly  in  finite  samples. 

The econometrics literature focuses on use of the bootstrap in hypothesis testing,
which relies on approximation of probabilities in the tails of the distributions
of statistics. Other applications are to confidence intervals, estimation of standard errors,
and bias reduction. The bootstrap is straightforward to implement for smooth
$\sqrt{N}$-consistent estimators based on iid samples, though bootstraps with asymptotic refinements
are underutilized. Caution is needed in other settings, including nonsmooth
estimators  such  as  the  median,  nonparametric  estimators,  and  inference  for  data  that 
are  not  iid. 

A reasonably self-contained summary of the bootstrap is provided in Section 11.2,
an example is given in Section 11.3, and some theory is provided in Section 11.4.
Further variations of the bootstrap are presented in Section 11.5. Section 11.6 presents
use  of  the  bootstrap  for  specific  types  of  data  and  specific  methods  used  often  in 
microeconometrics. 


11.2.  Bootstrap  Summary 

We summarize key bootstrap methods for estimator $\hat{\theta}$ and associated statistics based
on an iid sample $\{\mathbf{w}_1, \ldots, \mathbf{w}_N\}$, where usually $\mathbf{w}_i = (y_i, \mathbf{x}_i)$ and $\hat{\theta}$ is a smooth estimator
that is $\sqrt{N}$-consistent and asymptotically normally distributed. For notational
simplicity we generally present results for scalar $\theta$. For vector $\boldsymbol{\theta}$ in most instances
replace $\theta$ by $\theta_j$, the $j$th component of $\boldsymbol{\theta}$.

Statistics of interest include the usual regression output: the estimate $\hat{\theta}$; standard errors
$s_{\hat{\theta}}$; $t$-statistic $t = (\hat{\theta} - \theta_0)/s_{\hat{\theta}}$, where $\theta_0$ is the null hypothesis value; the associated
critical value or $p$-value for this statistic; and a confidence interval.

This section presents bootstraps for each of these statistics. Some motivation is also
provided, with the underlying theory sketched in Section 11.4.


11.2.1.  Bootstrap  without  Refinement 

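Consider estimation of the variance of the sample mean $\hat{\mu} = \bar{y} = N^{-1} \sum_{i=1}^{N} y_i$, where
the scalar random variable $y_i$ is iid $[\mu, \sigma^2]$, when it is not known that $\mathrm{V}[\hat{\mu}] = \sigma^2/N$.

The variance of $\hat{\mu}$ could be obtained by obtaining $S$ such samples of size $N$ from the
population, leading to $S$ sample means and hence $S$ estimates $\hat{\mu}_s = \bar{y}_s$, $s = 1, \ldots, S$.
Then we could estimate $\mathrm{V}[\hat{\mu}]$ by $(S - 1)^{-1} \sum_{s=1}^{S} (\hat{\mu}_s - \bar{\hat{\mu}})^2$, where $\bar{\hat{\mu}} = S^{-1} \sum_{s=1}^{S} \hat{\mu}_s$.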

Of course this approach is not possible, as we only have one sample. A bootstrap
can implement this approach by viewing the sample as the population. Then the finite
population is now the actual data $y_1, \ldots, y_N$. The distribution of $\hat{\mu}$ can be obtained
by drawing $B$ bootstrap samples from this population of size $N$, where each bootstrap
sample of size $N$ is obtained by sampling from $y_1, \ldots, y_N$ with replacement. This
leads to $B$ sample means and hence $B$ estimates $\hat{\mu}_b = \bar{y}_b$, $b = 1, \ldots, B$. Then estimate
$\mathrm{V}[\hat{\mu}]$ by $(B - 1)^{-1} \sum_{b=1}^{B} (\hat{\mu}_b - \bar{\hat{\mu}})^2$, where $\bar{\hat{\mu}} = B^{-1} \sum_{b=1}^{B} \hat{\mu}_b$. Sampling with
replacement may seem to be a departure from usual sampling methods, but in fact
standard sampling theory assumes sampling with replacement rather than without replacement
(see Section 24.2.2).

With additional information other ways to obtain bootstrap samples may be possible.
For example, if it is known that $y_i \sim \mathcal{N}[\mu, \sigma^2]$ then we could obtain $B$ bootstrap
samples of size $N$ by drawing from the $\mathcal{N}[\hat{\mu}, s^2]$ distribution. This bootstrap is an
example of a parametric bootstrap, whereas the preceding bootstrap was from the empirical
distribution.
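
Both bootstraps for the variance of the sample mean can be sketched as follows. The example assumes NumPy; the simulated exponential data, the sample size, and the choice $B = 999$ are arbitrary. The analytical estimate $s^2/N$ is computed only as a benchmark, since in this simple case the formula is known.

```python
import numpy as np

rng = np.random.default_rng(42)
N, B = 200, 999
y = rng.exponential(scale=2.0, size=N)        # illustrative iid sample

# Bootstrap from the empirical distribution: resample y with replacement.
idx = rng.integers(0, N, size=(B, N))         # B sets of N resampled indices
means_edf = y[idx].mean(axis=1)               # mu_hat_b, b = 1, ..., B
var_boot = means_edf.var(ddof=1)              # (B-1)^{-1} sum_b (mu_hat_b - average)^2

# Parametric bootstrap under the assumption y_i ~ N[mu, sigma^2].
draws = rng.normal(y.mean(), y.std(ddof=1), size=(B, N))
var_par = draws.mean(axis=1).var(ddof=1)

var_analytic = y.var(ddof=1) / N              # benchmark s^2 / N
print(var_boot, var_par, var_analytic)
```

The parametric version is shown even though the simulated data are not normal, since for the variance of a sample mean only the first two moments matter; for other statistics a misspecified parametric bootstrap can be misleading.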

More generally, for estimator $\hat{\theta}$ similar bootstraps can be used to, for example,
estimate $\mathrm{V}[\hat{\theta}]$ and hence standard errors when analytical formulas for $\mathrm{V}[\hat{\theta}]$ are complex.
Such bootstraps are usually valid for observations $\mathbf{w}_i$ that are iid over $i$, and they
have similar properties to estimates obtained using the usual asymptotic theory.


11.2.2. Asymptotic Refinements

In some settings it is possible to improve on the preceding bootstrap and obtain estimates
that are equivalent to those obtained using a more refined asymptotic theory
that may better approximate the finite-sample distribution of $\hat{\theta}$. Much of this chapter
is directed to such asymptotic refinements.

Usual asymptotic theory uses the result that $\sqrt{N}(\hat{\theta} - \theta_0) \stackrel{d}{\to} \mathcal{N}[0, \sigma^2]$. Thus

$\Pr[\sqrt{N}(\hat{\theta} - \theta_0)/\sigma \leq z] = \Phi(z) + R_1$,  (11.1)

where $\Phi(\cdot)$ is the standard normal cdf and $R_1$ is a remainder term that disappears as
$N \to \infty$.

This result is based on asymptotic theory detailed in Section 5.3 that includes application
of a central limit theorem. The CLT is based on a truncated power-series
expansion. The Edgeworth expansion, detailed in Section 11.4.3, includes additional
terms in the expansion. With one extra term this yields

$\Pr[\sqrt{N}(\hat{\theta} - \theta_0)/\sigma \leq z] = \Phi(z) + \frac{g_1(z)\phi(z)}{\sqrt{N}} + R_2$,  (11.2)

where $\phi(\cdot)$ is the standard normal density, $g_1(\cdot)$ is a bounded function given after
(11.13) in Section 11.4.3, and $R_2$ is a remainder term that disappears as $N \to \infty$.

The Edgeworth expansion is difficult to implement theoretically as the function
$g_1(\cdot)$ is data dependent in a complicated way. A bootstrap with asymptotic refinement
provides a simple computational method to implement the Edgeworth expansion. The
theory is given in Section 11.4.4.

Since $R_1 = O(N^{-1/2})$ and $R_2 = O(N^{-1})$, asymptotically $R_2 < R_1$, leading to a
better approximation as $N \to \infty$. However, in finite samples it is possible that $R_2 >
R_1$. A bootstrap with asymptotic refinement provides a better approximation asymptotically
that hopefully leads to a better approximation in samples of the finite sizes typically
used. Nevertheless, there is no guarantee and simulation studies are frequently
used to verify that finite-sample gains do indeed occur.


11.2.3.  Asymptotically  Pivotal  Statistic 

For asymptotic refinement to occur, the statistic being bootstrapped must be an asymptotically
pivotal statistic, meaning a statistic whose limit distribution does not depend
on unknown parameters. This result is explained in Section 11.4.4.

As an example, consider sampling from $y_i \sim [\mu, \sigma^2]$. Then the estimate $\hat{\mu} = \bar{y} \stackrel{a}{\sim}
\mathcal{N}[\mu, \sigma^2/N]$ is not asymptotically pivotal even given a null hypothesis value $\mu = \mu_0$
since its distribution depends on the unknown parameter $\sigma^2$. However, the studentized
statistic $t = (\hat{\mu} - \mu_0)/s_{\hat{\mu}} \stackrel{a}{\sim} \mathcal{N}[0, 1]$ is asymptotically pivotal.

Estimators are usually not asymptotically pivotal. However, conventional asymptotically
standard normal or chi-squared distributed test statistics, including Wald,
Lagrange multiplier, and likelihood ratio tests, and related confidence intervals, are
asymptotically pivotal.


11.2.4. The Bootstrap

In  this  section  we  provide  a  broad  description  of  the  bootstrap,  with  further  details 
given  in  subsequent  sections. 


Bootstrap  Algorithm 

A  general  bootstrap  algorithm  is  as  follows: 

1. Given data $w_1, \ldots, w_N$, draw a bootstrap sample of size N using a method given in the
following and denote this new sample $w_1^*, \ldots, w_N^*$.

2. Calculate an appropriate statistic using the bootstrap sample. Examples include (a) the
estimate $\hat{\theta}^*$ of $\theta$, (b) the standard error $s_{\hat{\theta}^*}$ of the estimate $\hat{\theta}^*$, and (c) a t-statistic
$t^* = (\hat{\theta}^* - \hat{\theta})/s_{\hat{\theta}^*}$ centered at the original estimate $\hat{\theta}$. Here $\hat{\theta}^*$ and $s_{\hat{\theta}^*}$ are calculated in
the usual way but using the new bootstrap sample rather than the original sample.

3. Repeat steps 1 and 2 B independent times, where B is a large number, obtaining B
bootstrap replications of the statistic of interest, such as $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ or $t_1^*, \ldots, t_B^*$.

4. Use these B bootstrap replications to obtain a bootstrapped version of the statistic, as
detailed in the following subsections.

Implementation can vary according to how bootstrap samples are obtained, how
many bootstraps are performed, what statistic is being bootstrapped, and whether or
not that statistic is asymptotically pivotal.
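The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the statistic is the sample mean, resampling is from the empirical distribution, and the function name `bootstrap_mean`, the seeds, and the simulated data are hypothetical choices that do not come from the text.

```python
import random
import statistics


def bootstrap_mean(data, n_boot=999, seed=0):
    """EDF bootstrap of the sample mean and of its t-statistic."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistics.fmean(data)            # original estimate
    theta_star, t_star = [], []
    for _ in range(n_boot):                       # step 3: repeat B times
        # Step 1: draw N observations with replacement.
        sample = rng.choices(data, k=n)
        # Step 2: recompute the estimate and its standard error.
        est = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        theta_star.append(est)
        t_star.append((est - theta_hat) / se)     # centered at theta_hat
    # Step 4 (standard errors, tests, intervals) works on these lists.
    return theta_hat, theta_star, t_star


rng = random.Random(42)
data = [rng.expovariate(1.0) for _ in range(50)]  # hypothetical sample
theta_hat, theta_star, t_star = bootstrap_mean(data)
```

The lists `theta_star` and `t_star` correspond to the replications produced in step 3; how they are used depends on the purpose of the bootstrap.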


Bootstrap  Sampling  Methods 

The  bootstrap  dgp  in  step  1  is  used  to  approximate  the  true  unknown  dgp. 

The simplest bootstrapping method is to use the empirical distribution of the data,
which treats the sample as being the population. Then $w_1^*, \ldots, w_N^*$ are obtained by
sampling with replacement from $w_1, \ldots, w_N$. In each bootstrap sample so obtained,
some of the original data points will appear multiple times whereas others will not
appear at all. This method is an empirical distribution function (EDF) bootstrap
or nonparametric bootstrap. It is also called a paired bootstrap since in
single-equation regression models $w_i = (y_i, x_i)$, so here both $y_i$ and $x_i$ are resampled.

Suppose the conditional distribution of the data is specified, say $y|x \sim F(x, \theta_0)$, and
an estimate $\hat{\theta} \stackrel{p}{\to} \theta_0$ is available. Then in step 1 we can instead form a bootstrap sample
by using the original $x_i$ while generating $y_i^*$ by random draws from $F(x_i, \hat{\theta})$. This
corresponds to regressors fixed in repeated samples (see Section 4.4.5). Alternatively,
we may first resample $x_i^*$ from $x_1, \ldots, x_N$ and then generate $y_i^*$ from $F(x_i^*, \hat{\theta})$, $i = 1, \ldots, N$.
Both are examples of a parametric bootstrap that can be applied in fully
parametric models.

For a regression model with additive iid error, say $y_i = g(x_i, \beta) + u_i$, we can form
fitted residuals $\hat{u}_1, \ldots, \hat{u}_N$, where $\hat{u}_i = y_i - g(x_i, \hat{\beta})$. Then in step 1 bootstrap from
these residuals to get a new draw of residuals, say $(\hat{u}_1^*, \ldots, \hat{u}_N^*)$, leading to a bootstrap
sample $(y_1^*, x_1), \ldots, (y_N^*, x_N)$, where $y_i^* = g(x_i, \hat{\beta}) + \hat{u}_i^*$. This bootstrap is called a
residual bootstrap. It uses information intermediate between the nonparametric and
parametric bootstrap. It can be applied if the error term has a distribution that does not
depend on unknown parameters.

We  emphasize  the  paired  bootstrap  on  grounds  of  its  simplicity,  applicability  to 
a  wide  range  of  nonlinear  models,  and  reliance  on  weak  distributional  assumptions. 
However,  the  other  bootstraps  generally  provide  a  better  approximation  (see  Horowitz, 
2001,  p.  3185)  and  should  be  used  if  the  stronger  model  assumptions  they  entail  are 
warranted. 
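As a rough illustration of how the sampling schemes differ in step 1, the following sketch contrasts the paired and residual bootstraps for a linear regression. It assumes NumPy is available; the linear model, the simulated data, and all function names are hypothetical and chosen only for the example.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y = X b + u."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def paired_sample(X, y, rng):
    """Paired bootstrap: resample (y_i, x_i) jointly with replacement."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]


def residual_sample(X, y, rng):
    """Residual bootstrap: keep x_i fixed, resample fitted residuals."""
    fitted = X @ ols(X, y)
    u_star = rng.choice(y - fitted, size=len(y), replace=True)
    return X, fitted + u_star


def bootstrap_se(X, y, sampler, n_boot=499, seed=0):
    """Standard deviation of the bootstrap coefficient replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([ols(*sampler(X, y, rng)) for _ in range(n_boot)])
    return draws.std(axis=0, ddof=1)


rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(size=200)          # iid additive error
se_paired = bootstrap_se(X, y, paired_sample)
se_resid = bootstrap_se(X, y, residual_sample)
```

With iid errors, as simulated here, the two sets of standard errors should be similar; the residual bootstrap relies on that assumption whereas the paired bootstrap does not.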


The  Number  of  Bootstraps 

The bootstrap asymptotics rely on $N \to \infty$ and so the bootstrap can be asymptotically
valid even for low B. However, clearly the bootstrap is more accurate as $B \to \infty$. A
sufficiently large value of B varies with one's tolerance for bootstrap-induced simulation
error and with the purpose of the bootstrap.

Andrews and Buchinsky (2000) present an application-specific numerical method
to determine the number of replications B needed to ensure a given level of accuracy
or, equivalently, the level of accuracy obtained for a given value of B. Let $\lambda$ denote
the quantity of interest, such as a standard error or a critical value, $\hat{\lambda}_\infty$ denote the ideal
bootstrap estimate with $B = \infty$, and $\hat{\lambda}_B$ denote the estimate with B bootstraps. Then
Andrews and Buchinsky (2000) show that

$$\sqrt{B}\,(\hat{\lambda}_B - \hat{\lambda}_\infty)/\hat{\lambda}_\infty \stackrel{d}{\to} \mathcal{N}[0, \omega],$$

where $\omega$ varies with the application and is defined in Table III of Andrews and Buchinsky
(2000). It follows that $\Pr[\delta \le z_{\tau/2}\sqrt{\omega/B}] = 1 - \tau$, where $\delta = |\hat{\lambda}_B - \hat{\lambda}_\infty|/\hat{\lambda}_\infty$
denotes the relative discrepancy caused by only B replications. Thus $B \ge \omega z_{\tau/2}^2/\delta^2$
ensures the relative discrepancy is less than $\delta$ with probability at least $1 - \tau$. Alternatively,
given B replications the relative discrepancy is less than $\delta = z_{\tau/2}\sqrt{\omega/B}$.

To provide concrete guidelines we propose the rule of thumb that

$$B = 384\,\omega.$$

This ensures that the relative discrepancy is less than 10% with probability at least
0.95, since $z_{.025}^2/0.1^2 = 384$. The only difficult part in implementation is estimation of
$\omega$, which varies with the application.

For standard error estimation $\omega = (2 + \gamma_4)/4$, where $\gamma_4$ is the coefficient of excess
kurtosis for the bootstrap estimator $\hat{\theta}^*$. Intuitively, fatter tails in the distribution of the
estimator mean outliers are more likely, contaminating standard error estimation. It
follows that B = 384 x (1/2) = 192 is enough if $\gamma_4 = 0$ whereas B = 960 is needed
if $\gamma_4 = 8$. These values are higher than those proposed by Efron and Tibshirani (1993,
p. 52), who state that B = 200 is almost always enough.

For a symmetric two-sided test or confidence interval at level $\alpha$,
$\omega = \alpha(1 - \alpha)/[2 z_{\alpha/2}\,\phi(z_{\alpha/2})]^2$. This leads to B = 348 for $\alpha = 0.05$ and B = 685 for $\alpha = 0.01$.
As expected more bootstraps are needed the further one goes into the tails of the
distribution.

For a one-sided test or nonsymmetric two-sided test or confidence interval at level
$\alpha$, $\omega = \alpha(1 - \alpha)/[z_\alpha\,\phi(z_\alpha)]^2$. This leads to B = 634 for $\alpha = 0.05$ and B = 989 for
$\alpha = 0.01$. More bootstraps are needed when testing in one tail. For chi-squared tests
with h degrees of freedom $\omega = \alpha(1 - \alpha)/[\chi_\alpha^2(h)\,f(\chi_\alpha^2(h))]^2$, where $f(\cdot)$ is the $\chi^2(h)$
density.

For test p-values $\omega = (1 - p)/p$. For example, if p = 0.05 then $\omega = 19$ and B =
7,296. Many more bootstraps are needed for precise calculation of the test p-value
compared to hypothesis rejection if a critical value is exceeded.

For bias-corrected estimation of $\theta$ a simple rule uses $\omega = \hat{\sigma}^2/\hat{\theta}^2$, where the
estimator $\hat{\theta}$ has standard error $\hat{\sigma}$. For example, if the usual t-statistic $t = \hat{\theta}/\hat{\sigma} = 2$ then
$\omega = 1/4$ and B = 96. Andrews and Buchinsky (2000) provide many more details and
refinements of these results.
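The rule-of-thumb values quoted above can be checked numerically. The sketch below uses only the Python standard library; the function names are hypothetical, and the constant 384 is taken from the text rather than recomputed, so results agree with the quoted values only up to rounding.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def replications(omega, scale=384):
    """Rule of thumb B = 384 * omega (10% discrepancy, probability 0.95)."""
    return scale * omega


def omega_se(excess_kurtosis):
    """Standard error estimation: omega = (2 + gamma_4) / 4."""
    return (2 + excess_kurtosis) / 4


def omega_symmetric(alpha):
    """Symmetric two-sided test or confidence interval at level alpha."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return alpha * (1 - alpha) / (2 * z * STD_NORMAL.pdf(z)) ** 2


def omega_one_sided(alpha):
    """One-sided or nonsymmetric two-sided test at level alpha."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return alpha * (1 - alpha) / (z * STD_NORMAL.pdf(z)) ** 2


def omega_p_value(p):
    """Test p-value: omega = (1 - p) / p."""
    return (1 - p) / p


cases = {
    "se, gamma4 = 0": omega_se(0),
    "se, gamma4 = 8": omega_se(8),
    "symmetric, alpha = 0.05": omega_symmetric(0.05),
    "symmetric, alpha = 0.01": omega_symmetric(0.01),
    "one-sided, alpha = 0.05": omega_one_sided(0.05),
    "one-sided, alpha = 0.01": omega_one_sided(0.01),
    "p-value, p = 0.05": omega_p_value(0.05),
    "bias correction, t = 2": 1 / 2 ** 2,
}
for label, omega in cases.items():
    print(f"{label}: B is about {replications(omega):.1f}")
```

Replacing `scale` by $z_{\tau/2}^2/\delta^2$ gives the number of replications for other choices of the discrepancy $\delta$ and probability $1 - \tau$.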

For hypothesis testing, Davidson and MacKinnon (2000) provide an alternative
approach. They focus on the loss of power caused by bootstrapping with finite B.
(Note that there is no power loss if $B = \infty$.) On the basis of simulations they recommend
at least B = 399 for tests at level 0.05, and at least B = 1,499 for tests at level
0.01. They argue that for testing their approach is superior to that of Andrews and
Buchinsky.

Several other papers by Davidson and MacKinnon, summarized in MacKinnon
(2002), emphasize practical considerations in bootstrap inference. For hypothesis testing
at level $\alpha$ choose B so that $\alpha(B + 1)$ is an integer. For example, at $\alpha = 0.05$ let
B = 399 rather than 400. If instead B = 400 it is unclear on an upper one-sided
alternative test whether the 20th or 21st largest bootstrap t-statistic is the critical value.
For nonlinear models computation can be reduced by performing only a few
Newton-Raphson iterations in each bootstrap sample from starting values equal to the initial
parameter estimates.


11.2.5.  Standard  Error  Estimation 

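The bootstrap estimate of variance of an estimator is the usual formula for estimating
a variance, applied to the B bootstrap replications $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$:

$$s^2_{\hat{\theta},\mathrm{Boot}} = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{\theta}_b^* - \bar{\hat{\theta}}^*)^2, \qquad (11.3)$$

where

$$\bar{\hat{\theta}}^* = B^{-1} \sum_{b=1}^{B} \hat{\theta}_b^*. \qquad (11.4)$$

Taking the square root yields $s_{\hat{\theta},\mathrm{Boot}}$, the bootstrap estimate of the standard error.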

This bootstrap provides no asymptotic refinement. Nonetheless, it can be
extraordinarily useful when it is difficult to obtain standard errors using conventional
methods. There are many examples. The estimate $\hat{\theta}$ may be a sequential two-step
m-estimator whose standard error is difficult to compute using the results given in
Section 6.8. The estimate $\hat{\theta}$ may be a 2SLS estimator estimated using a package that
only reports standard errors assuming homoskedastic errors but the errors are actually
heteroskedastic. The estimate $\hat{\theta}$ may be a function of other parameters that are
actually estimated, for example, $\hat{\theta} = \hat{\alpha}/\hat{\beta}$, and the bootstrap can be used instead of
the delta method. For clustered data with many small clusters, such as short panels,
cluster-robust standard errors can be obtained by resampling the clusters.
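As one illustration of the ratio case, the sketch below bootstraps the standard error of a ratio of two sample means and compares it with a delta-method calculation. The data, the seeds, and the function names are hypothetical; the estimator (a ratio of means) is an assumption made so that the example is self-contained.

```python
import random
import statistics


def ratio(pairs):
    """theta_hat = alpha_hat / beta_hat, each estimated by a sample mean."""
    a = statistics.fmean(p[0] for p in pairs)
    b = statistics.fmean(p[1] for p in pairs)
    return a / b


def bootstrap_se(pairs, stat, n_boot=999, seed=0):
    """Standard deviation of the bootstrap replications of stat."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = [stat(rng.choices(pairs, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)        # uses the divisor B - 1


def delta_method_se(pairs):
    """Delta-method standard error for the ratio of two sample means."""
    n = len(pairs)
    a = [p[0] for p in pairs]
    b = [p[1] for p in pairs]
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    cab = sum((x - ma) * (y - mb) for x, y in pairs) / (n - 1)
    # Gradient of a/b with respect to (a, b) is (1/b, -a/b**2).
    var = (va / mb**2 - 2 * ma * cab / mb**3 + ma**2 * vb / mb**4) / n
    return var ** 0.5


rng = random.Random(3)
pairs = [(rng.gauss(2.0, 1.0), rng.gauss(4.0, 1.0)) for _ in range(400)]
se_boot = bootstrap_se(pairs, ratio)
se_delta = delta_method_se(pairs)
```

Both numbers estimate the same asymptotic standard error, so they should be close in a sample of this size; the bootstrap version avoids deriving the gradient.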

Since the bootstrap estimate $s_{\hat{\theta},\mathrm{Boot}}$ is consistent, it can be used in place of $s_{\hat{\theta}}$ in
the usual asymptotic formula to form confidence intervals and hypothesis tests that
are asymptotically valid. Thus asymptotic statistical inference is possible in settings
where it is difficult to obtain standard errors by other methods. However, there will be
no improvement in finite-sample performance. To obtain an asymptotic refinement
the methods of the next section are needed.


11.2.6.  Hypothesis  Testing 

Here we consider tests on an individual coefficient, denoted $\theta$. The test may be either
an upper one-tailed alternative of $H_0: \theta \le \theta_0$ against $H_a: \theta > \theta_0$ or a two-sided test
of $H_0: \theta = \theta_0$ against $H_a: \theta \ne \theta_0$. Other tests are deferred to Section 11.6.3.


Tests  with  Asymptotic  Refinement 

The usual test statistic $T_N = (\hat{\theta} - \theta_0)/s_{\hat{\theta}}$ provides the potential for asymptotic
refinement, as it is asymptotically pivotal since its asymptotic standard normal distribution
does not depend on unknown parameters. We perform B bootstrap replications
producing B test statistics $t_1^*, \ldots, t_B^*$, where

$$t_b^* = (\hat{\theta}_b^* - \hat{\theta})/s_{\hat{\theta}_b^*}. \qquad (11.5)$$

The estimates $t_b^*$ are centered around the original estimate $\hat{\theta}$ since resampling is
from a distribution centered around $\hat{\theta}$. The empirical distribution of $t_1^*, \ldots, t_B^*$,
ordered from smallest to largest, is then used to approximate the distribution of $T_N$ as
follows.

For an upper one-tailed alternative test the bootstrap critical value (at level $\alpha$)
is the upper $\alpha$ quantile of the B ordered test statistics. For example, if B = 999 and
$\alpha = 0.05$ then the critical value is the 950th value of the ordered $t^*$ (counting from the
smallest), since then $(B + 1)(1 - \alpha) = 950$. For a similar lower tail one-sided test the
critical value is the 50th smallest value of $t^*$.

One can also compute a bootstrap p-value in the obvious way. For example, if the
original statistic t lies between the 914th and 915th values of the 999 ordered bootstrap
replicates (counting from the smallest) then the p-value for an upper one-tailed
alternative test is $1 - 914/(B + 1) = 0.086$.

For a two-sided test a distinction needs to be made between symmetrical and
nonsymmetrical tests. For a nonsymmetrical test or equal-tailed test the bootstrap
critical values (at level $\alpha$) are the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the ordered
test statistics $t^*$, and the null hypothesis is rejected at level $\alpha$ if the original t-statistic
lies outside this range. For a symmetrical test we instead order $|t^*|$ and the bootstrap
critical value (at level $\alpha$) is the upper $\alpha$ quantile of the ordered $|t^*|$. The null hypothesis
is rejected at level $\alpha$ if $|t|$ exceeds this critical value.

These tests, using the percentile-t method, provide asymptotic refinements. For a
one-sided t-test and for a nonsymmetrical two-sided t-test the true size of the test is
$\alpha + O(N^{-1/2})$ with standard asymptotic critical values and $\alpha + O(N^{-1})$ with bootstrap
critical values. For a two-sided symmetrical t-test or for an asymptotic chi-square
test the asymptotic approximations work better, and the true size of the test
is $\alpha + O(N^{-1})$ using standard asymptotic critical values and $\alpha + O(N^{-2})$ using bootstrap
critical values.
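The order-statistic bookkeeping for the percentile-t tests can be sketched as follows. The function name and the returned dictionary keys are hypothetical, and the artificial replications in the usage lines are chosen only so that the order statistics are easy to verify by hand.

```python
def percentile_t_test(t_stat, t_star, alpha=0.05):
    """Percentile-t critical values and decisions from B replications."""
    b = len(t_star)
    k = round(alpha * (b + 1))            # e.g. 50 for B = 999, alpha = 0.05
    if abs(k - alpha * (b + 1)) > 1e-9 or k % 2:
        raise ValueError("choose B so that alpha * (B + 1) / 2 is an integer")
    ts = sorted(t_star)
    abs_ts = sorted(abs(t) for t in t_star)
    upper = ts[b - k]                     # the (B+1)(1-alpha)-th smallest t*
    lower_eq, upper_eq = ts[k // 2 - 1], ts[b - k // 2]
    symmetric = abs_ts[b - k]             # upper alpha quantile of |t*|
    below = sum(t < t_stat for t in ts)   # replications below the original t
    return {
        "reject_upper_one_sided": t_stat > upper,
        "reject_equal_tailed": t_stat < lower_eq or t_stat > upper_eq,
        "reject_symmetric": abs(t_stat) > symmetric,
        "p_upper_one_sided": 1 - below / (b + 1),
        "critical_values": (upper, lower_eq, upper_eq, symmetric),
    }


# Artificial replications -4.99, -4.98, ..., 4.99 (B = 999).
t_star = [(i - 500) / 100 for i in range(1, 1000)]
result = percentile_t_test(4.145, t_star)
```

Here the original statistic 4.145 lies between the 914th and 915th ordered replications, so the upper one-tailed p-value matches the 0.086 of the example in the text.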


Tests  without  Asymptotic  Refinement 

Alternative  bootstrap  methods  can  be  used  that  although  asymptotically  valid  do  not 
provide  an  asymptotic  refinement. 

One approach already mentioned at the end of Section 11.2.5 is to compute
$t = (\hat{\theta} - \theta_0)/s_{\hat{\theta},\mathrm{boot}}$, where the bootstrap estimate $s_{\hat{\theta},\mathrm{boot}}$ given in (11.3) replaces the usual
estimate $s_{\hat{\theta}}$, and compare this test statistic to critical values from the standard normal
distribution.

A second approach, exposited here for a two-sided test of $H_0: \theta = \theta_0$ against
$H_a: \theta \ne \theta_0$, finds the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the bootstrap estimates
$\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ and rejects $H_0$ if $\theta_0$ falls outside this region. This is called the percentile
method. Asymptotic refinement is obtained by using $t_b^*$ in (11.5) that centers around
$\hat{\theta}$ rather than $\theta_0$ and using a different standard error $s_{\hat{\theta}^*}$ in each bootstrap.

These two bootstraps have the attraction of not requiring computation of $s_{\hat{\theta}}$, the
usual standard error estimate based on asymptotic theory.


11.2.7.  Confidence  Intervals 

Much  of  the  statistics  literature  considers  confidence  interval  estimation  rather  than  its 
flip  side  of  hypothesis  tests.  Here  instead  we  began  with  hypothesis  tests,  so  only  a 
brief  presentation  of  confidence  intervals  is  necessary. 

An asymptotic refinement is based on the t-statistic, which is asymptotically pivotal.
Thus from steps 1-3 in Section 11.2.4 we obtain bootstrap replication t-statistics
$t_1^*, \ldots, t_B^*$. Then let $t^*_{[1-\alpha/2]}$ and $t^*_{[\alpha/2]}$ denote the lower and upper $\alpha/2$ quantiles of these
t-statistics. The percentile-t method $100(1 - \alpha)$ percent confidence interval is

$$\left(\hat{\theta} - t^*_{[1-\alpha/2]} \times s_{\hat{\theta}},\ \hat{\theta} + t^*_{[\alpha/2]} \times s_{\hat{\theta}}\right), \qquad (11.6)$$

where $\hat{\theta}$ and $s_{\hat{\theta}}$ are the estimate and standard error from the original sample.

An  alternative  is  the  bias-corrected  and  accelerated  (BCa)  method  detailed  in 
Efron  (1987).  This  offers  an  asymptotic  refinement  in  a  wider  class  of  problems  than 
the  percentile-t  method. 

Other methods provide an asymptotically valid confidence interval, but without
asymptotic refinement. First, one can use the bootstrap estimate of the standard
error in the usual confidence interval formula, leading to interval
$(\hat{\theta} - z_{[1-\alpha/2]} \times s_{\hat{\theta},\mathrm{boot}},\ \hat{\theta} + z_{[\alpha/2]} \times s_{\hat{\theta},\mathrm{boot}})$. Second, the percentile method confidence interval is the
interval between the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the B bootstrap estimates
$\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ of $\theta$.


11.2.8.  Bias  Reduction 

Nonlinear estimators are usually biased in finite samples, though this bias goes to zero
asymptotically if the estimator is consistent. For example, if $\mu^3$ is estimated by $\hat{\theta} = \bar{y}^3$,
where $y_i$ is iid $[\mu, \sigma^2]$, then $E[\hat{\theta} - \mu^3] = 3\mu\sigma^2/N + E[(y - \mu)^3]/N^2$.

More generally, for a $\sqrt{N}$-consistent estimator

$$E[\hat{\theta} - \theta_0] = \frac{a_N}{N} + \frac{b_N}{N^2} + \frac{c_N}{N^3} + \cdots, \qquad (11.7)$$

where $a_N$, $b_N$, and $c_N$ are bounded constants that vary with the data and estimator (see
Hall, 1992, p. 53). An alternative estimator $\tilde{\theta}$ provides an asymptotic refinement if

$$E[\tilde{\theta} - \theta_0] = \frac{B_N}{N^2} + \frac{C_N}{N^3} + \cdots, \qquad (11.8)$$

where $B_N$ and $C_N$ are bounded constants. For both estimators the bias disappears as
$N \to \infty$. The latter estimator has the attraction that the bias goes to zero at a faster
rate, and hence it is an asymptotic refinement, though in finite samples it is possible
that $(B_N/N^2) > (a_N/N + b_N/N^2)$.

We wish to estimate the bias $E[\hat{\theta}] - \theta$. This is the distance between the expected
value or population average value of the parameter and the parameter value generating
the data. The bootstrap replaces the population with the sample, so that the bootstrap
samples are generated by parameter $\hat{\theta}$, which has average value $\bar{\hat{\theta}}^*$ over the bootstraps.

The bootstrap estimate of the bias is then

$$\mathrm{Bias}_{\hat{\theta}} = (\bar{\hat{\theta}}^* - \hat{\theta}), \qquad (11.9)$$

where $\bar{\hat{\theta}}^*$ is defined in (11.4).

Suppose, for example, that $\hat{\theta} = 4$ and $\bar{\hat{\theta}}^* = 5$. Then the estimated bias is (5 - 4) =
1, an upward bias of 1. Since $\hat{\theta}$ overestimates by 1, bias correction requires subtracting
1 from $\hat{\theta}$, giving a bias-corrected estimate of 3. More generally, the bootstrap
bias-corrected estimator of $\theta$ is

$$\hat{\theta}_{\mathrm{Boot}} = \hat{\theta} - (\bar{\hat{\theta}}^* - \hat{\theta}) = 2\hat{\theta} - \bar{\hat{\theta}}^*. \qquad (11.10)$$

Note that $\bar{\hat{\theta}}^*$ itself is not the bias-corrected estimate. For more details on the direction
of the correction, which may seem puzzling, see Efron and Tibshirani (1993, p. 138).
For typical $\sqrt{N}$-consistent estimators the asymptotic bias of $\hat{\theta}$ is $O(N^{-1})$ whereas the
asymptotic bias of $\hat{\theta}_{\mathrm{Boot}}$ is instead $O(N^{-2})$.
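A minimal sketch of (11.9) and (11.10) follows, first with the numbers of the illustration above and then for the cube of the sample mean. The function name, the seed, and the simulated exponential data are hypothetical.

```python
import random
import statistics


def bias_corrected(theta_hat, theta_star):
    """Bootstrap bias estimate (11.9) and bias-corrected estimate (11.10)."""
    bias = statistics.fmean(theta_star) - theta_hat
    return bias, theta_hat - bias        # same as 2 * theta_hat - mean


# Illustration from the text: theta_hat = 4 and bootstrap mean 5.
print(bias_corrected(4.0, [4.0, 5.0, 6.0]))     # (1.0, 3.0)

# Cube of the sample mean as an estimator of mu**3 (hypothetical data).
rng = random.Random(7)
y = [rng.expovariate(1.0) for _ in range(20)]
theta_hat = statistics.fmean(y) ** 3
theta_star = [
    statistics.fmean(rng.choices(y, k=len(y))) ** 3 for _ in range(1999)
]
bias, theta_bc = bias_corrected(theta_hat, theta_star)
```

For the cube of the mean the leading bias term $3\mu\sigma^2/N$ is positive when $\mu > 0$, so the bootstrap bias estimate is expected to be positive and the corrected estimate to lie below $\hat{\theta}$.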

In practice bias correction is seldom used for $\sqrt{N}$-consistent estimators, as the bootstrap
estimate can be more variable than the original estimate $\hat{\theta}$ and the bias is often
small relative to the standard error of the estimate. Bootstrap bias correction is used
for estimators that converge at rate less than $\sqrt{N}$, notably nonparametric regression
and density estimators.


11.3.  Bootstrap  Example 

As a bootstrap example, consider the exponential regression model introduced in
Section 5.9. Here the data are generated from an exponential distribution with an
exponential mean with two regressors:

$$y_i \mid \mathbf{x}_i \sim \mathrm{exponential}(\lambda_i), \quad i = 1, \ldots, 50,$$

$$\lambda_i = \exp(\beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i}),$$

$$(x_{2i}, x_{3i}) \sim \mathcal{N}[0.1, 0.1;\ 0.1^2, 0.1^2, 0.005],$$

$$(\beta_1, \beta_2, \beta_3) = (-2, 2, 2).$$

Maximum likelihood estimation on a sample of 50 observations yields $\hat{\beta}_1 = -2.192$;
$\hat{\beta}_2 = 0.267$, $s_2 = 1.417$, and $t_2 = 0.188$; and $\hat{\beta}_3 = 4.664$, $s_3 = 1.741$, and
$t_3 = 2.679$. For this ML example the standard errors were based on $-\hat{\mathbf{A}}^{-1}$, minus the
inverse of the estimated Hessian matrix.

We concentrate on statistical inference for $\beta_3$ and demonstrate the bootstrap for
standard error computation, test of statistical significance, confidence intervals, and
bias correction. The differences between bootstrap and usual asymptotic estimates are
relatively small in this example and can be much larger in other examples.

The results reported here are based on the paired bootstrap (see Section 11.2.4) with
$(y_i, x_{2i}, x_{3i})$ jointly resampled with replacement B = 999 times. From Table 11.1, the
999 bootstrap replication estimates $\hat{\beta}^*_{3,b}$, $b = 1, \ldots, 999$, had mean 4.716 and standard
deviation of 1.939. Table 11.1 also gives key percentiles for $\hat{\beta}_3^*$ and $t^*$ (defined in the
following).

A parametric bootstrap could have been used instead. Then bootstrap samples
would be obtained by drawing $y_i$ from the exponential distribution with parameter
$\exp(\hat{\beta}_1 + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i})$. In the case of tests of $H_0: \beta_3 = 0$ the exponential parameter
could instead be $\exp(\tilde{\beta}_1 + \tilde{\beta}_2 x_{2i})$, where $\tilde{\beta}_1$ and $\tilde{\beta}_2$ are then the restricted ML
estimates from the original sample.

Standard errors: From (11.3) the bootstrap estimate of standard error is computed
using the usual standard deviation formula for the 999 bootstrap replication estimates
of $\beta_3$. This yields estimate 1.939 compared to the usual asymptotic standard
error estimate of 1.741. Note that this bootstrap offers no refinement and would
only be used as a check or if finding the standard error by other means proved
difficult.
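A paired bootstrap of this kind can be sketched as follows, assuming NumPy is available. Several things here are assumptions rather than statements from the text: $\lambda_i$ is treated as the rate of the exponential distribution (so the conditional mean is $1/\lambda_i$), the ML estimator is computed by Newton-Raphson with step halving, and the seed and function names are hypothetical. Because the data are freshly simulated, the numbers will not reproduce those reported for the sample used in the text.

```python
import numpy as np


def simulate(n, beta, rng):
    """Draw (y, X) from the dgp in the text, treating lambda_i as a rate."""
    cov = [[0.1**2, 0.005], [0.005, 0.1**2]]
    x = rng.multivariate_normal([0.1, 0.1], cov, size=n)
    X = np.column_stack([np.ones(n), x])
    y = rng.exponential(scale=np.exp(-X @ beta))   # mean 1 / lambda_i
    return y, X


def loglik(beta, y, X):
    xb = X @ beta
    return np.sum(xb - np.exp(xb) * y)


def score_hessian(beta, y, X):
    lam_y = np.exp(X @ beta) * y
    return X.T @ (1.0 - lam_y), -(X * lam_y[:, None]).T @ X


def fit_ml(y, X, max_iter=100, tol=1e-10):
    """Newton-Raphson with step halving; returns (beta_hat, std_errors)."""
    beta = np.zeros(X.shape[1])
    beta[0] = -np.log(y.mean())                    # intercept-only MLE
    for _ in range(max_iter):
        grad, hess = score_hessian(beta, y, X)
        new = beta - np.linalg.solve(hess, grad)
        while loglik(new, y, X) < loglik(beta, y, X) - 1e-9:
            new = 0.5 * (new + beta)               # halve the step
        done = np.max(np.abs(new - beta)) < tol
        beta = new
        if done:
            break
    _, hess = score_hessian(beta, y, X)
    # Standard errors from minus the inverse of the estimated Hessian.
    return beta, np.sqrt(np.diag(np.linalg.inv(-hess)))


def paired_bootstrap(y, X, n_boot=999, seed=0):
    """Replications of beta* and of t* = (beta* - beta_hat) / se*."""
    rng = np.random.default_rng(seed)
    beta_hat, _ = fit_ml(y, X)
    betas = np.empty((n_boot, X.shape[1]))
    t_stars = np.empty_like(betas)
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # resample rows jointly
        betas[b], se_b = fit_ml(y[idx], X[idx])
        t_stars[b] = (betas[b] - beta_hat) / se_b
    return betas, t_stars


y, X = simulate(50, np.array([-2.0, 2.0, 2.0]), np.random.default_rng(10101))
beta_hat, se_hat = fit_ml(y, X)
betas, t_stars = paired_bootstrap(y, X)
print("beta3_hat:", beta_hat[2], "asymptotic se:", se_hat[2])
print("bootstrap se:", betas[:, 2].std(ddof=1))
```

The third columns of `betas` and `t_stars` play the roles of $\hat{\beta}_3^*$ and $t^*$ summarized in Table 11.1.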

Hypothesis testing with asymptotic refinement: We consider test of $H_0: \beta_3 = 0$
against $H_a: \beta_3 \ne 0$ at level 0.05. A test with asymptotic refinement is based on the
t-statistic, which is asymptotically pivotal. From Section 11.2.6 for each bootstrap
we compute $t^* = (\hat{\beta}_3^* - 4.664)/s_{\hat{\beta}_3^*}$, which is centered on the estimate $\hat{\beta}_3 = 4.664$
from the original sample. For a nonsymmetrical test the bootstrap critical values

Table 11.1. Bootstrap Statistical Inference on a Slope Coefficient: Example (a)

            beta3-hat*      t*      z = t(inf)     t(47)
Mean           4.716      0.026       1.021        1.000
SD (b)         1.939      1.047       1.000        1.021
1%            -0.336     -2.664      -2.326       -2.408
2.5%           0.501     -2.183      -1.960       -2.012
5%             1.545     -1.728      -1.645       -1.678
25%            3.570     -0.621      -0.675       -0.680
50%            4.772      0.062       0.000        0.000
75%            5.971      0.703       0.675        0.680
95%            7.811      1.706       1.645        1.678
97.5%          8.484      2.066       1.960        2.012
99.0%          9.427      2.529       2.326        2.408

(a) Summary statistics and percentiles based on 999 paired bootstrap resamples for
(1) the estimate $\hat{\beta}_3^*$; (2) the associated statistics $t^* = (\hat{\beta}_3^* - \hat{\beta}_3)/s_{\hat{\beta}_3^*}$; (3) standard normal
distribution; (4) Student t-distribution with 47 degrees of freedom. Original
dgp is one draw from the exponential distribution given in the text; the sample size
is 50.

(b) SD, standard deviation.


equal the lower and upper 2.5 percentiles of the 999 values of $t^*$, the 25th lowest
and 25th highest values. From Table 11.1 these are -2.183 and 2.066. Since the
t-statistic computed from the original sample $t_3 = (4.664 - 0)/1.741 = 2.679 >
2.066$, the null hypothesis is rejected. A symmetrical test that instead uses the upper
5 percentile of $|t^*|$ yields bootstrap critical value 2.078 that again leads to rejection
of $H_0$ at level 0.05.

The bootstrap critical values in this example exceed those using the asymptotic
approximation of either standard normal or t(47), an ad hoc finite-sample adjustment
motivated by the exact result for linear regression under normality. So the
usual asymptotic results in this example lead to overrejection and have actual size
that exceeds the nominal size. For example, at 5% the z critical region values
of (-1.960, 1.960) are smaller than the bootstrap critical values (-2.183, 2.066).
Figure 11.1 plots the bootstrap estimate based on $t^*$ of the density of the t-test,
smoothed using kernel methods, and compares it to the standard normal. The two
densities appear close, though the left tail is notably fatter for the bootstrap estimate.
Table 11.1 makes clearer the difference in the tails.

Hypothesis testing without asymptotic refinement: Alternative bootstrap testing
methods can be used but do not offer an asymptotic refinement. First, using the
bootstrap standard error estimate of 1.939, rather than the asymptotic standard error
estimate of 1.741, yields $t_3 = (4.664 - 0)/1.939 = 2.405$. This leads to rejection at
level 0.05 using either standard normal or t(47) critical values. Second, from Table
11.1, 95% of the bootstrap estimates lie in the range (0.501, 8.484), which does
not include the hypothesized value of 0, so again we reject $H_0: \beta_3 = 0$.

[Figure 11.1 plot, titled "Bootstrap Density of 't-Statistic'"; horizontal axis: t-statistic from each bootstrap replication.]

Figure 11.1: Bootstrap density of t-test statistic for slope equal to zero obtained from
999 bootstrap replications with standard normal density plotted for comparison. Data are
generated from an exponential distribution regression model.

Confidence intervals: An asymptotic refinement is obtained using the 95% percentile-t
confidence interval. Applying (11.6) yields (4.664 - 2.183 x 1.741, 4.664 +
2.066 x 1.741) or (0.864, 8.260). This compares to a conventional 95% asymptotic
confidence interval of 4.664 ± 1.960 x 1.741 or (1.25, 8.08).

Other confidence intervals can be constructed, but these do not have an asymptotic
refinement. Using the bootstrap standard error estimate leads to a 95% confidence
interval 4.664 ± 1.960 x 1.939 = (0.864, 8.464). The percentile method
uses the lower and upper 2.5 percentiles of the 999 bootstrap coefficient estimates,
leading to a 95% confidence interval of (0.501, 8.484).

Bias correction: The mean of the 999 bootstrap replication estimates of $\beta_3$ is
4.716, compared to the original estimate of 4.664. The estimated bias of (4.716 -
4.664) = 0.052 is quite small, especially compared to the standard error of $s_3$ =
1.741. The estimated bias is upward and (11.10) yields a bias-corrected estimate of
$\beta_3$ equal to 4.664 - 0.052 = 4.612.
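The arithmetic of this example can be retraced from the reported summary numbers alone. The sketch below is only a check on those calculations; the variable names are hypothetical, the percentile-t interval is computed in the same way as the text applies (11.6), and small differences in the third decimal arise because the inputs are rounded.

```python
beta3, se_asym, se_boot = 4.664, 1.741, 1.939   # estimate and standard errors
t_lo, t_hi = -2.183, 2.066       # 2.5% and 97.5% points of t* (Table 11.1)
b_lo, b_hi = 0.501, 8.484        # 2.5% and 97.5% points of the beta3 estimates
boot_mean = 4.716                # mean of the bootstrap estimates

# t-statistics for H0: beta3 = 0.
t_asym = (beta3 - 0) / se_asym
t_boot = (beta3 - 0) / se_boot

# 95% confidence intervals.
percentile_t = (beta3 + t_lo * se_asym, beta3 + t_hi * se_asym)
asymptotic = (beta3 - 1.960 * se_asym, beta3 + 1.960 * se_asym)
boot_se_ci = (beta3 - 1.960 * se_boot, beta3 + 1.960 * se_boot)
percentile = (b_lo, b_hi)

# Bias estimate and bias-corrected estimate, as in (11.9) and (11.10).
bias = boot_mean - beta3
beta3_bc = beta3 - bias

for name, ci in [("percentile-t", percentile_t), ("asymptotic", asymptotic),
                 ("bootstrap se", boot_se_ci), ("percentile", percentile)]:
    print(f"{name}: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"t = {t_asym:.3f}, t with bootstrap se = {t_boot:.3f}")
print(f"bias = {bias:.3f}, bias-corrected estimate = {beta3_bc:.3f}")
```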

The bootstrap relies on asymptotic theory and may actually provide a finite-sample
approximation worse than that of conventional methods. To determine that
the bootstrap is really an improvement here we need a full Monte Carlo analysis
with, say, 1,000 samples of size 50 drawn from the exponential dgp, with each of
these samples then bootstrapped, say, 999 times.


11.4.  Bootstrap  Theory 

The  exposition  here  follows  the  comprehensive  survey  of  Horowitz  (2001).  Key  results 
are  consistency  of  the  bootstrap  and,  if  the  bootstrap  is  applied  to  an  asymptotically 
pivotal  statistic,  asymptotic  refinement. 


11.4.1. The Bootstrap

We use $X_1, \ldots, X_N$ as generic notation for the data, where for notational simplicity
bold is not used for $X_i$ even though it is usually a vector, such as $(y_i, x_i)$. The data are
assumed to be independent draws from a distribution with cdf $F_0(x) = \Pr[X \le x]$. In
the simplest applications $F_0$ is in a finite-dimensional family, with $F_0 = F_0(x, \theta_0)$.

The statistic being considered is denoted $T_N = T_N(X_1, \ldots, X_N)$. The exact finite-sample
distribution of $T_N$ is $G_N = G_N(t, F_0) = \Pr[T_N \le t]$. The problem is to find a
good approximation to $G_N$.

Conventional asymptotic theory uses the asymptotic distribution of $T_N$, denoted
$G_\infty = G_\infty(t, F_0)$. This may theoretically depend on unknown $F_0$, in which case we
use a consistent estimate of $F_0$. For example, use $\hat{F}_0 = F_0(\cdot, \hat{\theta})$, where $\hat{\theta}$ is consistent
for $\theta_0$.

The empirical bootstrap takes a quite different approach to approximating
$G_N(\cdot, F_0)$. Rather than replace $G_N$ by $G_\infty$, the population cdf $F_0$ is replaced by a
consistent estimator $F_N$ of $F_0$, such as the empirical distribution of the sample.

$G_N(\cdot, F_N)$ cannot be determined analytically but can be approximated by
bootstrapping. One bootstrap resample with replacement yields the statistic
$T_N^* = T_N(X_1^*, \ldots, X_N^*)$. Repeating this step B independent times yields replications
$T^*_{N,1}, \ldots, T^*_{N,B}$. The empirical cdf of $T^*_{N,1}, \ldots, T^*_{N,B}$ is the bootstrap estimate of the
distribution of T, yielding

$$\hat{G}_N(t, F_N) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(T^*_{N,b} \le t), \qquad (11.11)$$

where $\mathbf{1}(A)$ equals one if event A occurs and equals zero otherwise. This is just the
proportion of the bootstrap resamples for which the realized $T_N^* \le t$.
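The estimate in (11.11) is simply an empirical cdf, which can be sketched as follows; the function name and the toy replications are hypothetical.

```python
import bisect


def bootstrap_cdf(t_star):
    """Return the function t -> proportion of replications with T* <= t."""
    reps = sorted(t_star)
    b = len(reps)

    def g_hat(t):
        # bisect_right counts the replications that are <= t.
        return bisect.bisect_right(reps, t) / b

    return g_hat


g_hat = bootstrap_cdf([0.5, -1.0, 2.0, 0.5])
print(g_hat(0.5))    # 0.75
```

Sorting the replications once makes each evaluation a binary search rather than a pass over all B values.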

The notation is summarized in Table 11.2.


11.4.2.  Consistency  of  the  Bootstrap 

The bootstrap estimate $\hat{G}_N(t, F_N)$ clearly converges to $G_N(t, F_N)$ as the number of
bootstraps $B \to \infty$. Consistency of the bootstrap estimate $\hat{G}_N(t, F_N)$ for $G_N(t, F_0)$


Table 11.2. Bootstrap Theory Notation

Quantity                      Notation
Sample (iid)                  $X_1, \ldots, X_N$, where $X_i$ is usually a vector
Population cdf of X           $F_0 = F_0(x, \theta_0) = \Pr[X \le x]$
Statistic of interest         $T_N = T_N(X_1, \ldots, X_N)$
Finite sample cdf of $T_N$    $G_N = G_N(t, F_0) = \Pr[T_N \le t]$
Limit cdf of $T_N$            $G_\infty = G_\infty(t, F_0)$
Asymptotic cdf of $T_N$       $\hat{G}_\infty = G_\infty(t, \hat{F}_0)$, where $\hat{F}_0 = F_0(x, \hat{\theta})$
Bootstrap cdf of $T_N$        $\hat{G}_N(t, F_N) = B^{-1} \sum_{b=1}^{B} \mathbf{1}(T^*_{N,b} \le t)$

therefore requires that

$$G_N(t, F_N) \stackrel{p}{\to} G_N(t, F_0),$$

uniformly in the statistic t and for all $F_0$ in the space of permitted cdfs.

Clearly, $F_N$ must be consistent for $F_0$. Additionally, smoothness in the dgp $F_0(x)$ is
needed, so that $F_N(x)$ and $F_0(x)$ are close to each other uniformly in the observations
x for large N. Moreover, smoothness in $G_N(\cdot, F)$, the cdf of the statistic considered as
a functional of F, is required so that $G_N(\cdot, F_N)$ is close to $G_N(\cdot, F_0)$ when N is large.

Horowitz  (2001,  pp.  3166-3168)  gives  two  formal  theorems,  one  general  and  one 
for  iid  data,  and  provides  examples  of  potential  failure  of  the  bootstrap,  including 
estimation  of  the  median  and  estimation  with  boundary  constraints  on  parameters. 

Subject to consistency of $F_N$ for $F_0$ and smoothness requirements on $F_0$ and $G_N$,
the bootstrap leads to consistent estimates and asymptotically valid inference. The
bootstrap is consistent in a very wide range of settings.


11.4.3.  Edgeworth  Expansions 

An  additional  attraction  of  the  bootstrap  is  that  it  allows  for  asymptotic  refinement. 
Singh  (1981)  provided  a  proof  using  Edgeworth  expansions,  which  we  now  introduce. 

Consider the asymptotic behavior of $Z_N = \sum_i X_i/\sqrt{N}$, where for simplicity $X_i$ are
standardized scalar random variables that are iid [0, 1]. Then application of a central
limit theorem leads to a limit standard normal distribution for $Z_N$. More precisely, $Z_N$
has cdf

$$G_N(z) = \Pr[Z_N \le z] = \Phi(z) + O(N^{-1/2}), \qquad (11.12)$$

where $\Phi(\cdot)$ is the standard normal cdf. The remainder term is ignored and regular
asymptotic theory approximates $G_N(z)$ by $G_\infty(z) = \Phi(z)$.

The CLT leading to (11.12) is formally derived by a simple approximation of the characteristic function of $Z_N$, $E[e^{isZ_N}]$, where $i = \sqrt{-1}$. A better approximation expands this characteristic function in powers of $N^{-1/2}$. The usual Edgeworth expansion adds two additional terms, leading to

    $G_N(z) = \Pr[Z_N \le z] = \Phi(z) + \frac{g_1(z)}{\sqrt{N}} + \frac{g_2(z)}{N} + O(N^{-3/2})$,    (11.13)

where $g_1(z) = -(z^2 - 1)\phi(z)\kappa_3/6$, $\phi(\cdot)$ denotes the standard normal density, $\kappa_3$ is the third cumulant of $Z_N$, and the lengthy expression for $g_2(\cdot)$ is given in Rothenberg (1984, p. 895) or Amemiya (1985, p. 93). In general the $r$th cumulant $\kappa_r$ is the $r$th coefficient in the series expansion $\ln(E[e^{isZ_N}]) = \sum_{r=0}^{\infty} \kappa_r (is)^r / r!$ of the log characteristic function or cumulant generating function.

The remainder term in (11.13) is ignored and an Edgeworth expansion approximates $G_N(z, F_0)$ by $G_\infty(z, F_0) = \Phi(z) + N^{-1/2} g_1(z) + N^{-1} g_2(z)$. If $Z_N$ is a test statistic this can be used to compute $p$-values and critical values. Alternatively, (11.13) can be inverted to

    $\Pr\left[ Z_N + \frac{h_1(z)}{\sqrt{N}} + \frac{h_2(z)}{N} \le z \right] \simeq \Phi(z)$,    (11.14)

for functions $h_1(z)$ and $h_2(z)$ given in Rothenberg (1984, p. 895). The left-hand side gives a modified statistic that will be better approximated by the standard normal than the original statistic $Z_N$.

The problem in application is that the cumulants of $Z_N$ are needed to evaluate the functions $g_1(z)$ and $g_2(z)$ or $h_1(z)$ and $h_2(z)$. It can be very difficult to obtain analytical expressions for these cumulants (e.g., Sargan, 1980, and Phillips, 1983). The bootstrap provides a numerical method to implement the Edgeworth expansion without the need to calculate cumulants, as shown in the following.
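
To give a feel for the size of the first Edgeworth correction, the sketch below compares $\Phi(z)$ and $\Phi(z) + g_1(z)/\sqrt{N}$ with the exact cdf in one case where the exact cdf is available in closed form. The setup is an assumption chosen for illustration: $X_i$ is a centered Exponential(1) variable, so $\sum_i X_i + N$ is Gamma($N$, 1), and $\kappa_3$ is read as the third cumulant of the standardized summand $X_i$ (equal to 2 here), so that the correction term is of order $N^{-1/2}$.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def exact_cdf(z, n):
    """Pr[Z_N <= z] when X_i = E_i - 1 and E_i ~ Exponential(1)."""
    x = n + z * math.sqrt(n)
    if x <= 0.0:
        return 0.0
    # Gamma(n, 1) cdf for integer n: 1 - exp(-x) * sum_{k<n} x^k / k!
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total


def edgeworth_one_term(z, n, kappa3):
    """Phi(z) + g1(z) / sqrt(N), with g1(z) = -(z^2 - 1) phi(z) kappa3 / 6."""
    g1 = -(z * z - 1.0) * normal_pdf(z) * kappa3 / 6.0
    return normal_cdf(z) + g1 / math.sqrt(n)


n, kappa3 = 10, 2.0  # third cumulant of a centered Exponential(1) variable
for z in (-2.0, 0.0, 2.0):
    print(z, exact_cdf(z, n), normal_cdf(z), edgeworth_one_term(z, n, kappa3))
```

At $z = \pm 1$ the correction $g_1(z)$ is zero, so the two approximations coincide there.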


11.4.4.  Asymptotic  Refinement  via  Bootstrap 

We now return to the more general setting of Section 11.4.1, with the additional assumption that $T_N$ has a limit normal distribution and usual $\sqrt{N}$ asymptotics apply.

Conventional asymptotic methods use the limit cdf $G_\infty(t, F_0)$ as an approximation to the true cdf $G_N(t, F_0)$. For $\sqrt{N}$-consistent asymptotically normal estimators this has an error that in the limit behaves as a multiple of $N^{-1/2}$. We write this as

    $G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2})$,    (11.15)

where in our example $G_\infty(t, F_0) = \Phi(t)$.

A better approximation is possible using an Edgeworth expansion. Then

    $G_N(t, F_0) = G_\infty(t, F_0) + \frac{g_1(t, F_0)}{\sqrt{N}} + \frac{g_2(t, F_0)}{N} + O(N^{-3/2})$.    (11.16)

Unfortunately, as already noted, the functions $g_1(\cdot)$ and $g_2(\cdot)$ on the right-hand side can be difficult to construct.

Now consider the bootstrap estimator $G_N(t, F_N)$. An Edgeworth expansion yields

    $G_N(t, F_N) = G_\infty(t, F_N) + \frac{g_1(t, F_N)}{\sqrt{N}} + \frac{g_2(t, F_N)}{N} + O(N^{-3/2})$;    (11.17)

see Hall (1992) for details. The bootstrap estimator $G_N(t, F_N)$ is used to approximate the finite-sample cdf $G_N(t, F_0)$. Subtracting (11.16) from (11.17), we get

    $G_N(t, F_N) - G_N(t, F_0) = [G_\infty(t, F_N) - G_\infty(t, F_0)] + \frac{[g_1(t, F_N) - g_1(t, F_0)]}{\sqrt{N}} + O(N^{-1})$.    (11.18)

Assume that $F_N$ is $\sqrt{N}$ consistent for the true cdf $F_0$, so that $F_N - F_0 = O(N^{-1/2})$. For continuous function $G_\infty$ the first term on the right-hand side of (11.18), $[G_\infty(t, F_N) - G_\infty(t, F_0)]$, is therefore $O(N^{-1/2})$, so $G_N(t, F_N) - G_N(t, F_0) = O(N^{-1/2})$.

The bootstrap approximation $G_N(t, F_N)$ is therefore in general no closer asymptotically to $G_N(t, F_0)$ than is the usual asymptotic approximation $G_\infty(t, F_0)$; see (11.15).

Now suppose the statistic $T_N$ is asymptotically pivotal, so that its asymptotic distribution $G_\infty$ does not depend on unknown parameters. Here this is the case if $T_N$ is standardized so that its limit distribution is the standard normal. Then $G_\infty(t, F_N) = G_\infty(t, F_0)$, so (11.18) simplifies to

    $G_N(t, F_N) - G_N(t, F_0) = N^{-1/2}[g_1(t, F_N) - g_1(t, F_0)] + O(N^{-1})$.    (11.19)

However, because $F_N - F_0 = O(N^{-1/2})$ we have that $[g_1(t, F_N) - g_1(t, F_0)] = O(N^{-1/2})$ for $g_1$ continuous in $F$. It follows upon simplification that $G_N(t, F_N) = G_N(t, F_0) + O(N^{-1})$. The bootstrap approximation $G_N(t, F_N)$ is now a better asymptotic approximation to $G_N(t, F_0)$ as the error is now $O(N^{-1})$.

In summary, for a bootstrap on an asymptotically pivotal statistic we have

    $G_N(t, F_0) = G_N(t, F_N) + O(N^{-1})$,    (11.20)

an improvement on the conventional approximation $G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2})$.

The bootstrap on an asymptotically pivotal statistic therefore leads to an improved small-sample performance in the following sense. Let $\alpha$ be the nominal size for a test procedure. Usual asymptotic theory produces $t$-tests with actual size $\alpha + O(N^{-1/2})$, whereas the bootstrap produces $t$-tests with actual size $\alpha + O(N^{-1})$.

For symmetric two-sided hypothesis tests and confidence intervals the bootstrap on an asymptotically pivotal statistic can be shown to have approximation error $O(N^{-3/2})$ compared to error $O(N^{-1})$ using usual asymptotic theory.
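
A minimal sketch of a symmetric two-sided bootstrap test based on an asymptotically pivotal statistic follows. It bootstraps the usual $t$-statistic for a population mean, recentering each bootstrap $t$-statistic at the original sample mean. The function names, the exponential test data, and the $(\text{count} + 1)/(B + 1)$ form of the $p$-value are assumptions for illustration rather than part of the text.

```python
import math
import random


def t_stat(sample, mu0):
    """Studentized mean: (ybar - mu0) / (s / sqrt(N))."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
    return (m - mu0) / math.sqrt(s2 / n)


def bootstrap_t_pvalue(sample, mu0, B=999, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    mean_hat = sum(sample) / n
    t_obs = t_stat(sample, mu0)
    exceed = 0
    for _ in range(B):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        # Center at the original estimate so the null holds in the bootstrap world.
        t_star = t_stat(resample, mean_hat)
        if abs(t_star) >= abs(t_obs):
            exceed += 1
    return (exceed + 1) / (B + 1)


# Hypothetical skewed data with true mean 1.
data_rng = random.Random(1)
sample = [data_rng.expovariate(1.0) for _ in range(30)]
print(bootstrap_t_pvalue(sample, mu0=1.0))
```

The sketch assumes continuous data, so that a resample with zero variance is practically impossible.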

The  preceding  results  are  restricted  to  asymptotically  normal  statistics.  For  chi- 
squared  distributed  test  statistics  the  asymptotic  gains  are  similar  to  those  for  sym¬ 
metric  two-sided  hypothesis  tests.  For  proof  of  bias  reduction  by  bootstrapping,  see 
Horowitz  (2001,  p.  3172). 

The theoretical analysis leads to the following points. The bootstrap should be from distribution $F_N$ consistent for $F_0$. The bootstrap requires smoothness and continuity in $F_0$ and $G_N$, so that a modification of the standard bootstrap is needed if, for example, there is a discontinuity because of a boundary constraint on the parameters such as $\theta > 0$. The bootstrap assumes existence of low-order moments, as low-order cumulants appear in the function $g_1$ in the Edgeworth expansions. Asymptotic refinement requires use of an asymptotically pivotal statistic. The bootstrap refinement presented assumes iid data, so that modification is needed even for heteroskedastic errors. For more complete discussion see Horowitz (2001).


11.4.5.  Power  of  Bootstrapped  Tests 

The  analysis  of  the  bootstrap  has  focused  on  getting  tests  with  correct  size  in  small 
samples.  The  size  correction  of  the  bootstrap  will  lead  to  changes  in  the  power  of  tests, 
as  will  any  size  correction. 

Intuitively, if the actual size of a test using first-order asymptotics exceeds the nominal size, then bootstrapping with asymptotic refinement will not only reduce the size toward the nominal size but, because of less frequent rejection, will also reduce the power of the test. Conversely, if the actual size is less than the nominal size then bootstrapping will increase test power. This is observed in the simulation exercise of Horowitz (1994, p. 409). Interestingly, in his simulation he finds that although bootstrapping first-order asymptotically equivalent tests leads to tests with similar actual size (essentially equal to the nominal size) there can be considerable difference in test power across the bootstrapped tests.


11.5.  Bootstrap  Extensions 

The bootstrap methods presented so far emphasize smooth $\sqrt{N}$-consistent asymptotically normal estimators based on iid data. The following extensions of the bootstrap permit for a wider range of applications a consistent bootstrap (Sections 11.5.1 and 11.5.2) or a consistent bootstrap with asymptotic refinement (Sections 11.5.3-11.5.5). The presentation of these more advanced methods is brief. Some are used in Section 11.6.


11.5.1.  Subsampling  Method 

The subsampling method uses a sample of size $m$ that is substantially smaller than the sample size $N$. The subsampling may be with replacement (Bickel, Götze, and van Zwet, 1997) or without replacement (Politis and Romano, 1994).

Replacement subsampling provides subsamples that are random samples of the population, rather than random samples of an estimate of the distribution such as the sample in the case of a paired bootstrap. Replacement subsampling can then be consistent when failure of the smoothness conditions discussed in Section 11.4.2 leads to inconsistency of a full sample bootstrap. The associated asymptotic error for testing or confidence intervals, however, is of higher order of magnitude than the usual $O(N^{-1/2})$ obtained when a full sample bootstrap without refinement can be used.

Subsample bootstraps are useful when full sample bootstraps are invalid, or as a way to verify that a full sample bootstrap is valid. Results will differ with the choice of subsample size. There is also a considerable increase in sampling error because a smaller fraction of the sample is being used. Indeed, we should have $(m/N) \to 0$ and $N \to \infty$. Politis, Romano, and Wolf (1999) and Horowitz (2001) provide further details.
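
The following sketch shows one way a subsampling confidence interval might be formed using subsamples drawn without replacement. It assumes a $\sqrt{N}$ rate of convergence, so that the distribution of $\sqrt{m}(\hat{\theta}_m - \hat{\theta}_N)$ across subsamples stands in for that of $\sqrt{N}(\hat{\theta}_N - \theta)$. The function name, the quantile indexing rule, and the use of the sample median are illustrative assumptions.

```python
import math
import random
import statistics


def subsample_ci(sample, statistic, m, alpha=0.05, n_subsamples=500, seed=0):
    """Confidence interval from subsamples of size m drawn without replacement."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = statistic(sample)
    roots = sorted(
        math.sqrt(m) * (statistic(rng.sample(sample, m)) - theta_hat)
        for _ in range(n_subsamples)
    )
    lower_q = roots[math.floor((alpha / 2) * (n_subsamples - 1))]
    upper_q = roots[math.ceil((1 - alpha / 2) * (n_subsamples - 1))]
    # Invert the approximate distribution of sqrt(N) * (theta_hat - theta).
    return theta_hat - upper_q / math.sqrt(n), theta_hat - lower_q / math.sqrt(n)


# Hypothetical data; the median is a statistic for which subsampling is often suggested.
data_rng = random.Random(3)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
print(subsample_ci(data, statistics.median, m=20))
```

Rerunning with different values of `m` gives a simple check of how sensitive the interval is to the subsample size.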


11.5.2.  Moving  Blocks  Bootstrap 

The moving blocks bootstrap is used for data that are dependent rather than independent. This splits the sample into $r$ nonoverlapping blocks of length $l$, where $rl \simeq N$. First, one samples with replacement from these blocks, to give $r$ new blocks, which will have a different temporal ordering from the original $r$ blocks. Then one estimates the parameters using this bootstrap sample.

The moving blocks method treats the randomly drawn blocks as being independent of each other, but allows dependence within the blocks. A similar blocking was actually used by Anderson (1971) to derive a central limit theorem for an $m$-dependent process. The moving blocks process requires $r \to \infty$ as $N \to \infty$ to ensure that we are likely to draw consecutive blocks uncorrelated with each other. It also requires the block length $l \to \infty$ as $N \to \infty$. See, for example, Götze and Künsch (1996).
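
The resampling step described above can be sketched as follows, using nonoverlapping blocks as in the text. The function name and the decision to drop any incomplete trailing block are assumptions for illustration.

```python
import random


def block_bootstrap_resample(series, block_length, rng):
    """Resample r nonoverlapping blocks of the given length with replacement."""
    r = len(series) // block_length  # complete blocks only; any remainder is dropped
    blocks = [series[j * block_length:(j + 1) * block_length] for j in range(r)]
    resample = []
    for _ in range(r):
        # Order within a block is preserved; order across blocks is random.
        resample.extend(blocks[rng.randrange(r)])
    return resample


series = list(range(20))  # hypothetical time-ordered observations
print(block_bootstrap_resample(series, block_length=5, rng=random.Random(0)))
```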

11.5.3.  Nested  Bootstrap 

A nested bootstrap, introduced by Hall (1986), Beran (1987), and Loh (1987), is a bootstrap within a bootstrap. This method is especially useful if the bootstrap is on a statistic that is not asymptotically pivotal. In particular, if the standard error of the estimate is difficult to compute one can bootstrap the current bootstrap sample to obtain a bootstrap standard error estimate $s_{\hat{\theta}^*,Boot}$ and form $t^* = (\hat{\theta}^* - \hat{\theta})/s_{\hat{\theta}^*,Boot}$, and then apply the percentile-$t$ method to the bootstrap replications $t^*_1, \ldots, t^*_B$. This permits asymptotic refinements where a single round of bootstrap would not.

More generally, iterated bootstrapping is a way to improve the performance of the bootstrap by estimating the errors (i.e., bias) that arise from a single pass of the bootstrap, and correcting for these errors. In general each further iteration of the bootstrap reduces bias by a factor $N^{-1}$ if the statistic is asymptotically pivotal and by a factor $N^{-1/2}$ otherwise. For a good exposition see Hall and Martin (1988). If $B$ bootstraps are performed at each iteration then $B^k$ bootstraps need to be performed if there are $k$ iterations. For this reason at most two iterations, called a double bootstrap or calibrated bootstrap, are done.
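
A sketch of the nested percentile-$t$ idea is given below: each outer bootstrap sample is itself bootstrapped to obtain a standard error, so the cost is the product of the outer and inner replication counts. The function names, the replication counts, and the quantile indexing are illustrative assumptions.

```python
import math
import random


def resample(data, rng):
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def boot_se(data, statistic, B, rng):
    """Bootstrap standard error from B resamples of the given data."""
    reps = [statistic(resample(data, rng)) for _ in range(B)]
    mean_rep = sum(reps) / B
    return math.sqrt(sum((r - mean_rep) ** 2 for r in reps) / (B - 1))


def nested_percentile_t_ci(data, statistic, B_outer=199, B_inner=50,
                           alpha=0.05, seed=0):
    rng = random.Random(seed)
    theta_hat = statistic(data)
    se_hat = boot_se(data, statistic, B_inner, rng)
    t_stars = []
    for _ in range(B_outer):
        star = resample(data, rng)
        # Inner bootstrap: standard error computed from the bootstrap sample itself.
        se_star = boot_se(star, statistic, B_inner, rng)
        t_stars.append((statistic(star) - theta_hat) / se_star)
    t_stars.sort()
    lower_t = t_stars[math.floor((alpha / 2) * (B_outer + 1)) - 1]
    upper_t = t_stars[math.ceil((1 - alpha / 2) * (B_outer + 1)) - 1]
    return theta_hat - upper_t * se_hat, theta_hat - lower_t * se_hat


def sample_mean(xs):
    return sum(xs) / len(xs)


data_rng = random.Random(5)
data = [data_rng.gauss(0.0, 1.0) for _ in range(30)]  # hypothetical data
print(nested_percentile_t_ci(data, sample_mean))
```

The sample mean is used only to keep the example short; the approach is intended for statistics whose standard error has no convenient formula.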

Davison,  Hinkley,  and  Schechtman  (1986)  proposed  balanced  bootstrapping.  This 
method  ensures  that  each  sample  observation  is  reused  exactly  the  same  number  of 
times  over  all  B  bootstraps,  leading  to  better  bootstrap  estimates.  For  implementation 
see  Gleason  (1988),  whose  algorithms  add  little  to  computational  time  compared  to 
the  usual  unbalanced  bootstrap. 
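
One simple way to obtain balanced resamples is to pool $B$ copies of the observation indices, permute the pool, and cut it into $B$ resamples of size $N$. The sketch below uses this permutation approach; it is not one of Gleason's algorithms, and the function name is hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap_indices(n, B, seed=0):
    """Return B index lists of length n in which every index is used exactly B times overall."""
    rng = random.Random(seed)
    pool = list(range(n)) * B  # each observation index appears exactly B times
    rng.shuffle(pool)
    return [pool[b * n:(b + 1) * n] for b in range(B)]


resamples = balanced_bootstrap_indices(n=5, B=4)
print(resamples)
print(Counter(i for resample in resamples for i in resample))
```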

11.5.4.  Recentering  and  Rescaling 

To yield an asymptotic refinement the bootstrap should be based on an estimate $\hat{F}$ of the dgp $F_0$ that imposes all the conditions of the model under consideration. A leading example arises with the residual bootstrap.

Least-squares residuals do not sum to zero in nonlinear models, or even in linear models if there is no intercept. The residual bootstrap (see Section 11.2.4) based on least-squares residuals will then fail to impose the restriction that $E[u_i] = 0$. The residual bootstrap should instead bootstrap the recentered residual $\hat{u}_i - \bar{u}$, where $\bar{u} = N^{-1} \sum_{i=1}^{N} \hat{u}_i$. Similar recentering should be done for paired bootstraps of GMM estimators in overidentified models (see Section 11.6.4).

Rescaling of residuals can also be useful. For example, in the linear regression model with iid errors resample from $(N/(N - K))^{1/2} \hat{u}_i$, since these have variance $s^2$. Other adjustments include using the standardized residual $\hat{u}_i / \sqrt{(1 - h_{ii}) s^2}$, where $h_{ii}$ is the $i$th diagonal entry in the projection matrix $X(X'X)^{-1}X'$.
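
The two adjustments can be combined in a few lines. The sketch below first recenters and then rescales by $(N/(N - K))^{1/2}$; applying both together, and the function name, are assumptions made for illustration.

```python
import math


def recenter_and_rescale(residuals, n_params):
    """Return (N / (N - K))^{1/2} * (u_i - u_bar) for each residual u_i."""
    n = len(residuals)
    u_bar = sum(residuals) / n
    scale = math.sqrt(n / (n - n_params))
    return [scale * (u - u_bar) for u in residuals]


# Hypothetical residuals from a model with K = 1 parameter and no intercept.
residuals = [0.5, -1.0, 2.0, 0.3]
adjusted = recenter_and_rescale(residuals, n_params=1)
print(adjusted, sum(adjusted))
```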

11.5.5.  The  Jackknife 

The bootstrap can be used for bias correction (see Section 11.2.8). An alternative resampling method is the jackknife, a precursor of the bootstrap. The jackknife uses $N$ deterministically defined subsamples of size $N - 1$ obtained by dropping in turn each of the $N$ observations and recomputing the estimator.

To see how the jackknife works, let $\hat{\theta}_N$ denote the estimate of $\theta$ using all $N$ observations, and let $\hat{\theta}_{N-1}$ denote the estimate of $\theta$ using the first $(N - 1)$ observations. If (11.7) holds then $E[\hat{\theta}_N] = \theta + a_N/N + b_N/N^2 + O(N^{-3})$ and $E[\hat{\theta}_{N-1}] = \theta + a_N/(N - 1) + b_N/(N - 1)^2 + O(N^{-3})$, which implies $E[N\hat{\theta}_N - (N - 1)\hat{\theta}_{N-1}] = \theta + O(N^{-2})$. Thus $N\hat{\theta}_N - (N - 1)\hat{\theta}_{N-1}$ has smaller bias than $\hat{\theta}_N$.

The estimator can be more variable, however, as it uses less of the data. As an extreme example, if $\hat{\theta} = \bar{y}$ then the new estimator is simply $y_N$, the $N$th observation. The variation can be reduced by dropping each observation in turn and averaging.

More formally then, consider the estimator $\hat{\theta}$ of a parameter vector $\theta$ based on a sample of size $N$ from iid data. For $i = 1, \ldots, N$ sequentially delete the $i$th observation and obtain $N$ jackknife replication estimates $\hat{\theta}_{(-i)}$ from the $N$ jackknife resamples of size $(N - 1)$. The jackknife estimate of the bias of $\hat{\theta}$ is $(N - 1)(\bar{\theta} - \hat{\theta})$, where $\bar{\theta} = N^{-1} \sum_i \hat{\theta}_{(-i)}$ is the average of the $N$ jackknife replications $\hat{\theta}_{(-i)}$. The bias appears large because of multiplication by $(N - 1)$, but the differences $(\hat{\theta}_{(-i)} - \hat{\theta})$ are much smaller than in the bootstrap case since a jackknife resample differs from the original sample in only one observation.

This leads to the bias-corrected jackknife estimate of $\theta$:

    $\hat{\theta}_{Jack} = \hat{\theta} - (N - 1)(\bar{\theta} - \hat{\theta}) = N\hat{\theta} - (N - 1)\bar{\theta}$.    (11.21)

This reduces the bias from $O(N^{-1})$ to $O(N^{-2})$, which is the same order of bias reduction as for the bootstrap. It is assumed that, as for the bootstrap, the estimator is a smooth $\sqrt{N}$-consistent estimator. The jackknife estimate can have increased variance compared with $\hat{\theta}$, and examples where the jackknife fails are given in Miller (1974).

A simple example is estimation of $\sigma^2$ from an iid sample with $y_i \sim [\mu, \sigma^2]$. The estimate $\hat{\sigma}^2 = N^{-1} \sum_i (y_i - \bar{y})^2$, the MLE under normality, has $E[\hat{\sigma}^2] = \sigma^2 (N - 1)/N$ so that the bias equals $-\sigma^2/N$, which is $O(N^{-1})$. In this example the jackknife estimate can be shown to simplify to $\hat{\sigma}^2_{Jack} = (N - 1)^{-1} \sum_i (y_i - \bar{y})^2$, so one does not need to compute $N$ separate estimates $\hat{\sigma}^2_{(-i)}$. This is an unbiased estimate of $\sigma^2$, so the bias is actually zero rather than the general result of $O(N^{-2})$.
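
The variance example can be checked numerically. The sketch below computes the bias-corrected jackknife estimate by brute force from the $N$ leave-one-out estimates and compares it with the usual unbiased sample variance. The function names and the data are hypothetical.

```python
import math
import statistics


def jackknife_bias_corrected(sample, estimator):
    """N * theta_hat - (N - 1) * (average of leave-one-out estimates)."""
    n = len(sample)
    theta_hat = estimator(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    return n * theta_hat - (n - 1) * theta_bar


def variance_mle(ys):
    """Variance estimate with divisor N (the MLE under normality)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)


y = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]  # hypothetical data
print(jackknife_bias_corrected(y, variance_mle), statistics.variance(y))
```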

The jackknife is due to Quenouille (1956). Tukey (1958) considered application to a wider range of statistics. In particular, the jackknife estimate of the standard error of an estimator $\hat{\theta}$ is

    $\widehat{se}_{Jack}[\hat{\theta}] = \left[ \frac{N - 1}{N} \sum_{i=1}^{N} (\hat{\theta}_{(-i)} - \bar{\theta})^2 \right]^{1/2}$.    (11.22)
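
As a check on the jackknife standard error, the sketch below applies it to the sample mean, for which it reproduces the familiar $s/\sqrt{N}$. The function names and data are hypothetical, and deviations are taken about the average of the leave-one-out estimates.

```python
import math
import statistics


def jackknife_se(sample, estimator):
    """Jackknife standard error from the N leave-one-out estimates."""
    n = len(sample)
    reps = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    rep_mean = sum(reps) / n
    return math.sqrt((n - 1) / n * sum((r - rep_mean) ** 2 for r in reps))


def sample_mean(xs):
    return sum(xs) / len(xs)


y = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]  # hypothetical data
print(jackknife_se(y, sample_mean), statistics.stdev(y) / math.sqrt(len(y)))
```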


Tukey proposed the term jackknife by analogy to a Boy Scout jackknife that solves a variety of problems, each of which could be solved more efficiently by a specially constructed tool. The jackknife is a "rough and ready" method for bias reduction in many situations, but it is not the ideal method in any. The jackknife can be viewed as a linear approximation of the bootstrap (Efron and Tibshirani, 1993, p. 146). It requires less computation than the bootstrap in small samples, as then $N < B$ is likely, but is outperformed by the bootstrap as $B \to \infty$.

Consider the linear regression model $y = X\beta + u$, with $\hat{\beta} = (X'X)^{-1}X'y$. An example of a biased estimator from OLS regression is a time-series model with lagged dependent variable as regressor. The regression estimator based on the $i$th jackknife sample $(X_{(-i)}, y_{(-i)})$ is given by

    $\hat{\beta}_{(-i)} = [X'_{(-i)} X_{(-i)}]^{-1} X'_{(-i)} y_{(-i)}$
    $\quad = [X'X - x_i x_i']^{-1} (X'y - x_i y_i)$
    $\quad = \hat{\beta} - [X'X]^{-1} x_i (y_i - x_i' \hat{\beta}_{(-i)})$.


The third equality avoids the need to invert $X'_{(-i)} X_{(-i)}$ for each $i$ and is obtained using

    $[X'X]^{-1} = [X'_{(-i)} X_{(-i)}]^{-1} - \frac{[X'_{(-i)} X_{(-i)}]^{-1} x_i x_i' [X'_{(-i)} X_{(-i)}]^{-1}}{1 + x_i' [X'_{(-i)} X_{(-i)}]^{-1} x_i}$.


Here the pseudo-values are given by $N\hat{\beta} - (N - 1)\hat{\beta}_{(-i)}$, and the jackknife estimator of $\beta$ is given by

    $\hat{\beta}_{Jack} = N\hat{\beta} - (N - 1) \frac{1}{N} \sum_{i=1}^{N} \hat{\beta}_{(-i)}$.    (11.23)
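
For a single regressor and no intercept the leave-one-out formulas reduce to scalar arithmetic, which makes them easy to verify. The sketch below computes $\hat{\beta}_{(-i)}$ from the full-sample sums rather than by re-estimating $N$ times, and then forms the jackknife estimator. The scalar setting, function names, and data are assumptions for illustration.

```python
def ols_slope(x, y):
    """OLS slope in a regression through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def leave_one_out_slopes(x, y):
    """beta_(-i) = (X'y - x_i y_i) / (X'X - x_i^2), from full-sample sums."""
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return [(sxy - a * b) / (sxx - a * a) for a, b in zip(x, y)]


def jackknife_slope(x, y):
    """N * beta_hat - (N - 1) * (average of the leave-one-out slopes)."""
    n = len(x)
    loo = leave_one_out_slopes(x, y)
    return n * ols_slope(x, y) - (n - 1) * sum(loo) / n


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [1.2, 1.9, 3.4, 3.8, 5.3]
print(ols_slope(x, y), jackknife_slope(x, y))
```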


An  interesting  application  of  the  jackknife  to  bias  reduction  is  the  jackknife  IV 
estimator  (see  Section  6.4.4). 


11.6.  Bootstrap  Applications 

We  consider  application  of  the  bootstrap  taking  into  account  typical  microeconometric 
complications  such  as  heteroskedasticity  and  clustering  and  more  complicated  estima¬ 
tors  that  can  lead  to  failure  of  simple  bootstraps. 

11.6.1.  Heteroskedastic  Errors 

For  least  squares  in  models  with  additive  errors  that  are  heteroskedastic,  the  standard 
procedure  is  to  use  White’s  heteroskedastic-consistent  covariance  matrix  estimator 
(HCCME).  This  is  well  known  to  perform  poorly  in  small  samples.  When  done  cor¬ 
rectly,  the  bootstrap  can  provide  an  improvement. 

The paired bootstrap leads to valid inference, since the essential assumption that $(y_i, x_i)$ is iid still permits $V[u_i | x_i]$ to vary with $x_i$ (see Section 4.4.7). However, it does not offer an asymptotic refinement because it does not impose the condition that $E[u_i | x_i] = 0$.

The usual residual bootstrap actually leads to invalid inference, since it assumes that $u_i | x_i$ is iid and hence erroneously imposes the condition of homoskedastic errors. In terms of Section 11.4 theory, $\hat{F}$ is then inconsistent for $F$. One can specify a formal model for heteroskedasticity, say $u_i = \exp(z_i'\alpha)\varepsilon_i$, where $\varepsilon_i$ are iid, obtain estimate $\exp(z_i'\hat{\alpha})$, and then bootstrap the implied residuals $\hat{\varepsilon}_i$. Consistency and asymptotic refinement of this bootstrap requires correct specification of the functional form for the heteroskedasticity.

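The wild bootstrap, introduced by Wu (1986) and Liu (1988) and studied further by Mammen (1993), provides asymptotic refinement without imposing such structure on the heteroskedasticity. This bootstrap replaces the OLS residual $\hat{u}_i$ by the following residual:

    $\hat{u}^*_i = \frac{1 - \sqrt{5}}{2}\hat{u}_i \simeq -0.6180\hat{u}_i$ with probability $\frac{1 + \sqrt{5}}{2\sqrt{5}} \simeq 0.7236$,
    $\hat{u}^*_i = \left[1 - \frac{1 - \sqrt{5}}{2}\right]\hat{u}_i \simeq 1.6180\hat{u}_i$ with probability $1 - \frac{1 + \sqrt{5}}{2\sqrt{5}} \simeq 0.2764$.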


Taking expectations with respect to only this two-point distribution and performing some algebra yields $E[\hat{u}^*_i] = 0$, $E[\hat{u}^{*2}_i] = \hat{u}_i^2$, and $E[\hat{u}^{*3}_i] = \hat{u}_i^3$. Thus $\hat{u}^*_i$ leads to a residual with zero conditional mean as desired, since $E[\hat{u}^*_i | \hat{u}_i, x_i] = 0$ implies $E[\hat{u}^*_i | x_i] = 0$, while the second and third moments are unchanged.
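
The moment claims can be verified directly by computing the first three moments of the two-point multiplier, here written as $v$ with $\hat{u}^*_i = v\hat{u}_i$. The constant and function names are hypothetical.

```python
import math

SQRT5 = math.sqrt(5.0)
LOW = (1.0 - SQRT5) / 2.0              # multiplier of about -0.6180
HIGH = 1.0 - LOW                       # multiplier of about 1.6180
P_LOW = (1.0 + SQRT5) / (2.0 * SQRT5)  # probability of about 0.7236


def two_point_moment(k):
    """k-th moment of the multiplier v in u* = v * u_hat."""
    return P_LOW * LOW ** k + (1.0 - P_LOW) * HIGH ** k


for k in (1, 2, 3):
    print(k, two_point_moment(k))
```

The first moment is zero and the second and third moments are one, up to floating-point rounding, so multiplying by $v$ leaves $\hat{u}_i^2$ and $\hat{u}_i^3$ unchanged in expectation.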

The wild bootstrap resamples have $i$th observation $(y^*_i, x_i)$, where $y^*_i = x_i'\hat{\beta} + \hat{u}^*_i$. The resamples vary because of different realizations of $\hat{u}^*_i$. Simulations by Horowitz (1997, 2001) show that this bootstrap works much better than a paired bootstrap when there is heteroskedasticity and works well compared to other bootstrap methods even if there is no heteroskedasticity.
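
A minimal sketch of wild bootstrap resampling for a simple regression with an intercept and one regressor follows. The regressors are held fixed and only the residual is perturbed by the two-point multiplier. The function names, the simulated heteroskedastic data, and the choice to collect bootstrap slope estimates are illustrative assumptions.

```python
import math
import random

SQRT5 = math.sqrt(5.0)
LOW = (1.0 - SQRT5) / 2.0
P_LOW = (1.0 + SQRT5) / (2.0 * SQRT5)


def fit_line(x, y):
    """OLS intercept and slope for a single regressor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def wild_bootstrap_slopes(x, y, B=499, seed=0):
    rng = random.Random(seed)
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    slopes = []
    for _ in range(B):
        # y*_i = fitted_i + v_i * resid_i, with v_i from the two-point distribution.
        y_star = [f + (LOW if rng.random() < P_LOW else 1.0 - LOW) * u
                  for f, u in zip(fitted, resid)]
        slopes.append(fit_line(x, y_star)[1])
    return slopes


# Hypothetical data with error variance increasing in x.
data_rng = random.Random(11)
x = [float(i) for i in range(1, 31)]
y = [1.0 + 2.0 * a + 0.2 * a * data_rng.gauss(0.0, 1.0) for a in x]
slopes = wild_bootstrap_slopes(x, y)
print(fit_line(x, y)[1], sum(slopes) / len(slopes))
```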

It seems surprising that this bootstrap should work because for the $i$th observation it draws from only two possible values for the residual, $-0.6180\hat{u}_i$ or $1.6180\hat{u}_i$. However, a similar draw is being made over all $N$ observations and over $B$ bootstrap iterations. Recall also that White's estimator replaces $E[u_i^2]$ by $\hat{u}_i^2$, which, although incorrect for one observation, is valid when averaged over the sample. The wild bootstrap is instead drawing from a two-point distribution with mean 0 and variance $\hat{u}_i^2$.


11.6.2.  Panel  Data  and  Clustered  Data 
Consider a linear panel regression model

    $\tilde{y}_{it} = \tilde{w}_{it}'\theta + \tilde{u}_{it}$,

where $i$ denotes individual and $t$ denotes time period. Following the notation of Section 21.2.3, the tilde is added as the original data $y_{it}$ and $x_{it}$ may first be transformed to eliminate fixed effects, for example. We assume that the errors $\tilde{u}_{it}$ are independent over $i$, though they may be heteroskedastic and correlated over $t$ for given $i$.

If the panel is short, so that $T$ is finite and asymptotic theory relies on $N \to \infty$, then consistent standard errors for $\hat{\theta}$ can be obtained by a paired or EDF bootstrap that resamples over $i$ but does not resample over $t$. In the preceding presentation $w_i$ becomes $[y_{i1}, x_{i1}, \ldots, y_{iT}, x_{iT}]$ and we resample over $i$ and obtain all $T$ observations for the chosen $i$.

This panel bootstrap, also called a block bootstrap, can also be applied to the nonlinear panel models of Chapter 23. The key assumptions are that the panel is short and the data are independent over $i$. More generally, this bootstrap can be applied whenever data are clustered (see Section 24.5), provided cluster size is finite and the number of clusters goes to infinity.

The panel bootstrap produces standard errors that are asymptotically equivalent to panel robust sandwich standard errors (see Section 21.2.3). It does not provide an asymptotic refinement. However, it is quite simple to implement and is practically very useful as many packages do not automatically provide panel robust standard errors even for quite basic panel estimators such as the fixed effects estimator. Depending on the application, other bootstraps such as parametric and residual bootstraps may be possible, provided again that resampling is over $i$ only.
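
Resampling over $i$ only can be sketched as follows: individuals (or clusters) are drawn with replacement, and all observations for a drawn individual are kept together. The data layout (a dictionary from individual identifier to a list of rows) and the function name are assumptions for illustration.

```python
import random


def cluster_resample(panel, rng):
    """Draw individuals with replacement, keeping all T rows of each drawn individual."""
    ids = list(panel)
    drawn = [ids[rng.randrange(len(ids))] for _ in ids]
    # Relabel so that an individual drawn twice enters as two distinct clusters.
    return {new_id: list(panel[old_id]) for new_id, old_id in enumerate(drawn)}


# Hypothetical short panel: three individuals, two periods, rows are (y_it, x_it).
panel = {
    "a": [(1.0, 0.5), (1.2, 0.6)],
    "b": [(2.0, 1.5), (2.1, 1.4)],
    "c": [(0.3, 0.1), (0.4, 0.2)],
}
print(cluster_resample(panel, random.Random(0)))
```

An estimator would then be recomputed on each such resample, and the standard deviation of the estimates across resamples taken as the bootstrap standard error.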

Asymptotic refinement is straightforward if the errors are iid. More realistically, however, $\tilde{u}_{it}$ will be heteroskedastic and correlated over $t$ for given $i$. The wild bootstrap (see Section 11.6.1) should provide an asymptotic refinement in a linear model if the panel is short. Then wild bootstrap resamples have $(i, t)$th observation $(\tilde{y}^*_{it}, \tilde{w}_{it})$, where $\tilde{y}^*_{it} = \tilde{w}_{it}'\hat{\theta} + \hat{u}^*_{it}$, $\hat{u}_{it} = \tilde{y}_{it} - \tilde{w}_{it}'\hat{\theta}$ and $\hat{u}^*_{it}$ is a draw from the two-point distribution given in Section 11.6.1.

11.6.3.  Hypothesis  and  Specification  Tests 

Section 11.2.6 focused on tests of the hypothesis $\theta = \theta_0$. Here we consider more general tests. As in Section 11.2.6, the bootstrap can be used to perform hypothesis tests with or without asymptotic refinement.


Tests  without  Asymptotic  Refinement 


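A leading example of the usefulness of the bootstrap is the Hausman test (see Section 8.3). Standard implementation of this test requires estimation of $V[\hat{\theta} - \tilde{\theta}]$, where $\hat{\theta}$ and $\tilde{\theta}$ are the two estimators being contrasted. Obtaining this estimate can be difficult unless the strong assumption is made that one of the estimators is fully efficient under $H_0$. The paired bootstrap can be used instead, leading to consistent estimate

    $\hat{V}_{Boot}[\hat{\theta} - \tilde{\theta}] = \frac{1}{B - 1} \sum_{b=1}^{B} [(\hat{\theta}^*_b - \tilde{\theta}^*_b) - (\bar{\hat{\theta}}^* - \bar{\tilde{\theta}}^*)][(\hat{\theta}^*_b - \tilde{\theta}^*_b) - (\bar{\hat{\theta}}^* - \bar{\tilde{\theta}}^*)]'$,

where $\bar{\hat{\theta}}^* = B^{-1} \sum_b \hat{\theta}^*_b$ and $\bar{\tilde{\theta}}^* = B^{-1} \sum_b \tilde{\theta}^*_b$. Then compute

    $H = (\hat{\theta} - \tilde{\theta})' (\hat{V}_{Boot}[\hat{\theta} - \tilde{\theta}])^{-1} (\hat{\theta} - \tilde{\theta})$    (11.24)

and compare to chi-square critical values. As mentioned in Chapter 8, a generalized inverse may need to be used and care may be needed to ensure chi-square critical values are obtained using the correct degrees of freedom.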
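
For a scalar parameter the bootstrap Hausman statistic needs no matrix algebra. The sketch below contrasts two estimators of the slope in a regression through the origin, OLS and the ratio of sample means, using a paired bootstrap to estimate the variance of their difference. The choice of estimators, the function names, and the simulated data are assumptions made for illustration.

```python
import math
import random


def ols(pairs):
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


def ratio(pairs):
    return sum(y for _, y in pairs) / sum(x for x, _ in pairs)


def bootstrap_hausman(pairs, B=499, seed=0):
    """H = (difference)^2 / V_boot[difference] for a scalar parameter."""
    rng = random.Random(seed)
    n = len(pairs)
    diffs = []
    for _ in range(B):
        # Paired bootstrap: resample (x_i, y_i) jointly.
        star = [pairs[rng.randrange(n)] for _ in range(n)]
        diffs.append(ols(star) - ratio(star))
    d_bar = sum(diffs) / B
    v_boot = sum((d - d_bar) ** 2 for d in diffs) / (B - 1)
    return (ols(pairs) - ratio(pairs)) ** 2 / v_boot


# Hypothetical data generated so that both estimators are consistent.
data_rng = random.Random(3)
pairs = []
for _ in range(100):
    x = data_rng.uniform(1.0, 3.0)
    pairs.append((x, 2.0 * x + data_rng.gauss(0.0, 1.0)))

H = bootstrap_hausman(pairs)
print(H, H > 3.841)  # 3.841 is the 5% critical value of chi-square(1)
```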

More  generally,  this  approach  can  be  used  for  any  standard  normal  test  or  chi-square 
distributed  test  where  implementation  is  difficult  because  a  variance  matrix  must  be 
estimated.  Examples  include  hypothesis  tests  based  on  a  two-step  estimator  and  the 
m-tests  of  Chapter  8. 


Tests  with  Asymptotic  Refinement 

Many tests, especially those for fully parametric models such as the LM test and IM test, can be simply implemented using an auxiliary regression (see Sections 7.3.5 and 8.2.2). The resulting test statistics, however, perform poorly in finite samples as documented in many Monte Carlo studies. Such test statistics are easily computed and are asymptotically pivotal as the chi-square distribution does not depend on unknown parameters. They are therefore prime candidates for asymptotic refinement by bootstrap.

Consider the m-test of $H_0: E[m_i(y_i | x_i, \theta)] = 0$ against $H_a: E[m_i(y_i | x_i, \theta)] \ne 0$ (see Section 8.2). From the original data estimate $\theta$ by ML, and calculate the test statistic $M$. Using a parametric bootstrap, resample $y^*_i$ from the fitted conditional density $f(y_i | x_i, \hat{\theta})$, for fixed regressors in repeated samples, or from $f(y_i | x^*_i, \hat{\theta})$. Compute $M^*_b$, $b = 1, \ldots, B$, in the bootstrap resamples. Reject $H_0$ at level $\alpha$ if the original calculated statistic $M$ exceeds the upper $\alpha$ quantile (that is, the $(1 - \alpha)$ quantile) of $M^*_b$, $b = 1, \ldots, B$.
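
A stripped-down version of this procedure is sketched below for a model without regressors. The null model is exponential, the moment restriction tested is $E[y^2] = 2(E[y])^2$, and the statistic is scale-free and therefore pivotal under the null. The model, the particular statistic, the function names, and the quantile indexing are all assumptions chosen to keep the example short.

```python
import math
import random


def m_statistic(ys):
    """sqrt(N) * |mean(y^2) / ybar^2 - 2|, which is zero in expectation under the exponential."""
    n = len(ys)
    ybar = sum(ys) / n
    m2 = sum(y * y for y in ys) / n
    return math.sqrt(n) * abs(m2 / (ybar * ybar) - 2.0)


def parametric_bootstrap_test(ys, alpha=0.05, B=499, seed=0):
    rng = random.Random(seed)
    n = len(ys)
    rate_hat = n / sum(ys)  # ML estimate of the exponential rate
    m_obs = m_statistic(ys)
    # Resample from the fitted density and recompute the statistic each time.
    m_stars = sorted(
        m_statistic([rng.expovariate(rate_hat) for _ in range(n)])
        for _ in range(B)
    )
    critical = m_stars[math.ceil((1 - alpha) * (B + 1)) - 1]
    return m_obs, critical, m_obs > critical


data_rng = random.Random(7)
ys = [data_rng.expovariate(0.5) for _ in range(100)]  # hypothetical data satisfying the null
print(parametric_bootstrap_test(ys))
```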

Horowitz  (1994)  presented  this  bootstrap  for  the  IM  test  and  demonstrated  with 
simulation  examples  that  there  are  substantial  finite-sample  gains  to  this  bootstrap.  A 
detailed  application  by  Drukker  (2002)  to  specification  tests  for  the  tobit  model  sug¬ 
gests  that  conditional  moment  specification  tests  can  be  easily  applied  to  fully  para¬ 
metric  models,  since  any  size  distortion  in  the  auxiliary  regressions  can  be  corrected 
through  bootstrap. 

Note that bootstrap tests without asymptotic refinement, such as the Hausman test given here, can be refined by use of the nested bootstrap given in Section 11.5.3.


11.6.4. GMM, Minimum Distance, and Empirical Likelihood in Overidentified Models

The GMM estimator is based on population moment conditions $E[h(w_i, \theta)] = 0$ (see Section 6.3.1). In a just-identified model a consistent estimator simply solves $N^{-1} \sum_i h(w_i, \hat{\theta}) = 0$. In overidentified models this estimator is no longer feasible. Instead, the GMM estimator is used (see Section 6.3.2).

Now consider bootstrapping, using the paired or EDF bootstrap. For GMM in an overidentified model $N^{-1} \sum_i h(w_i, \hat{\theta}) \ne 0$, so this bootstrap does not impose on the bootstrap resamples the original population restriction that $E[h(w_i, \theta)] = 0$. As a result even if the asymptotically pivotal $t$-statistic is used there is no longer a bootstrap refinement, though bootstraps on $\hat{\theta}$ and related confidence intervals and $t$-test statistics remain consistent. More fundamentally, the bootstrap of the OIR test (see Section 6.3.8) can be shown to be inconsistent. We focus on cross-section data but similar issues arise for panel GMM estimators (see Chapter 22) in overidentified models.

Hall and Horowitz (1996) propose correcting this by recentering. Then the bootstrap is based on $h^*(w_i, \theta) = h(w_i, \theta) - N^{-1} \sum_i h(w_i, \hat{\theta})$ and asymptotic refinements can be obtained for statistics based on $\hat{\theta}$ including the OIR test.
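
The recentering step amounts to subtracting the original-sample average of the moment function, evaluated at the original estimate, from every moment evaluation used in the bootstrap. A sketch, with a hypothetical two-moment function for a scalar parameter and an arbitrary illustrative value standing in for the GMM estimate:

```python
def recentered_moments(h, data, theta_hat):
    """Return h*(w, theta) = h(w, theta) - N^{-1} sum_i h(w_i, theta_hat)."""
    n = len(data)
    k = len(h(data[0], theta_hat))
    h_bar = [sum(h(w, theta_hat)[j] for w in data) / n for j in range(k)]

    def h_star(w, theta):
        return [hj - bj for hj, bj in zip(h(w, theta), h_bar)]

    return h_star


def h(w, theta):
    """Hypothetical overidentified moments: both components of w have mean theta."""
    return [w[0] - theta, w[1] - theta]


data = [(1.0, 1.5), (2.0, 1.0), (3.0, 3.5)]  # hypothetical observations
theta_hat = 2.0  # stands in for the GMM estimate from the original sample
h_star = recentered_moments(h, data, theta_hat)
print([sum(h_star(w, theta_hat)[j] for w in data) / len(data) for j in range(2)])
```

By construction the recentered moments average to zero in the original sample at the original estimate, which is the restriction the uncorrected bootstrap fails to impose.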

Horowitz  (1998)  does  similar  recentering  for  the  minimum  distance  estimator  (see 
Section  6.7).  He  then  applies  the  bootstrap  to  the  covariance  structure  example  of 
Altonji  and  Segal  (1996)  discussed  in  Section  6.3.5. 

An alternative adjustment proposed by Brown and Newey (2002) is to not recenter but to instead resample the observations $w_i$ with probabilities that vary across observations rather than using equal weights $1/N$. Specifically, let $\Pr[w^* = w_i] = \hat{\pi}_i$, where $\hat{\pi}_i = 1/[N(1 + \hat{\lambda}'\hat{h}_i)]$, $\hat{h}_i = h(w_i, \hat{\theta})$, and $\hat{\lambda}$ maximizes $\sum_i \ln(1 + \lambda'\hat{h}_i)$. The motivation is that the probabilities $\hat{\pi}_i$ equivalently are the solution to an empirical likelihood (EL) problem (see Section 6.8.2) of maximizing $\sum_i \ln \pi_i$ with respect to $\pi_1, \ldots, \pi_N$ subject to the constraints $\sum_i \pi_i \hat{h}_i = 0$ and $\sum_i \pi_i = 1$. This empirical likelihood bootstrap of the GMM estimator therefore imposes the constraint $\sum_i \hat{\pi}_i \hat{h}_i = 0$.

One could instead work directly with EL from the beginning, letting $\hat{\theta}$ be the
EL estimator rather than the GMM estimator. The advantage of the Brown and Newey
(2002) approach is that it avoids the more challenging computation of the EL estimator.
Instead, one needs only the GMM estimator and solution of the concave programming
problem of maximizing $\sum_i \ln(1 + \lambda'\hat{h}_i)$.
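The resampling probabilities described above can be sketched in a few lines for the special case of a single (scalar) moment condition. This is only an illustration under stated assumptions: the function names `el_weights` and `el_resample` are hypothetical, the moment values are taken as given numbers rather than computed from a GMM fit, and the maximization over the multiplier uses a simple Newton iteration with step halving to keep every $1 + \lambda h_i$ positive.

```python
import random


def el_weights(h, tol=1e-12, max_iter=100):
    """Empirical likelihood resampling probabilities for scalar moment values h_i.

    Maximizes sum_i ln(1 + lam * h_i) over lam (a concave problem) and returns
    pi_i = 1 / (N * (1 + lam * h_i)) together with lam.
    Assumption: a single moment condition, so lam is a scalar.
    """
    n = len(h)
    if not (min(h) < 0.0 < max(h)):
        raise ValueError("zero must lie strictly inside the range of the h_i")
    lam = 0.0
    for _ in range(max_iter):
        denom = [1.0 + lam * hi for hi in h]
        grad = sum(hi / d for hi, d in zip(h, denom))
        hess = -sum((hi / d) ** 2 for hi, d in zip(h, denom))  # always negative
        step = -grad / hess
        # Halve the Newton step until all 1 + lam * h_i remain positive.
        while any(1.0 + (lam + step) * hi <= 0.0 for hi in h):
            step /= 2.0
        lam += step
        if abs(step) < tol:
            break
    pi = [1.0 / (n * (1.0 + lam * hi)) for hi in h]
    return pi, lam


def el_resample(data, pi, rng):
    """Draw one bootstrap sample of the same size using unequal probabilities pi."""
    return rng.choices(data, weights=pi, k=len(data))


if __name__ == "__main__":
    h_values = [-1.0, 0.5, 2.0, -0.3, 0.1]  # made-up moment values
    probs, lam_hat = el_weights(h_values)
    print(lam_hat, sum(probs), sum(p * hi for p, hi in zip(probs, h_values)))
    print(el_resample(h_values, probs, random.Random(1)))
```

By construction the weighted moment $\sum_i \hat{\pi}_i \hat{h}_i$ is zero at the solution, which is the constraint that the equal-weight bootstrap fails to impose in overidentified models.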


11.6.5.  Nonparametric  Regression 

Nonparametric density and regression estimators converge at rate less than $\sqrt{N}$ and
are asymptotically biased. This complicates inference such as confidence intervals (see
Sections 9.3.7 and 9.5.4).

We consider the kernel regression estimator $\hat{m}_h(x_0)$ of $m(x_0) = E[y | x = x_0]$
for observations $(y, x)$ that are iid, though conditional heteroskedasticity is permitted.
From Horowitz (2001, p. 3204), an asymptotically pivotal statistic is

$$ t = \frac{\hat{m}_h(x_0) - m(x_0)}{s_{\hat{m}}(x_0)}, $$

where $\hat{m}_h(x_0)$ is an undersmoothed kernel regression estimator with bandwidth
$h = o(N^{-1/3})$ rather than the optimal $h^* = O(N^{-1/5})$ and

$$ s^2_{\hat{m}}(x_0) = \frac{1}{Nh[\hat{f}(x_0)]^2} \sum_{i=1}^N (y_i - \hat{m}_h(x_i))^2 K\left(\frac{x_i - x_0}{h}\right)^2, $$

where $\hat{f}(x_0)$ is a kernel estimate of the density $f(x)$ at $x = x_0$. A paired bootstrap
resamples $(y^*, x^*)$ and forms
$t_b^* = [\hat{m}^*_{h,b}(x_0) - \hat{m}_h(x_0)] / s^*_{\hat{m},b}(x_0)$, where
$s^*_{\hat{m},b}(x_0)$ is computed using bootstrap sample kernel estimates
$\hat{m}^*_{h,b}(x_i)$ and $\hat{f}^*_b(x_0)$. The percentile-t
confidence interval of Section 11.2.7 then provides an asymptotic refinement. For a
symmetrical confidence interval or symmetrical test at level $\alpha$ the error is
$o((Nh)^{-1})$ rather than $O((Nh)^{-1})$ using first-order asymptotic approximation.

Several variations on this bootstrap are possible. Rather than using undersmoothing,
bias can be eliminated by directly estimating the bias term given in Section 9.5.2.
Also rather than using $s^2_{\hat{m}}(x_0)$, the variance term given in Section 9.5.2 can
be directly estimated.

Yatchew (2003) provides considerable detail on implementing the bootstrap in
nonparametric and semiparametric regression.


11.6.6.  Nonsmooth  Estimators 

From  Section  11.4.2  the  bootstrap  assumes  smoothness  in  estimators  and  statistics. 
Otherwise  the  bootstrap  may  not  offer  an  asymptotic  refinement  and  may  even  be 
invalid. 

As illustration we consider the LAD estimator and extension to binary data. The
LAD estimator (see Section 4.6.2) has objective function $\sum_i |y_i - x_i'\beta|$ that has
discontinuous first derivative. A bootstrap can provide a valid asymptotic
approximation but does not provide an asymptotic refinement. For binary outcomes, the
LAD estimator extends to the maximum score estimator of Manski (1975) (see
Section 14.7.2). For this estimator the bootstrap is not even consistent.

In these examples bootstraps with asymptotic refinements can be obtained by using
a smoothed version of the original objective function for the estimator. For example,
the smoothed maximum score estimator of Horowitz (1992) is presented in
Section 14.7.2.


11.6.7.  Time  Series 

The  bootstrap  relies  on  resampling  from  an  iid  distribution.  Time-series  data  therefore 
present  obvious  problems  as  the  result  of  dependence. 

The bootstrap is straightforward in the linear model with an ARMA error structure
and resampling the underlying white noise error. As an example, suppose
$y_t = \beta x_t + u_t$, where $u_t = \rho u_{t-1} + \varepsilon_t$ and $\varepsilon_t$ is
white noise. Then given estimates $\hat{\beta}$ and $\hat{\rho}$ we can recursively compute
residuals as
$\hat{\varepsilon}_t = \hat{u}_t - \hat{\rho}\hat{u}_{t-1} = y_t - x_t\hat{\beta} - \hat{\rho}(y_{t-1} - x_{t-1}\hat{\beta})$.
Bootstrapping these residuals to give $\hat{\varepsilon}^*_t$, $t = 1, \ldots, T$, we can then
recursively compute $\hat{u}^*_t = \hat{\rho}\hat{u}^*_{t-1} + \hat{\varepsilon}^*_t$ and hence
$y^*_t = \hat{\beta}x_t + \hat{u}^*_t$. Then regress $y^*_t$ on $x_t$ with AR(1)
error. An early example was presented by Freedman (1984), who bootstrapped a
dynamic linear simultaneous equations regression model estimated by 2SLS. Given
linearity, simultaneity adds few problems. The dynamic nature of the model is handled
by recursively constructing $y^*_t = f(y^*_{t-1}, x_t, u^*_t)$, where $u^*_t$ are obtained by
resampling from the 2SLS structural equation residuals and $y^*_0 = y_0$. Then perform
2SLS on each bootstrap sample.
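The recursive construction for the AR(1) example can be sketched as follows. This is a minimal illustration, not a full procedure: the estimates of $\beta$ and $\rho$ are taken as given, the function names are hypothetical, and two details are assumptions made here rather than taken from the text, namely that the residuals are recentered before resampling and that the recursion is started at the first fitted error.

```python
import random


def ar1_residuals(y, x, beta, rho):
    """Return fitted errors u_t = y_t - beta*x_t and innovations
    eps_t = u_t - rho*u_{t-1} for t = 2, ..., T."""
    u = [yt - beta * xt for yt, xt in zip(y, x)]
    eps = [u[t] - rho * u[t - 1] for t in range(1, len(u))]
    return u, eps


def ar1_bootstrap_sample(y, x, beta, rho, rng):
    """Build one bootstrap sample y*_t = beta*x_t + u*_t with
    u*_t = rho*u*_{t-1} + eps*_t, where eps*_t is resampled from the residuals."""
    u, eps = ar1_residuals(y, x, beta, rho)
    mean_eps = sum(eps) / len(eps)
    centered = [e - mean_eps for e in eps]  # assumption: recenter before resampling
    u_star = [u[0]]                         # assumption: start at the first fitted error
    for _ in range(1, len(y)):
        u_star.append(rho * u_star[-1] + rng.choice(centered))
    return [beta * xt + ut for xt, ut in zip(x, u_star)]


if __name__ == "__main__":
    x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
    y_obs = [2.3, 3.9, 6.4, 7.8, 10.1]  # made-up data
    print(ar1_bootstrap_sample(y_obs, x_obs, beta=2.0, rho=0.5, rng=random.Random(0)))
```

Each bootstrap sample would then be passed to whatever AR(1) regression routine produced the original estimates, and the statistic of interest recomputed.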

This method assumes the underlying error is iid. For general dependent data without
an ARMA specification, for example, nonstationary data, the moving blocks bootstrap
presented in Section 11.5.2 can be used.

For testing unit roots or cointegration special care is needed in applying the
bootstrap as the behavior of the test statistic changes discontinuously at the unit root.
See, for example, Li and Maddala (1997). Although it is possible to implement a
valid bootstrap in this situation, to date these bootstraps do not provide an asymptotic
refinement.


11.7.  Practical  Considerations 

The bootstrap without asymptotic refinement can be a very useful tool for the applied
researcher in situations where it is difficult to perform inference by other means. This
need can vary with available software and the practitioner’s tool kit. The most common
application of the bootstrap to date is computation of standard errors needed to conduct
a Wald hypothesis test. Examples include heteroskedasticity-robust and panel-robust
inference, inference for two-step estimators, and inference on transformations of
estimators. Other potential applications include computation of m-test statistics such as
the Hausman test.

The bootstrap can additionally provide an asymptotic refinement. Many Monte
Carlo studies show that quite standard procedures can perform poorly in finite
samples. There appears to be great potential for use of bootstrap refinements, currently
unrealized. In some cases this could improve existing inference, such as use of the
wild bootstrap in models with additive errors that are heteroskedastic. In other cases it
should encourage increased use of methods that are currently under-utilized. In
particular, model specification tests with good small-sample properties can be implemented
by bootstrapping easily computed auxiliary regressions.

There are two barriers to the use of the bootstrap. First, the bootstrap is not always
built into statistical packages. This will change over time, and for now constructing
code for a bootstrap is not too difficult provided the package includes looping and the
ability to save regression output. Second, there are subtleties involved. Asymptotic
refinement requires use of an asymptotically pivotal statistic and the simplest bootstraps
presume iid data and smoothness of estimators and statistics. This covers a wide class
of applications but not all applications.


11.8.  Bibliographic  Notes 

The  bootstrap  was  proposed  by  Efron  (1979)  for  the  iid  case.  Singh  (1981)  and  Bickel  and 
Freedman  (1981)  provided  early  theory.  A  good  introductory  statistics  treatment  is  by  Efron 
and Tibshirani (1993), and a more advanced treatment is by Hall (1992). Extensions to
the  regression  case  were  considered  early  on;  see,  for  example,  Freedman  (1984).  Most  of 
the  work  by  econometricians  has  occurred  in  the  past  10  years.  The  survey  of  Horowitz 
(2001)  is  very  comprehensive  and  is  well  complemented  by  the  survey  of  Brownstone  and 
Kazimi  (1998),  which  considers  many  econometrics  applications,  and  the  paper  by  MacKinnon 
(2002). 


- Exercises - 

11-1 Consider the model $y = \alpha + \beta x + \varepsilon$, where $\alpha$, $\beta$, and $x$
are scalars and $\varepsilon \sim N[0, \sigma^2]$. Generate a sample of size $N = 20$ with
$\alpha = 2$, $\beta = 1$, and $\sigma^2 = 1$ and suppose that $x \sim N[2, 2]$. We wish to
test $H_0: \beta = 1$ against $H_a: \beta \neq 1$ at level 0.05 using the t-statistic
$t = (\hat{\beta} - 1)/\mathrm{se}[\hat{\beta}]$. Do as much of the following as your
software permits. Use $B = 499$ bootstrap replications.

(a) Estimate the model by OLS, giving slope estimate $\hat{\beta}$.

(b)  Use  a  paired  bootstrap  to  compute  the  standard  error  and  compare  this  to 
the  original  sample  estimate.  Use  the  bootstrap  standard  error  to  test  H0. 

(c)  Use  a  paired  bootstrap  with  asymptotic  refinement  to  test  H0. 

(d)  Use  a  residual  bootstrap  to  compute  the  standard  error  and  compare  this  to 
the  original  sample  estimate.  Use  the  bootstrap  standard  error  to  test  H0. 

(e)  Use  a  residual  bootstrap  with  asymptotic  refinement  to  test  H0. 

11-2 Generate a sample of size 20 according to the following dgp. The two regressors
are generated by $x_1 \sim \chi^2(4) - 4$ and $x_2 \sim 3.5 + U[1, 2]$; the error is from a
mixture of normals with $u \sim N[0, 25]$ with probability 0.3 and $u \sim N[0, 5]$ with
probability 0.7; and the dependent variable is $y = 1.3x_1 + 0.7x_2 + 0.5u$.

(a) Estimate by OLS the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$.

(b) Suppose we are interested in estimating the quantity $\gamma = \beta_1 + \beta_2^2$ from the
data. Use the least-squares estimates to estimate this quantity. Use the
delta method to obtain approximate standard error for this function.

(c) Then estimate the standard error of $\hat{\gamma}$ using a paired bootstrap. Compare
this to $\mathrm{se}[\hat{\gamma}]$ from part (b) and explain the difference. For the bootstrap use
$B = 25$ and $B = 200$.

(d) Now test $H_0: \gamma = 1.0$ at level 0.05 using a paired bootstrap with $B = 999$.
Perform bootstrap tests without and with asymptotic refinement.

11-3 Use 200 observations from the Section 4.6.4 data on natural logarithm of health
expenditure ($y$) and natural logarithm of total expenditure ($x$). Obtain OLS
estimates of the model $y = \alpha + \beta x + u$. Use the paired bootstrap with $B = 999$.

(a) Obtain a bootstrap estimate of the standard error of $\hat{\beta}$.

(b) Use this standard error estimate to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$.

(c) Do a bootstrap test with refinement of $H_0: \beta = 1$ against $H_a: \beta \neq 1$ under
the assumption that $u$ is homoskedastic.

(d) If $u$ is heteroskedastic what happens to your method in (c)? Is the test still
asymptotically valid, and if so does it offer an asymptotic refinement?

(e) Do a bootstrap to obtain a bias-corrected estimate of $\beta$.


Simulation-Based Methods


12.1.  Introduction 

The nonlinear methods presented in the preceding chapters do not require closed-form
solutions for the estimator. Nonetheless, they rely considerably on analytical
tractability. In particular, the objective function for the estimator has been assumed to have a
closed-form expression, and the asymptotic distribution of the estimator is based on a
linearization of the estimating equations.

In the current chapter we present simulation-based estimation methods. The
treatment of ML estimation in Chapter 5 presumed that the density $f(y|x, \theta)$ has a
closed-form expression. If there is no closed-form solution, ML estimation may still be
possible if we instead use a good approximation $\hat{f}(y|x, \theta)$ of $f(y|x, \theta)$ to form the
likelihood function. A common reason for lack of a closed-form expression for the
density is the presence of an intractable expectation in the definition of $f(y|x, \theta)$. For
example, in a random coefficients model it may be difficult to integrate out the
random parameters. If the expectation is replaced by a Monte Carlo approximation the
resulting estimator is called a simulation-based estimator. A similar simulation
approach can be applied to method of moments estimation based on a moment, such as
the conditional mean, for which there is no closed-form solution. In the method of
moments case it can be possible to obtain consistent parameter estimates with much
less simulation than is necessary for consistency in the ML case.

These estimation methods are computer intensive because they make extensive use
of Monte Carlo sampling methods. Their use raises questions of accuracy of
approximations, efficiency of computation, and the sampling properties of the estimators that
use such approximations.

Section 12.2 gives motivating examples for simulation-based estimation.
Section 12.3 covers the basics of computing integrals, as an expectation with respect
to a continuous random variable is an integral. Sections 12.4 and 12.5 present
maximum simulated likelihood estimation and simulated moment-based estimation;
Section 12.6 deals with indirect inference. These estimators require simulators, detailed
in Section 12.7, and pseudo-random numbers, detailed in Section 12.8.


12.2. Examples

We consider examples where the conditional density of $y$ given regressors $x$ and
parameters $\theta$ is an integral

$$ f(y|x, \theta) = \int h(y|x, \theta, u)\, g(u)\, du, \tag{12.1} $$

where the functional forms of $h(\cdot)$ and $g(\cdot)$ are known and $u$ denotes a random variable,
not necessarily an error term, that needs to be integrated out. If there is no analytical
solution for the integral, and hence no closed-form expression for the likelihood
function, then simulation-based estimation methods are warranted.


12.2.1.  Random  Parameters  Models 

A random parameters model or random coefficients model permits regression
parameters to vary across individuals according to some distribution. A fully parametric
random parameters model specifies the dependent variable $y_i$ conditional on
regressors $x_i$ and given parameters $\gamma_i$ to have conditional density $f(y_i|x_i, \gamma_i)$,
where $\gamma_i$ are iid with density $g(\gamma_i|\theta)$. Inference is based on the density of $y_i$
conditional on $x_i$ and given $\theta$,

$$ f(y|x, \theta) = \int f(y|x, \gamma)\, g(\gamma|\theta)\, d\gamma. \tag{12.2} $$

This integral will not have a closed-form solution except in some special cases. A
common specification is to assume normally distributed random parameters, with
$\gamma_i \sim N[\mu, \Sigma]$. Then $\gamma_i = \mu + \Sigma^{1/2} u_i$, where $u_i \sim N[0, I]$ and we can
rewrite (12.2) in the form (12.1), where $\theta$ is a vector containing $\mu$ and the distinct
components of $\Sigma$, and $g(u)$ is the $N[0, I]$ density.
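As a concrete illustration of replacing the integral in (12.2) by an average over draws, the sketch below approximates the marginal density for a scalar random coefficient. Everything specific here is an assumption made for the example rather than a model from the text: the conditional density is taken to be Poisson with mean $\exp(\gamma x)$, the coefficient is $\gamma = \mu + \sigma u$ with $u \sim N[0, 1]$, and the function names are hypothetical.

```python
import math
import random


def poisson_pmf(y, mean):
    """Poisson probability of count y for the given mean."""
    return math.exp(-mean) * mean ** y / math.factorial(y)


def simulated_density(y, x, mu, sigma, n_draws, rng):
    """Approximate f(y|x, theta) = E_u[ f(y|x, mu + sigma*u) ], u ~ N[0, 1],
    by averaging the conditional density over n_draws simulated values of u."""
    total = 0.0
    for _ in range(n_draws):
        gamma = mu + sigma * rng.gauss(0.0, 1.0)  # gamma = mu + sigma * u
        total += poisson_pmf(y, math.exp(gamma * x))
    return total / n_draws


if __name__ == "__main__":
    rng = random.Random(42)
    print(simulated_density(y=2, x=1.0, mu=0.5, sigma=0.3, n_draws=10000, rng=rng))
```

When $\sigma = 0$ the coefficient is no longer random and the simulated density collapses to the ordinary Poisson density, which gives a simple check on the code.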

A  simple  example  of  a  random  parameters  model  is  neglected  heterogeneity.  Then 
often  just  one  parameter,  usually  the  intercept,  is  assumed  to  be  random  and  the  integral 
is  a  one-dimensional  integral  that  is  easily  approximated  numerically.  More  generally, 
however,  the  dimension  of  the  integral  may  be  high. 

Leading examples of random parameters and unobserved heterogeneity include (1)
normally distributed random parameters in multinomial logit models (the random
parameters logit model; see Chapter 15), (2) gamma distributed unobserved
heterogeneity in Weibull duration models (see Chapter 19), (3) gamma distributed unobserved
heterogeneity in Poisson count data models (see Chapter 20), and (4)
individual-specific random effects in panel data models (see Chapter 21). Closed-form solutions
for the resulting marginal density after integration over the distribution of
heterogeneity are available in example 3 and for the linear model under normality in example 4.
However, for examples 1 and 2 and many nonlinear applications of example 4
closed-form solutions are not available.


12.2.2. Limited Dependent Variable Models

A limited dependent variable (LDV) is a dependent variable that is observed only
over part of its range, owing to censoring and truncation. Then the density of the
observed variable involves integrals that may not have a closed-form expression.

A  leading  class  of  LDV  models  are  discrete  choice  models,  detailed  in  Chapters 
14  and  15.  We  introduce  discrete  choice  models  here  because  they  have  been  the  focus 
of  the  econometrics  literature  on  simulation-based  estimation. 

As an example, consider consumer choice among three mutually exclusive
alternatives, such as among three different durable goods, only one of which is chosen by
the individual. Suppose the consumer maximizes utility, and let the utilities of
alternatives 1, 2, and 3 be given by $U_1$, $U_2$, and $U_3$, respectively. The utilities $U_1$,
$U_2$, and $U_3$ are not observed. Instead, we observe only a discrete outcome variable
$y = 1$, 2, or 3 depending on which alternative is chosen.

Suppose alternative 1 is chosen, because it has the highest utility. Then the
probability mass function is $p_1 = \Pr[y = 1]$, where

$$
\begin{aligned}
p_1 &= \Pr[U_1 - U_2 > 0,\; U_1 - U_3 > 0] \\
    &= \Pr[(x_1 - x_2)'\beta + \varepsilon_1 - \varepsilon_2 > 0,\; (x_1 - x_3)'\beta + \varepsilon_1 - \varepsilon_3 > 0],
\end{aligned}
$$

if we make the common assumption (see Section 15.5.1) that
$U_j = x_j'\beta + \varepsilon_j$, $j = 1, 2, 3$, where the regressor $x$ measures the different
attributes of the three goods and the error $\varepsilon$ can range over $(-\infty, \infty)$. Defining
$u_1 = U_1 - U_2$ and $u_2 = U_1 - U_3$, we have that

$$ p_1 = \int_0^\infty \int_0^\infty g(u_1, u_2)\, du_1\, du_2, \tag{12.3} $$

where $g(u_1, u_2)$, or more formally $g(u_1, u_2|x, \theta)$, is the bivariate density of
$(u_1, u_2)$, or equivalently

$$ p_1 = \int_{-\infty}^\infty \int_{-\infty}^\infty 1[u_1 > 0, u_2 > 0]\, g(u_1, u_2)\, du_1\, du_2, \tag{12.4} $$

where $1[A]$ is the indicator function equal to 1 if event $A$ happens and equal to 0
otherwise.

The integral (12.4) is of the form (12.1). Because the integral is over only part of
the range of $(u_1, u_2)$ (see (12.3)) a closed-form solution may not exist, even though we
know that $\int\int g(u_1, u_2)\, du_1\, du_2 = 1$ if integration is over the entire range of
$(u_1, u_2)$.

In particular, if the errors $\varepsilon$ are normally distributed, as in the multinomial probit
model, the integral (12.3) is over the positive orthant of a bivariate normal distribution.
There is no closed-form solution for $p_1$ and hence no tractable expression for the
density $f(y|x, \theta)$ exists. In practice the dimension of the integral can be very high, making
numerical approximation difficult, because for choice among $m$ mutually exclusive
alternatives the integral has dimension $m - 1$. Until simulation-based estimators were
developed researchers either used models with $m < 4$ or chose other error
distributions such as that leading to the much more restricted multinomial logit model.
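The representation (12.4), an expectation of an indicator function, suggests the simplest possible simulation approximation: draw the errors, compute the utilities, and record how often alternative 1 has the highest utility. The sketch below does this under assumptions chosen only for illustration: the errors are independent standard normal, the systematic utilities $x_j'\beta$ are passed in directly as numbers, and the function name `simulate_p1` is hypothetical.

```python
import random


def simulate_p1(v, n_draws, rng):
    """Frequency estimate of p1 = Pr[U1 - U2 > 0, U1 - U3 > 0],
    where U_j = v_j + eps_j and eps_j are independent N[0, 1] draws.

    v holds the three systematic utilities (x_j' beta in the text's notation).
    """
    hits = 0
    for _ in range(n_draws):
        u = [vj + rng.gauss(0.0, 1.0) for vj in v]
        if u[0] - u[1] > 0 and u[0] - u[2] > 0:
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    rng = random.Random(0)
    # With identical systematic utilities each alternative is chosen
    # with probability 1/3 by symmetry.
    print(simulate_p1([0.0, 0.0, 0.0], 50000, rng))
```

The same loop works unchanged for more alternatives if the comparison is extended, which is the attraction of simulation when the integral has dimension $m - 1$; the cost is simulation noise that shrinks only at rate $1/\sqrt{S}$ in the number of draws $S$.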


12.2.3. ML Estimation

For simplicity consider the MLE. Assume independence over observations and that $y$
has conditional density $f(y|x, \theta)$.

The complication in the preceding two examples is that ML estimation is not
practical as there is no closed-form expression for $f(y|x, \theta)$, which is defined by an integral
that does not simplify. Instead, we replace the integral by a numerical approximation
$\hat{f}(y|x, \theta)$, and we maximize

$$ \ln \hat{L}_N(\theta) = \sum_{i=1}^N \ln \hat{f}(y_i|x_i, \theta) $$

with respect to $\theta$. The estimator will be consistent and have the same asymptotic
distribution as the MLE if $\hat{f}(y|x, \theta)$ is a good approximation to $f(y|x, \theta)$.

The resulting first-order conditions are usually nonlinear and are solved by iterative
methods. Because $\hat{f}(y_i|x_i, \theta)$ varies with $i$ and $\theta$, evaluation of the gradient using
numerical derivatives will require at least $Nqr$ evaluations, where $N$ is the sample
size, $q$ is the dimension of $\theta$, and $r$ is the number of iterations. For example, with
1,000 observations, 10 parameters, and 50 iterations there are at least 500,000 function
evaluations.

This standard computational demand for nonlinear models now needs to be
multiplied by the number of evaluations needed to compute an adequate approximation
to the integral $f(y|x, \theta)$. Clearly, methods that require relatively few evaluations are
desired.


12.2.4.  Bayesian  Methods 

Bayesian methods are given a separate treatment in Chapter 13. They involve
computation of integrals that appear similar to (12.2), but they go one step further and obtain
the (posterior) distribution of parameters rather than a point estimate such as the MLE.


12.3.  Basics  of  Computing  Integrals 


We consider the integral

$$ I = \int_a^b f(x)\, dx, \tag{12.5} $$

where $f(\cdot)$ is continuous on $[a, b]$, and the bounds of the integral need not be finite,
so $a = -\infty$ and/or $b = \infty$ are possible. In this section $x$ is initially a scalar and is
used to denote the variable being integrated out. In regression applications integration
is often with respect to a vector that is denoted $u$ since $x$ then denotes the regressors
(see (12.1)). It is assumed that the integral exists, an important qualification that needs
to be checked as approximation methods will yield a finite estimate of $I$ even if the
integral diverges.

We first present numerical integration or quadrature, useful for low-dimensional
integrals. This is followed by Monte Carlo integration, which works better for
high-dimensional integrals and is the focus of this chapter.

The material in this section pertains to the implementation phase of
simulation-based estimation; therefore, some readers may prefer to read it after covering
Sections 12.4-12.6.


12.3.1.  Deterministic  Numerical  Integration 

An integral can be interpreted as an area or a volume measure. Deterministic
numerical integration or quadrature replaces the volume by a series of slices of smaller
volumes that are then added up. Formally this involves evaluating the integrand at
several points and taking a weighted sum of these values. The prefix deterministic is used
to indicate that this method of approximation of an integral does not entail simulation.


Simpson’s  Rule 

By the definition of an integral,

$$ I = \lim_{\Delta x_j \to 0} \sum_{j=1}^n f(x_j)\, \Delta x_j, \tag{12.6} $$

where the range $[a, b]$ of $x$ is split into $(n + 1)$ points, $x_0 < x_1 < \cdots < x_n$, and
$n \to \infty$. Standard approximation methods are refinements of (12.6) that provide more
accurate approximations for finite $n$. We present results for equally spaced points,
though the methods can be generalized to evaluation at points that are not equally
spaced. For simplicity we assume that $f(x)$ can be evaluated at the limit points $a$
and $b$.

The midpoint rule evaluates at the midpoint $\bar{x}_j = \frac{1}{2}(x_{j-1} + x_j)$ of the interval
$[x_{j-1}, x_j]$ and sums $n$ rectangles that have base $(b - a)/n$ and height $f(\bar{x}_j)$. Thus
$I$ is approximated by

$$ \hat{I}_M = \sum_{j=1}^n \frac{b - a}{n} f(\bar{x}_j). \tag{12.7} $$

The trapezoidal rule is an improvement that draws a straight line between $f(x_{j-1})$ and
$f(x_j)$ and sums $n$ trapezoids that have base $(b - a)/n$ and average height
$(f(x_{j-1}) + f(x_j))/2$. Thus $I$ is approximated by

$$ \hat{I}_T = \sum_{j=1}^n \frac{(b - a)}{n} \frac{f(x_{j-1}) + f(x_j)}{2}. \tag{12.8} $$

Simpson’s rule uses a quadratic curve among three successive points $f(x_{j-1})$, $f(x_j)$,
and $f(x_{j+1})$, whereas the trapezoidal rule used a line between two successive points.
This leads to the approximation

$$ \hat{I}_S = \sum_{j=0}^n \frac{(b - a)}{3n} w_j f(x_j), \tag{12.9} $$

where $n$ is even, $w_j = 4$ if $j$ is odd, and $w_j = 2$ if $j$ is even, except $w_0 = w_n = 1$.
Further generalization to permit a polynomial of degree $p$ among $p + 1$ successive
points is possible.

Error bounds for these approximations increase as a power function of the range
of integration, $b - a$, and decrease as a power function of the number of intervals.
For Simpson’s rule, $|\hat{I}_S - I| \leq M_4 (b - a)^5 / 180n^4$, where $M_4$ is the maximum
absolute value of the fourth derivative of $f(x)$ on $[a, b]$. For the trapezoidal rule,
$|\hat{I}_T - I| \leq M_2 (b - a)^3 / 12n^2$, where $M_2$ is the maximum absolute value of the second
derivative of $f(x)$ on $[a, b]$. Clearly, the number of intervals needs to increase with the
range of $x$, and one should test for sensitivity to the number of intervals.
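The three rules (12.7) to (12.9) are short enough to write out directly. The sketch below uses equally spaced points as in the text; the function names are illustrative, and the test integrand ($\sin x$ on $[0, \pi]$, whose integral is 2) is chosen only because its derivatives are bounded by 1, so the error bounds above are easy to evaluate.

```python
import math


def midpoint(f, a, b, n):
    """Midpoint rule (12.7): n rectangles evaluated at interval midpoints."""
    h = (b - a) / n
    return h * sum(f(a + (j - 0.5) * h) for j in range(1, n + 1))


def trapezoid(f, a, b, n):
    """Trapezoidal rule (12.8): average height of adjacent endpoints."""
    h = (b - a) / n
    return h * sum((f(a + (j - 1) * h) + f(a + j * h)) / 2.0 for j in range(1, n + 1))


def simpson(f, a, b, n):
    """Simpson's rule (12.9): weights 1, 4, 2, 4, ..., 2, 4, 1 with n even."""
    if n % 2 != 0:
        raise ValueError("n must be even for Simpson's rule")
    h = (b - a) / n
    total = 0.0
    for j in range(n + 1):
        if j == 0 or j == n:
            w = 1.0
        elif j % 2 == 1:
            w = 4.0
        else:
            w = 2.0
        total += w * f(a + j * h)
    return h / 3.0 * total


if __name__ == "__main__":
    for rule in (midpoint, trapezoid, simpson):
        approx = rule(math.sin, 0.0, math.pi, 20)
        print(rule.__name__, approx, abs(approx - 2.0))
```

Rerunning with a larger `n` is a direct way to carry out the sensitivity check recommended above.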

Simpson’s rule and related rules can work well for definite integrals over a bounded
interval, but problems can clearly arise with indefinite integrals because of problems
in evaluating in the tails. For example, suppose $[a, b] = [0, \infty)$. Then in choosing
$x_n$ there is a trade-off because the upper bound $x_n$ should be large, but then the
distance between evaluation points is large. At the least one should test for sensitivity to
increases in $x_n$.


Gaussian  Quadrature 

Gaussian quadrature, where quadrature is an alternative name for numerical
integration, was proposed by Gauss in 1814. It provides a rule for good choice of the
evaluation points $x_j$, no longer equally spaced, and is especially useful for evaluating
indefinite integrals.

We first reexpress the integral (12.5) as

$$ I = \int_c^d w(x) r(x)\, dx, \tag{12.10} $$

where $w(x)$ is usually one of the following three functions, depending on the range
of $x$: Gauss-Hermite quadrature sets $w(x) = e^{-x^2}$ and is used for
$[c, d] = (-\infty, \infty)$, Gauss-Laguerre quadrature sets $w(x) = e^{-x}$ and is used when
$[c, d] = (0, \infty)$, and Gauss-Legendre quadrature sets $w(x) = 1$ and is used when
$[c, d] = [-1, 1]$.

In the simplest case (12.10) can be obtained from (12.5) by defining
$r(x) = f(x)/w(x)$. More generally, a transformation of $x$ may be needed so that, for example,
the range $[2, \infty)$ in (12.5) becomes $[0, \infty)$ in (12.10). Some routines permit the user
to simply provide $f(x)$ and the range of integration and automatically take care of any
necessary transformations.

Gaussian quadrature approximates the integral (12.10) by the weighted sum

$$ \hat{I}_G = \sum_{j=1}^m w_j r(x_j), \tag{12.11} $$

where the researcher chooses $m$. The $m$ points of evaluation $x_j$ and the weights $w_j$ are
given in books such as Abramowitz and Stegun (1971) or in computer code such as
that provided in Press et al. (1993).

The theory behind the approximation is based on the orthogonal polynomials of
$w(x)$, denoted $p_j(x)$, $j = 0, \ldots, m$, that satisfy

$$ \int_c^d w(x) p_j(x) p_k(x)\, dx = 0, \quad j \neq k, \quad j, k = 0, \ldots, m. $$

If additionally $\int_c^d w(x) p_j(x)^2\, dx = 1$ then the polynomials are said to be orthonormal.
The approximation (12.11) is exact if $r(x)$ is a polynomial of order $2m - 1$ or less,
so the approximation works best if $r(x)$ in (12.10) is well approximated by a
polynomial of order $2m - 1$. A good choice of the number of evaluation points $m$ requires
experimentation, but many applications use $m$ no more than 20 or 30.

As an example consider Gauss-Hermite quadrature, commonly used in
econometrics since integration is often over $(-\infty, \infty)$. For $w(x) = e^{-x^2}$ the orthogonal
polynomials $p_j(x)$ are the Hermite polynomials $H_j(x)$, which in the orthonormal form are
generated using the recursion
$H_{j+1}(x) = \sqrt{2/(j + 1)}\, x H_j(x) - \sqrt{j/(j + 1)}\, H_{j-1}(x)$,
$j = 0, \ldots, m - 1$, where $H_{-1} = 0$ and $H_0 = \pi^{-1/4}$. The $m$ abscissas $x_j$ are
obtained as the $m$ roots to $H_m(x) = 0$ and, for orthonormal Hermite polynomials, the
weights $w_j = 1/[m H_{m-1}(x_j)^2]$. As already noted $x_j$ and $w_j$ for given $m$ are readily
available in tables or computer code.
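To make the Gauss-Hermite construction concrete, the following sketch generates the orthonormal Hermite polynomials by the recursion, locates the roots of $H_m$, and forms the weights. It is an illustration rather than production code: the function names are hypothetical, and the root search (a grid scan followed by bisection, relying on the roots lying inside $\pm\sqrt{2m + 1}$) is a simple choice made here for transparency and is only intended for moderate $m$. In practice tabulated or library-supplied nodes and weights would normally be used.

```python
import math


def hermite_orthonormal(m, x):
    """Return (H_m(x), H_{m-1}(x)) for the orthonormal Hermite polynomials,
    using H_{-1} = 0, H_0 = pi^(-1/4) and the three-term recursion."""
    h_prev, h = 0.0, math.pi ** -0.25
    for j in range(m):
        h_prev, h = h, (math.sqrt(2.0 / (j + 1)) * x * h
                        - math.sqrt(j / (j + 1)) * h_prev)
    return h, h_prev


def gauss_hermite(m):
    """Nodes and weights for the weight function w(x) = exp(-x^2)."""
    radius = math.sqrt(2 * m + 1)  # all roots of H_m lie inside (-radius, radius)
    cells = 200 * m + 1            # odd count, so x = 0 is never a grid point
    step = 2.0 * radius / cells

    def f(x):
        return hermite_orthonormal(m, x)[0]

    nodes = []
    a, fa = -radius, f(-radius)
    for k in range(1, cells + 1):
        b = -radius + k * step
        fb = f(b)
        if (fa < 0) != (fb < 0):   # sign change: bisect to locate the root
            lo, hi, flo = a, b, fa
            for _ in range(80):
                mid = 0.5 * (lo + hi)
                fmid = f(mid)
                if (flo < 0) != (fmid < 0):
                    hi = mid
                else:
                    lo, flo = mid, fmid
            nodes.append(0.5 * (lo + hi))
        a, fa = b, fb
    weights = [1.0 / (m * hermite_orthonormal(m, x)[1] ** 2) for x in nodes]
    return nodes, weights


def normal_expectation(g, m):
    """E[g(z)] for z ~ N[0, 1], via the change of variable z = sqrt(2) * x."""
    nodes, weights = gauss_hermite(m)
    return sum(w * g(math.sqrt(2.0) * x) for x, w in zip(nodes, weights)) / math.sqrt(math.pi)


if __name__ == "__main__":
    print(gauss_hermite(2))
    print(normal_expectation(lambda z: z ** 2, 5))
```

The helper `normal_expectation` shows the usual econometric use: an expectation with respect to a standard normal variable is put in the form (12.10) by the substitution $z = \sqrt{2}\,x$, which introduces the factor $\pi^{-1/2}$.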

For definite integrals Gauss-Legendre quadrature usually performs better than
Simpson’s rule. The real advantage of Gaussian quadrature, however, is for
indefinite integrals. Note that if integration is over $(-\infty, \infty)$ it may be possible by change
of variable techniques to transform to an integral over $(0, \infty)$ and use Gauss-Laguerre
quadrature rather than Gauss-Hermite quadrature.

There are many additional deterministic methods for computing integrals, including
Laplace approximation (Tierney, Kass, and Kadane, 1989).


12.3.2.  Integration  by  Direct  Monte  Carlo  Sampling 

Monte  Carlo  integration  provides  an  alternative  to  deterministic  numerical  integration. 
In  general  the  Monte  Carlo  integral  estimate  of  I  =  fb  f{x)  dx  is 

s 

4rc=  (12-12) 

S=1 

where  x1, ...  ,xs  are  S  uniform  draws  of  x  in  the  range  [a,  b\.  Compared  to  the  mid¬ 
point  rule  we  evaluate  f(x)  at  S  randomly  chosen  points  rather  than  n  deterministic 
midpoints. 

We  focus  on  regression  applications  such  as  those  given  in  Section  12.2.  Then 
integration  arises  because  we  wish  to  obtain  an  expected  value  E[/7(x)],  say,  where 
the  expectation  is  with  respect  to  a  random  variable  x  that  has,  say,  pdf  g(.r).  In  the 
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continuous  case  we  wish  to  evaluate 

E  [h(x)]=  f  h(x)g(x)dx ,  (12.13) 

J  a 

where  throughout  this  chapter  it  is  assumed  that  E[/;(a  )]  <  oo,  that  is,  the  integral  con¬ 
verges.  Then  E[/i(x)]  can  be  estimated  by  the  direct  Monte  Carlo  integral  estimate 

s 

/DMC  =  E  [*(*)]  =  S -1  h (*').  (12.14) 

5=1 

where  {xs,  s  =  1.  . . . ,  .S’)  is  a  Monte  Carlo  sample  of  S  pseudo-random  numbers  from 
the  density  g(x),  obtained  using  methods  given  later  in  Section  12.8.  The  estimate 
(12.14)  evaluates  h(x )  using  draws  of  x  from  the  density  g(x),  whereas  the  estimate 
(12.12)  evaluates  h(x)g{x)  using  uniform  draws  of  x  as  in  (12.12).  An  advantage  of 
(12.14)  is  that  it  can  be  applied  to  indefinite  integrals,  whereas  obtaining  uniform  draws 
in  (12.12)  is  problematic  if  the  limits  a  or  b  are  unbounded. 

The estimate $\hat{\mathrm{E}}[h(x)]$ is an average of the function $h(\cdot)$ evaluated at each of the
random draws $x^s$. Equivalently, $\hat{\mathrm{E}}[h(x)]$ is an average of the random variable $h(x^s)$,
and its properties as $S \to \infty$ can be obtained if we can apply a law of large numbers
and a central limit theorem. Here $x^s$ is iid, so $h(x^s)$ is iid and we can apply the Kolmogorov
LLN (see Appendix A, Theorem A.8) since the existence of $\mathrm{E}[h(x)]$ has already been
assumed. It follows that

$$\hat{\mathrm{E}}[h(x)] \xrightarrow{p} \mathrm{E}[h(x)] \quad \text{as } S \to \infty.$$

Also, since $h(x^s)$ is iid, the variance of $\hat{\mathrm{E}}[h(x)]$ equals $S^{-1}\mathrm{V}[h(x)]$, assuming $\mathrm{V}[h(x)]$
exists. The approximation is likely to be good for moderate size $S$ if $S^{-1}\mathrm{V}[h(x^s)]$ is
small.
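
The difference between the uniform-draw estimate (12.12) and the direct estimate (12.14) can be illustrated with a minimal sketch. The choices below are assumptions made only for illustration and do not come from the text: $g$ is taken to be the Beta(2, 2) density $6x(1-x)$ on $[0, 1]$, $h(x) = x^2$ (so that $\mathrm{E}[h(x)] = 0.3$), and the helper names `uniform_mc` and `direct_mc` are hypothetical.

```python
import numpy as np

def uniform_mc(f, a, b, S, rng):
    """Estimate (12.12): (b - a) times the average of f at S uniform draws on [a, b]."""
    x = rng.uniform(a, b, size=S)
    return (b - a) * np.mean(f(x))

def direct_mc(h, draw_from_g, S, rng):
    """Estimate (12.14): average of h at S draws from g, with its standard error."""
    hx = h(draw_from_g(rng, S))
    return np.mean(hx), np.std(hx, ddof=1) / np.sqrt(S)

# Illustrative choices (assumptions, not from the text):
# g is the Beta(2, 2) density 6x(1 - x) on [0, 1] and h(x) = x^2, so E[h(x)] = 0.3.
h = lambda x: x**2
g = lambda x: 6.0 * x * (1.0 - x)

rng = np.random.default_rng(12345)   # arbitrary seed
S = 200_000
est_uniform = uniform_mc(lambda x: h(x) * g(x), 0.0, 1.0, S, rng)
est_direct, se_direct = direct_mc(h, lambda r, n: r.beta(2.0, 2.0, size=n), S, rng)
print(est_uniform, est_direct, se_direct)
```

Both estimates target the same integral; the direct version needs draws from $g$ but never evaluates $g$ itself, which is what allows it to handle unbounded ranges.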


12.3.3.  Integral  Computation  Example 
Suppose $x \sim \mathcal{N}[0, 1]$, and we wish to compute the mean

$$\mathrm{E}[x] = \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\,dx$$

and the moment $\mathrm{E}[\exp(-\exp(x))]$, defined as the integral

$$\mathrm{E}[\exp(-\exp(x))] = \int_{-\infty}^{\infty} \exp(-\exp(x))\,\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\,dx.$$

An analytical expression for $\mathrm{E}[x]$ exists and yields $\mathrm{E}[x] = 0$. By contrast an analytical
solution for $\mathrm{E}[\exp(-\exp(x))]$ does not exist. Before seeking a numerical approximation,
we first confirm that the integral does indeed converge. Since $\exp(-\exp(x))$ is
strictly positive, monotonically decreasing, and bounded above by 1, it follows that
$|\exp(-\exp(x))| < 1$, so $\mathrm{E}[\exp(-\exp(x))] < \mathrm{E}[1] = 1$ and the integral converges.

These one-dimensional integrals are easily calculated using a deterministic numerical
approximation. For example, consider using the midpoint rule with $n = 20$ equally
spaced evaluations over the interval from $-5$ to $5$. Then

$$\hat{\mathrm{E}}[x] = (\sqrt{2\pi})^{-1}\sum_{j=1}^{20}\frac{10}{20}\,x_j\exp(-x_j^2/2),$$

$$\hat{\mathrm{E}}[\exp(-\exp(x))] = (\sqrt{2\pi})^{-1}\sum_{j=1}^{20}\frac{10}{20}\exp(-\exp(x_j))\exp(-x_j^2/2),$$

where $x_j = -5.25 + j/2$ is the midpoint of the $j$th interval. This yields $\hat{\mathrm{E}}[x] = 0$ to many
decimal places, as expected, whereas $\hat{\mathrm{E}}[\exp(-\exp(x))] = 0.38175656$. The latter estimate
changes little, not until the eighth decimal place, if instead we do $n = 200$ evaluations
between $-10$ and $10$. Clearly deterministic numerical approximations work well here.
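
The midpoint-rule calculation just described can be reproduced with a short sketch. The function name `midpoint_expectation` is hypothetical, and the code simply applies the rule with equal-width intervals to the standard normal density; it is an illustration, not the authors' program.

```python
import math

def midpoint_expectation(h, lower, upper, n):
    """Midpoint-rule approximation to E[h(x)] for x ~ N[0, 1], truncated to [lower, upper]."""
    width = (upper - lower) / n
    total = 0.0
    for j in range(1, n + 1):
        xj = lower + (j - 0.5) * width          # midpoint of the j-th interval
        total += width * h(xj) * math.exp(-xj**2 / 2)
    return total / math.sqrt(2 * math.pi)

# n = 20 on [-5, 5] gives midpoints x_j = -5.25 + j/2, as in the text.
mean_x = midpoint_expectation(lambda x: x, -5.0, 5.0, 20)
moment = midpoint_expectation(lambda x: math.exp(-math.exp(x)), -5.0, 5.0, 20)
moment_fine = midpoint_expectation(lambda x: math.exp(-math.exp(x)), -10.0, 10.0, 200)
print(mean_x, moment, moment_fine)
```

Comparing `moment` with `moment_fine` gives a quick check of how sensitive the answer is to the grid and to the truncation points.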

These integrals are also easily calculated using a Monte Carlo approximation, with

$$\hat{\mathrm{E}}[x] = \frac{1}{S}\sum_{s=1}^{S} x^s,$$

$$\hat{\mathrm{E}}[\exp(-\exp(x))] = \frac{1}{S}\sum_{s=1}^{S}\exp(-\exp(x^s)),$$

where $x^s$ is the $s$th draw of $S$ draws from the $\mathcal{N}[0, 1]$ distribution, and a method
to make such draws is given in Appendix B. Table 12.1 gives estimates of $\mathrm{E}[x]$
and $\mathrm{E}[\exp(-\exp(x))]$ for various numbers of simulations $S$. Observe the tendency
of the estimators to stabilize as $S \to \infty$, and to go to their respective true values of
0 and 0.38175656, where the latter is obtained by deterministic numerical approximation.
However, even with $S = 10^6$ the estimate $\hat{\mathrm{E}}[x]$ still differs from zero in the
fourth decimal place. Here $\mathrm{V}[\hat{\mathrm{E}}[x]] = S^{-1}\mathrm{V}[x^s] = 1/S$ since $\mathrm{V}[x^s] = 1$, so even with
$S = 10^6$ the standard deviation of $\hat{\mathrm{E}}[x]$ is a relatively large 0.001. Alternative methods
that yield a Monte Carlo approximation with lower variance are given in Section 12.7.
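
A minimal sketch of the Monte Carlo calculation follows. The seed and the use of NumPy's `standard_normal` generator are assumptions made for illustration, so the numbers it prints will not match any published table digit for digit.

```python
import numpy as np

rng = np.random.default_rng(2005)   # arbitrary seed, an assumption for reproducibility
S = 1_000_000
x = rng.standard_normal(S)          # S draws from N[0, 1]

mean_hat = x.mean()                          # estimate of E[x], true value 0
moment_hat = np.exp(-np.exp(x)).mean()       # estimate of E[exp(-exp(x))]

# Standard deviation of the estimate of E[x] is sqrt(V[x]/S) = 1/sqrt(S), here 0.001.
sd_mean_hat = 1.0 / np.sqrt(S)
print(mean_hat, moment_hat, sd_mean_hat)
```

Rerunning with a smaller `S` shows the slow $1/\sqrt{S}$ rate at which the simulation error shrinks.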


Table 12.1. Monte Carlo Integration: Example for x Standard Normal

S = Number of simulations    $\hat{\mathrm{E}}[x]$    $\hat{\mathrm{E}}[\exp(-\exp(x))]$
10                            0.145     0.336
25                           -0.209     0.435
50                            0.050     0.369
100                          -0.120     0.409
500                          -0.059     0.398
1,000                         0.005     0.382
10,000                       -0.007     0.383
100,000                      -0.000     0.382
1,000,000                    -0.000     0.381


12.3.4.  Higher  Dimensional  Integrals

Higher dimensional integrals can be evaluated using either deterministic or Monte
Carlo integration, with the latter method preferred as the dimension increases.

Deterministic integration is best done using multivariate Gaussian quadrature or, if
the limits of integration are not too complicated, by reducing an $m$-dimensional integral
to a series of $m$ one-dimensional integrals evaluated using, say, Gaussian quadrature.
However, from the definition of the integral in (12.6) it is clear that the number
of evaluations will have to go up by the power $m$. For example, if 20 function evaluations
are needed for a one-dimensional integral, then a five-dimensional integral may
require $20^5$, or 3.2 million, function evaluations. Such high precision may not be needed
in an estimation setting where similar computations are being done for each individual
observation and then summed, but even then the number of evaluations will rise
substantially with the dimension of the integral.

Performing Monte Carlo integration in higher dimensions is straightforward: Just
define $\mathbf{x}$ in (12.13) and (12.14) to be a vector, and make draws from the multivariate
density $g(\mathbf{x})$. There is apparently no curse of dimensionality. One should bear in mind,
however, that simple Monte Carlo integration will not work if the integrand is strongly
peaked, and it is possible that such peaks may become more prominent in higher
dimensions. In particular, for the discrete choice example in Section 12.2.2 the integrand
in (12.4) may be nonzero over only a small part of the range of $(u, v)$, a complication
pursued in Section 12.7. Moreover, drawing from a multivariate distribution can be
more difficult than drawing from a univariate distribution.
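
The contrast in computational cost can be sketched as follows. The integrand is an assumption chosen only because its expectation is known in closed form: with $\mathbf{x} \sim \mathcal{N}[\mathbf{0}, \mathbf{I}_m]$ and $h(\mathbf{x}) = \exp(-\mathbf{x}'\mathbf{x}/2)$, $\mathrm{E}[h(\mathbf{x})] = 2^{-m/2}$. A product grid with 20 points per dimension needs $20^m$ evaluations, whereas the Monte Carlo estimate below uses a fixed number of draws.

```python
import numpy as np

m = 5                                  # dimension of the integral
grid_evaluations = 20 ** m             # 20 points per dimension on a product grid

# Illustrative integrand (an assumption): h(x) = exp(-x'x/2) with x ~ N[0, I_m],
# for which E[h(x)] = 2^(-m/2) is known in closed form.
rng = np.random.default_rng(7)         # arbitrary seed
S = 100_000
x = rng.standard_normal((S, m))        # S draws from the m-variate standard normal
h = np.exp(-0.5 * np.sum(x**2, axis=1))

estimate = h.mean()
std_error = h.std(ddof=1) / np.sqrt(S)
exact = 2.0 ** (-m / 2)
print(grid_evaluations, estimate, std_error, exact)
```

The standard error depends on $\mathrm{V}[h(\mathbf{x})]$ and $S$, not directly on $m$, although a strongly peaked integrand would raise that variance.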


12.4.  Maximum  Simulated  Likelihood  Estimation 

We now consider application of these ideas to ML estimation when no analytical
expression is available for the density. The key result is that simulation can lead to an
estimator with the same distribution as the MLE, provided that the number of simulation
draws made to compute the density for each observation goes to infinity.


12.4.1.  Simulators 

Suppose the conditional density $f(y|\mathbf{x}, \boldsymbol{\theta})$ for an observation involves an intractable
integral. Specifically, suppose that, as in (12.1),

$$f(y_i|\mathbf{x}_i, \boldsymbol{\theta}) = \int h(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i)\,g(\mathbf{u}_i)\,d\mathbf{u}_i, \tag{12.15}$$

which needs to be estimated if there is no closed-form solution.

The direct simulator for $f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ is the obvious Monte Carlo integral estimate

$$\hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}) = \frac{1}{S}\sum_{s=1}^{S} h(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s), \tag{12.16}$$

where $\mathbf{u}_{iS}$ is a vector of $S$ draws $\mathbf{u}_i^s$, $s = 1, \dots, S$, that are independent draws from
$g(\mathbf{u}_i)$. This simply averages $h(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s)$ over the $S$ draws. From Section 12.3.2, $\hat{f}_i$
is unbiased for $f_i$ and is consistent for $f_i$ as the number of draws $S \to \infty$.

Simulators other than the direct simulator can be used, and these are detailed in
Section 12.7. These can yield an estimate $\hat{f}_i$ that better approximates $f_i$ for a finite number
of draws by, for example, permitting correlation among the draws provided they still
have marginal distribution $g(\mathbf{u}_i)$. More generally, then, a simulator for $f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ is
a Monte Carlo estimate

$$\hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}) = \frac{1}{S}\sum_{s=1}^{S}\tilde{f}(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s), \tag{12.17}$$

where $\mathbf{u}_i^s$, $s = 1, \dots, S$, are $S$ draws with marginal density $g(\mathbf{u}_i)$ but not necessarily
independent over $s$. To be useful the simulator must satisfy $\hat{f}_i \xrightarrow{p} f_i$ as $S \to \infty$. This is likely if
the subsimulator $\tilde{f}(\cdot)$ is an unbiased simulator with the property that

$$\mathrm{E}[\tilde{f}(y|\mathbf{x}, \boldsymbol{\theta}, \mathbf{u}^s)] = f(y|\mathbf{x}, \boldsymbol{\theta}). \tag{12.18}$$

A desirable property of a simulator is that $\hat{f}_i$ be differentiable in $\boldsymbol{\theta}$, so that standard
iterative gradient methods can be used to compute the estimate of $\boldsymbol{\theta}$. To eliminate
"chatter" caused by simulation and ensure numerical convergence, the underlying
Monte Carlo draws used to construct $\hat{f}_i$ should not be redrawn as $\boldsymbol{\theta}$ changes across
iterations.
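
The point about holding the draws fixed can be illustrated with a minimal sketch of a direct simulator. The model is hypothetical and chosen only because the true density is known: $h(y|\theta, u)$ is the $\mathcal{N}[\theta + u, 1]$ density and $u \sim \mathcal{N}[0, 1]$, so that $f(y|\theta)$ is the $\mathcal{N}[\theta, 2]$ density. The function names are likewise assumptions.

```python
import numpy as np

def normal_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def direct_simulator(y, theta, u_draws):
    """Average h(y | theta, u^s) over a fixed set of draws, in the spirit of (12.16)."""
    return np.mean(normal_pdf(y - theta - u_draws))

# Hypothetical model for illustration: h(y | theta, u) is the N[theta + u, 1] density
# and u ~ N[0, 1], so the true density f(y | theta) is N[theta, 2].
rng = np.random.default_rng(42)            # arbitrary seed
u_fixed = rng.standard_normal(50_000)      # drawn once, reused for every theta

y = 0.3
f_hat_a = direct_simulator(y, 1.0, u_fixed)
f_hat_b = direct_simulator(y, 1.0, u_fixed)    # same theta, same draws
f_true = np.exp(-0.25 * (y - 1.0) ** 2) / np.sqrt(4.0 * np.pi)

# Redrawing u at each call would instead add simulation noise ("chatter").
f_hat_redrawn = direct_simulator(y, 1.0, rng.standard_normal(50_000))
print(f_hat_a, f_hat_b, f_hat_redrawn, f_true)
```

With fixed draws the simulated density is a deterministic, smooth function of $\theta$, which is what a gradient-based optimizer requires.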


12.4.2.  MSL  Estimator 

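Given independence over $i$, the maximum likelihood estimator $\hat{\boldsymbol{\theta}}_{ML}$ maximizes
$\ln L_N(\boldsymbol{\theta}) = \sum_{i=1}^{N}\ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$. The maximum simulated likelihood (MSL) estimator
$\hat{\boldsymbol{\theta}}_{MSL}$ instead maximizes the log-likelihood based on a simulated estimate of the
density, or

$$\ln \hat{L}_N(\boldsymbol{\theta}) = \sum_{i=1}^{N}\ln \hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}), \tag{12.19}$$

where the simulator $\hat{f}(\cdot)$ is defined in (12.17). If $\hat{f}(\cdot)$ is differentiable in $\boldsymbol{\theta}$ then $\hat{\boldsymbol{\theta}}_{MSL}$
can be computed using the standard gradient methods of Chapter 10, with either
analytical or numerical derivatives used.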


12.4.3.  Distribution  of  the  MSL  Estimator 

From the general consistency proof method outlined in Section 5.3.2, the MSL estimator
will have the same probability limit as the ML estimator if the approximating
objective function $N^{-1}\ln \hat{L}_N(\boldsymbol{\theta})$ has the same probability limit as the original objective
function $N^{-1}\ln L_N(\boldsymbol{\theta})$. This occurs if $\ln \hat{f}_i - \ln f_i \xrightarrow{p} 0$, which in turn happens if
$\hat{f}_i - f_i \xrightarrow{p} 0$ as $S \to \infty$.

Even if the MSL estimator is consistent, it is possible that simulation error will inflate
the variance of the MSL estimator compared to the ML estimator. As an example
of a formal statement of conditions under which the MSL estimator is fully efficient
we give the following proposition, which is a rephrasing of a theorem in Gourieroux
and Monfort (1991).


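Proposition 12.1 (Distribution of MSL Estimator) (Gourieroux and Monfort
1991): Assume the following:

(i) The data are from a simple random sample from a dgp with conditional density
$f(y|\mathbf{x}, \boldsymbol{\theta}_0)$ that satisfies the regularity conditions so that the ML estimator is
consistent and asymptotically normal with limit variance matrix $\mathbf{A}^{-1}(\boldsymbol{\theta}_0)$, where

$$\mathbf{A}(\boldsymbol{\theta}_0) = -\operatorname{plim}\, N^{-1}\sum_{i=1}^{N}\left.\frac{\partial^2 \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}.$$

(ii) The density $f$ is estimated using the simulator $\hat{f}$ in (12.17) with $\tilde{f}$ unbiased
for $f$.

Then the maximum simulated likelihood estimator defined in (12.19) is asymptotically
equivalent to the ML estimator if $S, N \to \infty$ and $\sqrt{N}/S \to 0$, and it has a limit
normal distribution with

$$\sqrt{N}(\hat{\boldsymbol{\theta}}_{MSL} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}^{-1}(\boldsymbol{\theta}_0)\right]. \tag{12.20}$$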


The MSL estimator is actually consistent under the weaker condition that $S, N \to \infty$.
This is satisfied if, for example, $S = N^{0.4}/a$ for some constant $a$. However, then
$\sqrt{N}/S = aN^{0.1} \to \infty$, so the MSL estimator is not fully efficient according to
Proposition 12.1. By the usual first-order Taylor series expansion the limit distribution of
$\sqrt{N}(\hat{\boldsymbol{\theta}}_{MSL} - \boldsymbol{\theta}_0)$ is a matrix multiple of $N^{-1/2}\sum_i \partial \ln \hat{f}_i/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}$, which depends on
both variability of $\partial \ln f/\partial\boldsymbol{\theta}$ and simulation error in the approximation $\hat{f}_i$.
Proposition 12.1 says that for this simulation error to disappear asymptotically the number of
draws $S$ must increase with sample size at a rate in excess of $\sqrt{N}$.

The variance matrix of the MSL estimator requires estimation of $\mathbf{A}(\boldsymbol{\theta}_0)$. It is easiest
to use a simulated variant of the BHHH estimate defined in Section 5.5.2. Since
$\partial \ln f/\partial\boldsymbol{\theta} = (\partial f/\partial\boldsymbol{\theta})/f$, the BHHH estimate for the information matrix is

$$\frac{1}{N}\sum_{i=1}^{N}\frac{\partial f_i(\boldsymbol{\theta})/\partial\boldsymbol{\theta}}{f_i(\boldsymbol{\theta})}\,\frac{\partial f_i(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'}{f_i(\boldsymbol{\theta})}.$$

Because there is no closed-form solution for $f$ and $\partial f/\partial\boldsymbol{\theta}$ this expression cannot
be computed. So we replace $f$ by the simulator $\hat{f}_i$ defined in (12.17), yielding the
simulated estimate of the asymptotic variance

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}_{MSL}] = \left[\sum_{i=1}^{N}\frac{\sum_{s=1}^{S}\partial\tilde{f}_i^s(\hat{\boldsymbol{\theta}})/\partial\boldsymbol{\theta}}{\sum_{s=1}^{S}\tilde{f}_i^s(\hat{\boldsymbol{\theta}})}\,\frac{\sum_{s=1}^{S}\partial\tilde{f}_i^s(\hat{\boldsymbol{\theta}})/\partial\boldsymbol{\theta}'}{\sum_{s=1}^{S}\tilde{f}_i^s(\hat{\boldsymbol{\theta}})}\right]^{-1}, \tag{12.21}$$

where $\tilde{f}_i^s(\hat{\boldsymbol{\theta}}) = \tilde{f}(y_i|\mathbf{x}_i, \mathbf{u}_i^s, \hat{\boldsymbol{\theta}}_{MSL})$. Alternative estimates of the variance matrix can be
obtained by similar adaptation of the Hessian estimate and sandwich estimates defined
in Section 5.5.2.

An important practical issue concerns the number of simulations. One can increase
the number of simulations as the sample size increases, but the level or the absolute
value of $S$ remains indeterminate. If there is little difference in the estimates using
2,400 simulations, say, rather than 2,600, then we might take this as an indication that
2,400 simulations is an adequate number. Suppose now that the sample increases fourfold.
By how much should we increase the number of simulations? Proposition 12.1
suggests that we should more than double $S$ to more than 4,800, so that the ratio $\sqrt{N}/S$
decreases toward zero. However, notice that in this case we may not be sure if $\sqrt{N}/S$,
here equal to 1/30 if $S = 2{,}400$ and $N = 6{,}400$, say, is sufficiently close to zero. So
the question of whether one has done enough simulations is difficult to answer. Many
practitioners rely on rough indicators of convergence of point estimates, informally
based on checking the gradients of $\hat{L}_N(\boldsymbol{\theta})$. A formal test-based approach to choosing
$S$ is discussed in Hajivassiliou (2000).
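
The arithmetic behind "more than double" can be written out. As an illustrative assumption, let the original sample have $N_0$ observations and $S_0$ draws, and let the sample grow to $4N_0$; the symbols $N_0$ and $S_0$ are introduced here only for this sketch.

```latex
\frac{\sqrt{4N_0}}{S} \;=\; \frac{2\sqrt{N_0}}{S} \;<\; \frac{\sqrt{N_0}}{S_0}
\quad\Longleftrightarrow\quad S > 2S_0 ,
\qquad\text{and for } N = 6{,}400,\; S = 2{,}400:\quad
\frac{\sqrt{N}}{S} = \frac{80}{2{,}400} = \frac{1}{30}.
```

Exactly doubling $S$ therefore leaves the ratio unchanged; only a more than proportional increase moves it toward zero.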


12.4.4.  Asymptotic  Bias-Adjusted  MSL 

The MSL estimator is inconsistent, or asymptotically biased, when the number of
simulations $S < \infty$. This bias arises for finite $S$ because $\ln \hat{f}_i$ is biased for $\ln f_i$ even if
the simulator $\hat{f}_i$ is unbiased for $f_i$, as the consequence of taking the natural logarithm.
Thus $N^{-1}\ln \hat{L}_N(\boldsymbol{\theta})$ and $N^{-1}\ln L_N(\boldsymbol{\theta})$ have different probability limits for finite $S$. This
motivates a search for alternative simulation-based estimators, since we can never set
$S = \infty$ and it may be computationally expensive to set $S$ to be large.

The obvious approach is to find an unbiased simulator for the log-density $\ln f_i$,
rather than for $f_i$, but in practice this is not possible. Instead, in this section we present
a bias-corrected version of MSL, and in the following section we present an alternative,
less efficient estimator than MSL that is consistent for finite $S$.

Gourieroux and Monfort (1991) give an expression for the bias of the MSL estimator.
The inconsistency of the MSL estimator for fixed $S$ comes from the fact that then
$\ln \hat{f}$ is an inconsistent estimator of $\ln f$. A way of reducing the inconsistency is to use
a bias-adjusted log-likelihood function. Write

$$\ln \hat{f} = \ln[f + (\hat{f} - f)].$$

Taking a second-order Taylor expansion of $\ln \hat{f}$ around $f$ yields

$$\ln \hat{f} \simeq \ln f + \frac{\hat{f} - f}{f} - \frac{1}{2}\frac{(\hat{f} - f)^2}{f^2}.$$

Integrating with respect to the density of $\mathbf{u}$, and solving for $\ln f$, yields

$$\ln f \simeq \mathrm{E}_{\mathbf{u}}[\ln \hat{f}] + \frac{1}{2}\frac{\mathrm{E}_{\mathbf{u}}[(\hat{f} - f)^2]}{f^2}, \tag{12.22}$$

assuming $\hat{f}$ is an unbiased simulator so that $\mathrm{E}_{\mathbf{u}}[\hat{f}] = f$. This expression makes it clear
that a simulator $\hat{f}$ with small variance leads to lower bias.

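A bias-corrected estimator uses an adjusted log-likelihood based on the right-hand
side of (12.22). For the simulator (12.17), $\hat{f}$ equals $S^{-1}\sum_s \tilde{f}^s$ and, given draws
independent over $s$, $\mathrm{E}_{\mathbf{u}}[(\hat{f} - f)^2]$ equals $S^{-2}\sum_s \mathrm{E}_{\mathbf{u}}[(\tilde{f}^s - f)^2]$, the variance of an
average of $S$ independent terms. The latter can be approximated by
$S^{-2}\sum_s(\tilde{f}^s - \hat{f})^2$. Then (12.22) yields the first-order asymptotic
bias-corrected MSL estimator, $\hat{\boldsymbol{\theta}}_{BCMSL}$, which maximizes

$$\ln \hat{L}_{B,N}(\boldsymbol{\theta}) = \sum_{i=1}^{N}\left\{\ln \hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}) + \frac{1}{2}\,\frac{S^{-2}\sum_{s=1}^{S}\left[\tilde{f}(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s) - \hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})\right]^2}{\hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})^2}\right\},$$

where $\hat{f}(y_i|\mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}) = S^{-1}\sum_s \tilde{f}(y_i|\mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s)$. The usefulness of this bias-reduction
technique will vary from case to case, as the assumption that bias is small may not
always hold.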
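
The effect of the adjustment can be checked numerically in a case where $\ln f$ is known. The sketch below is an assumption-laden illustration, not the authors' calculation: the subsimulator is the $\mathcal{N}[\theta + u, 1]$ density with $u \sim \mathcal{N}[0, 1]$, so the true density is $\mathcal{N}[\theta, 2]$, and the variance of the simulator $\hat{f}$ (an average of $S$ independent draws) is estimated by $S^{-2}\sum_s(\tilde{f}^s - \hat{f})^2$.

```python
import numpy as np

def normal_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

# Hypothetical model: subsimulator is the N[theta + u, 1] density at y with u ~ N[0, 1],
# so the true density f(y | theta) is N[theta, 2] and ln f is known exactly.
y, theta, S, R = 0.0, 0.0, 10, 200_000
rng = np.random.default_rng(0)                  # arbitrary seed
u = rng.standard_normal((R, S))                 # R independent replications of S draws

f_sub = normal_pdf(y - theta - u)               # subsimulator values, shape (R, S)
f_hat = f_sub.mean(axis=1)                      # simulator, one value per replication
# Estimate of E[(f_hat - f)^2], the variance of an average of S independent draws.
var_hat = np.sum((f_sub - f_hat[:, None])**2, axis=1) / S**2

ln_f_true = np.log(1.0 / np.sqrt(4.0 * np.pi))
bias_raw = np.mean(np.log(f_hat)) - ln_f_true
bias_corrected = np.mean(np.log(f_hat) + 0.5 * var_hat / f_hat**2) - ln_f_true
print(bias_raw, bias_corrected)
```

Averaged over replications, the unadjusted $\ln \hat{f}$ lies below $\ln f$, and the adjusted version is much closer; how well this carries over to other models depends on the simulator's variance.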


12.4.5.  Unobserved  Heterogeneity  Example 

Suppose that $y_i \sim \mathcal{N}[\theta_i, 1]$, where the scalar parameter $\theta_i$ varies across individuals
with $\theta_i = \theta + u_i$, with $u_i$ representing unobserved heterogeneity that is assumed to
have a known distribution. The density of $y$ conditional on $u$ is simply

$$f(y|u, \theta) = \frac{1}{\sqrt{2\pi}}\exp\{-(y - \theta - u)^2/2\}. \tag{12.23}$$

However, inference on $\theta$ needs to be based on the marginal density of $y$ (i.e., marginal
with respect to $u$), which requires integrating out $u$. Here we assume that $u$ has density

$$g(u) = e^{-u}\exp(-e^{-u}), \tag{12.24}$$

a skewed distribution that has nonzero mean and for simplicity does not depend on
unknown parameters.

Maximum likelihood estimation is not possible as the marginal density $f(y|\theta)$,
which equals $\int f(y|\theta, u)g(u)\,du$, has no closed-form solution. We instead use the MSL
estimator using the direct simulator in (12.16), so that $\hat{\theta}_{MSL}$ maximizes

$$\ln \hat{L}_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ln\left(\frac{1}{S}\sum_{s=1}^{S}\frac{1}{\sqrt{2\pi}}\exp\{-(y_i - \theta - u_i^s)^2/2\}\right), \tag{12.25}$$

where $u_i^s$, $s = 1, \dots, S$, are draws from the extreme value density $g(u_i)$ in (12.24).
The MSL estimator $\hat{\theta}_{MSL}$ is the solution to the first-order conditions

$$\frac{\partial \ln \hat{L}_N(\theta)}{\partial\theta} = \frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{s=1}^{S}(y_i - \theta - u_i^s)\exp\{-(y_i - \theta - u_i^s)^2/2\}}{\sum_{s=1}^{S}\exp\{-(y_i - \theta - u_i^s)^2/2\}} = 0,$$

upon some simplification. There is no closed-form solution for $\theta$, but standard iterative
methods can be used to compute $\hat{\theta}_{MSL}$.
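
One way to compute this estimator is sketched below. Several things here are assumptions made for illustration rather than the procedure used in the text: the sample size and number of draws, the seed, NumPy's `gumbel` generator for the extreme value draws, and the simple fixed-point iteration $\theta \leftarrow \theta + \partial \ln \hat{L}_N/\partial\theta$ that the first-order conditions suggest for this particular model (a gradient method from Chapter 10 could be used instead).

```python
import numpy as np

rng = np.random.default_rng(123)       # arbitrary seed
N, S, theta_true = 500, 200, 1.0       # illustrative sizes, not those of the text

# Data from (12.23)-(12.24): y = theta + u + e, u type I extreme value, e ~ N[0, 1].
y = theta_true + rng.gumbel(size=N) + rng.standard_normal(N)
u = rng.gumbel(size=(N, S))            # simulation draws, drawn once and held fixed

def sim_loglik(theta):
    """Simulated log-likelihood (12.25), averaged over observations."""
    resid = y[:, None] - theta - u
    f_hat = np.mean(np.exp(-0.5 * resid**2), axis=1) / np.sqrt(2.0 * np.pi)
    return np.mean(np.log(f_hat))

def score(theta):
    """Derivative of (12.25) with respect to theta."""
    resid = y[:, None] - theta - u
    w = np.exp(-0.5 * resid**2)
    return np.mean(np.sum(resid * w, axis=1) / np.sum(w, axis=1))

# Setting the score to zero gives theta = theta + score(theta); iterate to a fixed point.
theta_hat = y.mean() - u.mean()        # starting value
for _ in range(500):
    step = score(theta_hat)
    theta_hat += step
    if abs(step) < 1e-12:
        break
print(theta_hat, score(theta_hat), sim_loglik(theta_hat))
```

The draws `u` are generated once, before the iterations start, so the objective does not change as $\theta$ is updated.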

Consistency of the MSL estimator requires the number of draws $S \to \infty$, in addition
to the usual sample size $N \to \infty$, so the method is potentially computationally
intensive. The MSL estimator is then asymptotically normally distributed as usual,
with asymptotic variance most easily estimated using the BHHH estimator (12.21),
which yields

$$\hat{\mathrm{V}}[\hat{\theta}_{MSL}] = \left[\sum_{i=1}^{N}\left(\frac{\sum_{s=1}^{S}(y_i - \hat{\theta}_{MSL} - u_i^s)\exp\{-(y_i - \hat{\theta}_{MSL} - u_i^s)^2/2\}}{\sum_{s=1}^{S}\exp\{-(y_i - \hat{\theta}_{MSL} - u_i^s)^2/2\}}\right)^2\right]^{-1}. \tag{12.27}$$

This estimator is fully efficient.


Table 12.2. Maximum Simulated Likelihood Estimation: Example

Number of Simulations                 S = 1      S = 10     S = 100    S = 1,000   S = 10,000
MSL estimate $\hat{\theta}$           1.0416     1.0594     1.1775     1.1845      1.1828
Standard error                        (.0968)    (.1093)    (.1453)    (.1448)     (.0091)
$\ln \hat{L}(\hat{\theta})$           -136.31    -174.38    -190.44    -192.43     -192.35

To illustrate we consider a sample $\{y_1, \dots, y_{100}\}$ of size $N = 100$ generated from
the model of (12.23) and (12.24) with $\theta = 1$. Table 12.2 gives estimates as the number
of draws $S$ increases. For small $S$ the MSL estimator is inconsistent. By $S = 10{,}000$
the estimator $\hat{\theta}_{MSL}$ has stabilized, though the estimated standard error bounces around
quite a bit. The simulated log-likelihood decreases as $S$ increases but eventually
stabilizes. Note that although the simulator is unbiased for $f(y|\theta)$, its logarithm is biased
for $\ln f(y|\theta)$: by Jensen's inequality $\ln \mathrm{E}[\hat{f}(y|\theta)] > \mathrm{E}[\ln \hat{f}(y|\theta)]$ because
the natural logarithm function is globally concave; see Appendix A (Section A.8).


12.5.  Moment-Based  Simulation  Estimation 

The  simulation  approach  to  estimation  when  there  is  no  closed-form  expression  for 
the  objective  function  can  be  extended  to  estimators  other  than  the  MLE.  Furthermore, 
in  some  cases  it  is  possible  to  obtain  consistent  parameter  estimates  with  only  a  few 
simulations  per  observation,  though  there  is  then  an  efficiency  loss. 


12.5.1.  Simulated  m-Estimators 

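Consider an m-estimator that has as its objective function (see Section 5.2.2)

$$Q_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} q(y_i, \mathbf{x}_i, \boldsymbol{\theta}).$$

Maximum likelihood is the special case $q(y, \mathbf{x}, \boldsymbol{\theta}) = \ln f(y|\mathbf{x}, \boldsymbol{\theta})$.

Suppose there is no closed-form expression for $q(\cdot)$, but a simulated estimate is
available. Then a simulated m-estimator maximizes

$$\hat{Q}_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\hat{q}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta}), \tag{12.28}$$

where, similar to Section 12.4.1, $\hat{q}_i$ is an estimate of $q_i$ based on a vector $\mathbf{u}_{iS}$
of $S$ draws $\mathbf{u}_i^s$, $s = 1, \dots, S$, from an appropriate distribution. Usually, $\hat{q}(\cdot) =
S^{-1}\sum_s \tilde{q}(y_i, \mathbf{x}_i, \boldsymbol{\theta}, \mathbf{u}_i^s)$, where $\mathbf{u}_i^s$ is the $s$th draw.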

The simulated m-estimator will be consistent if the m-estimator is consistent and
additionally

$$\operatorname{plim}\hat{Q}_N(\boldsymbol{\theta}) = \operatorname{plim} Q_N(\boldsymbol{\theta}), \tag{12.29}$$

since from Section 5.3 the necessary condition for consistency of the original
m-estimator is that $\operatorname{plim} Q_N(\boldsymbol{\theta})$ is maximized at $\boldsymbol{\theta} = \boldsymbol{\theta}_0$. Here the first plim is with
respect to all stochastic variables, including the simulated draws $\mathbf{u}_{iS}$, whereas the second
plim does not depend on $\mathbf{u}_{iS}$.

Condition (12.29) is satisfied if the simulator is such that $\hat{q}_i - q_i \xrightarrow{p} 0$ as $S \to \infty$,
since then $N^{-1}\sum_i \hat{q}_i - N^{-1}\sum_i q_i \xrightarrow{p} 0$. This was the assumption made in
Section 12.4. Furthermore, the simulated m-estimator should have the same limit
distribution as the m-estimator if, as in Section 12.4, $S$ increases with sample size so that
$\sqrt{N}/S \to 0$. This requires many simulations.


12.5.2.  Reducing  the  Number  of  Simulations 

Now suppose the simulator $\hat{q}_i$ is not only consistent but is unbiased. Then by application
of a law of large numbers, and for simplicity suppressing stochastic variables other
than the simulated draws, $\operatorname{plim}\hat{Q}_N(\boldsymbol{\theta}) = \lim N^{-1}\sum_i \mathrm{E}_{\mathbf{u}_{iS}}[\hat{q}_i] = \lim N^{-1}\sum_i q_i =
\operatorname{plim} Q_N(\boldsymbol{\theta})$ and condition (12.29) is satisfied. Thus the simulated m-estimator is
consistent with as little as one draw of $\mathbf{u}_i$ per observation, provided $\mathrm{E}_{\mathbf{u}_{iS}}[\hat{q}_i] = q_i$.

Unfortunately, this result is difficult to implement, as in applications it is rarely
possible to find an unbiased simulator for $q_i$. For example, with ML estimation it can
be possible to find an unbiased simulator for the density $f_i$, but it is not possible to
find an unbiased simulator for $\ln f_i$. Similarly, for NLS estimation it can be possible
to find an unbiased simulator for the conditional mean, but it is not possible to find an
unbiased simulator for the squared error, which involves the square of the conditional
mean.
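
The NLS case can be seen in one line. Writing $m$ for the conditional mean and $\tilde{m}$ for a simulator with $\mathrm{E}_{\mathbf{u}}[\tilde{m}] = m$ (notation introduced here only for this illustration), the simulated squared error satisfies

```latex
\mathrm{E}_{\mathbf{u}}\!\left[(y - \tilde{m})^2\right]
  = (y - m)^2 - 2(y - m)\,\mathrm{E}_{\mathbf{u}}[\tilde{m} - m] + \mathrm{E}_{\mathbf{u}}\!\left[(\tilde{m} - m)^2\right]
  = (y - m)^2 + \mathrm{V}_{\mathbf{u}}[\tilde{m}] ,
```

so it exceeds the true squared error by the simulation variance $\mathrm{V}_{\mathbf{u}}[\tilde{m}]$, which does not vanish when the number of draws is fixed.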

In  some  cases  this  result  can  be  implemented,  however,  if  the  estimator  is  a  method 
of  moments  or  GMM  estimator  rather  than  an  m-estimator. 


12.5.3.  Method  of  Simulated  Moments 
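Suppose theory leads to a conditional moment condition

$$\mathrm{E}[m(y_i, \mathbf{x}_i, \boldsymbol{\theta}_0)|\mathbf{x}_i] = 0, \tag{12.30}$$

where $m(\cdot)$ is a scalar for simplicity. Let $\mathbf{w}_i$ denote instruments, a function of $\mathbf{x}_i$ and
possibly $\boldsymbol{\theta}_0$, that satisfy

$$\mathrm{E}[\mathbf{w}_i m(y_i, \mathbf{x}_i, \boldsymbol{\theta}_0)] = \mathbf{0}. \tag{12.31}$$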




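The method of moments estimator $\hat{\boldsymbol{\theta}}_{MM}$ (see Chapter 6.3.1) minimizes

$$Q_N(\boldsymbol{\theta}) = \left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i m(y_i, \mathbf{x}_i, \boldsymbol{\theta})\right]'\left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i m(y_i, \mathbf{x}_i, \boldsymbol{\theta})\right], \tag{12.32}$$

where for simplicity the just-identified case that $\dim[\mathbf{w}_i] = \dim[\boldsymbol{\theta}]$ is assumed. Results
do generalize to the overidentified case, but the notation is more cumbersome as a
weighting matrix then needs to be introduced and estimation is by GMM.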

The method of moments estimator is consistent and has limit normal distribution
with variance matrix that depends in part on the choice of instruments $\mathbf{w}_i$. An example
is nonlinear regression, where $m(y, \mathbf{x}, \boldsymbol{\theta}) = y - \mathrm{E}[y|\mathbf{x}]$ is the error term and the
conditional mean $\mathrm{E}[y|\mathbf{x}]$ is a specified function of $\mathbf{x}$ and $\boldsymbol{\theta}$. Then the best choice of
instrument is $\mathbf{w} = \partial\mathrm{E}[y|\mathbf{x}]/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}$ if the error is homoskedastic, since then the method of
moments estimator has the same first-order conditions as those for the NLS estimator.

Now suppose there is no closed-form expression for $m(y, \mathbf{x}, \boldsymbol{\theta})$. For example, a nonlinear
regression model may lack a closed-form expression for the conditional mean.
Instead, $m(y, \mathbf{x}, \boldsymbol{\theta})$ is an integral

$$m(y_i, \mathbf{x}_i, \boldsymbol{\theta}) = \int h(y_i, \mathbf{x}_i, \mathbf{u}_i, \boldsymbol{\theta})\,g(\mathbf{u}_i)\,d\mathbf{u}_i, \tag{12.33}$$

for some functions $h(\cdot)$ and $g(\cdot)$, that has no closed-form solution. Obtaining a method
of moments estimator is no longer feasible.

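The method of simulated moments (MSM) estimator $\hat{\boldsymbol{\theta}}_{MSM}$ instead minimizes

$$\hat{Q}_N(\boldsymbol{\theta}) = \left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i \hat{m}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})\right]'\left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i \hat{m}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})\right], \tag{12.34}$$

where $\hat{m}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})$ is an unbiased simulator for $m(y_i, \mathbf{x}_i, \boldsymbol{\theta})$ that satisfies the
condition

$$\mathrm{E}[\hat{m}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})] = m(y_i, \mathbf{x}_i, \boldsymbol{\theta}), \tag{12.35}$$

and $\mathbf{u}_{iS}$ denotes $S$ draws from the marginal density $g(\mathbf{u}_i)$ and $S \geq 1$. Examples of $m_i$
and unbiased simulator $\hat{m}_i$ are given in the following.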


12.5.4.  Distribution  of  MSM  Estimator 

The MSM estimator was proposed by McFadden (1989), who proved the following
properties for the estimator.


Proposition 12.2 (Distribution of MSM Estimator) (McFadden 1989): Assume
the following:

(i) The data are from a simple random sample from a dgp, where $m(y, \mathbf{x}, \boldsymbol{\theta}_0)$ has zero
conditional expectation as in (12.30) and $\mathbf{w}_i m(y, \mathbf{x}, \boldsymbol{\theta}_0)$ has zero unconditional
expectation as in (12.31) and assumptions are satisfied so that the MM estimator
that minimizes (12.32) is consistent and asymptotically normal.

(ii) The function $m(y, \mathbf{x}, \boldsymbol{\theta}_0)$ is defined by (12.33) and is estimated using the unbiased
simulator $\hat{m}(y, \mathbf{x}, \mathbf{u}_S, \boldsymbol{\theta}_0)$ that satisfies (12.35).

Then with $S$ fixed the method of simulated moments estimator that minimizes
(12.34) is consistent as $N \to \infty$ and has a limit normal distribution with

$$\sqrt{N}(\hat{\boldsymbol{\theta}}_{MSM} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}^{-1}(\boldsymbol{\theta}_0)\mathbf{B}(\boldsymbol{\theta}_0)\mathbf{A}^{-1}(\boldsymbol{\theta}_0)'\right], \tag{12.36}$$

where

$$\mathbf{A}(\boldsymbol{\theta}_0) = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\left.\frac{\partial m_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0} \tag{12.37}$$

and

$$\mathbf{B}(\boldsymbol{\theta}_0) = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i \mathrm{V}[\hat{m}_i(\boldsymbol{\theta}_0)]\mathbf{w}_i', \tag{12.38}$$

with the variance $\mathrm{V}[\cdot]$ being with respect to both the conditional distribution of $y_i$
given $\mathbf{x}_i$ and the draws $\mathbf{u}_{iS}$ given after (12.35).


Before giving a derivation for this proposition we note the following. First, the
MSM estimator has the remarkable property of being consistent even if $S = 1$. Second,
there is an efficiency loss for finite $S$. The variance matrix for $\hat{\boldsymbol{\theta}}_{MM}$ is the same as that
for $\hat{\boldsymbol{\theta}}_{MSM}$, except that for MM estimation $\mathrm{V}[\hat{m}_i]$ in (12.38) is replaced by the smaller
$\mathrm{V}[m_i]$. Third, the efficiency loss caused by simulation disappears as $S \to \infty$, since
then $\mathrm{V}[\hat{m}_i] \to \mathrm{V}[m_i]$. Fourth, as for MM estimation, the MSM estimator with $S \to \infty$
may be inefficient compared to other estimators if the instruments $\mathbf{w}_i$ are poorly chosen.

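Consistency of the MSM estimator requires that condition (12.29) is satisfied for
$\hat{Q}_N(\boldsymbol{\theta})$ and $Q_N(\boldsymbol{\theta})$ given in (12.34) and (12.32). By a law of large numbers

$$\operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\mathrm{E}_{\mathbf{u}_{iS}}[\hat{m}_i],$$

where the first plim is with respect to all stochastic variables whereas the second
plim is with respect to all stochastic variables aside from the simulated draws $\mathbf{u}$. Here
$\mathrm{E}_{\mathbf{u}_{iS}}[\hat{m}_i] = m_i$ since $\hat{m}_i$ is an unbiased simulator, so

$$\operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i = \operatorname{plim} N^{-1}\sum_{i=1}^{N}\mathbf{w}_i m_i.$$

This in turn implies that $\operatorname{plim}\hat{Q}_N(\boldsymbol{\theta}) = \operatorname{plim} Q_N(\boldsymbol{\theta})$. So $\hat{\boldsymbol{\theta}}_{MSM}$ is consistent, provided
$\boldsymbol{\theta}_0$ minimizes $\operatorname{plim} Q_N(\boldsymbol{\theta})$, which is necessary for the original MM estimator to be
consistent.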

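For the limit distribution, differentiating $\hat{Q}_N(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ yields

$$\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\frac{\partial\hat{m}_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right)'\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i(\boldsymbol{\theta}) = \mathbf{0}.$$

The first matrix is a full-rank square matrix, so equivalently $\hat{\boldsymbol{\theta}}_{MSM}$ satisfies the
first-order conditions

$$\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i(\hat{\boldsymbol{\theta}}) = \mathbf{0},$$

where $\hat{m}_i(\boldsymbol{\theta}) = \hat{m}(y_i, \mathbf{x}_i, \mathbf{u}_{iS}, \boldsymbol{\theta})$. By the usual exact first-order Taylor series expansion
about $\boldsymbol{\theta}_0$

$$\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i(\boldsymbol{\theta}_0) + \sum_{i=1}^{N}\mathbf{w}_i\left.\frac{\partial\hat{m}_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^*}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = \mathbf{0},$$

and hence

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = -\left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\left.\frac{\partial\hat{m}_i(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^*}\right]^{-1}N^{-1/2}\sum_{i=1}^{N}\mathbf{w}_i\hat{m}_i(\boldsymbol{\theta}_0).$$

Now $\mathrm{E}_{\mathbf{u}}[\partial\hat{m}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}] = \partial\mathrm{E}_{\mathbf{u}}[\hat{m}(\boldsymbol{\theta})]/\partial\boldsymbol{\theta} = \partial m(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$, so the matrix in brackets on the
right-hand side converges to $\mathbf{A}(\boldsymbol{\theta}_0)$ given in Proposition 12.2. The second term on the
right-hand side has a limit normal distribution with mean zero and variance matrix

$$\mathbf{B}(\boldsymbol{\theta}_0) = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i\mathrm{V}[\hat{m}_i(\boldsymbol{\theta}_0)]\mathbf{w}_i',$$

as in Proposition 12.2, where $\mathrm{V}[\hat{m}_i(\boldsymbol{\theta}_0)]$ is a variance with respect to both $\mathbf{u}_{iS}$ and the
distribution of $y_i$ given $\mathbf{x}_i$.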

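Since $\mathbf{u}_{iS}$ is independent of $y_i$ we have

$$\mathrm{V}_{y,\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)] = \mathrm{V}_y[\mathrm{E}_{\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)]] + \mathrm{E}_y[\mathrm{V}_{\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)]]
= \mathrm{V}_y[m(\boldsymbol{\theta}_0)] + \mathrm{E}_y[\mathrm{V}_{\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)]].$$

Substitution yields a more detailed definition of $\mathbf{B}(\boldsymbol{\theta}_0)$ given in Proposition 12.2.

Simulation inflates the variance of the MSM estimator because of the term
$\mathrm{E}_y[\mathrm{V}_{\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)]]$, which goes to zero as $S \to \infty$. In the special case that the simulator
is the frequency simulator, it can be shown that $\mathrm{V}_{y,\mathbf{u}}[\hat{m}(\boldsymbol{\theta}_0)] = (1 + 1/S)\mathrm{V}_y[m(\boldsymbol{\theta}_0)]$,
so that the effect of simulation using the frequency simulator is to inflate the variance
of the MM estimator by the factor $(1 + 1/S)$.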


12.5.5.  Choosing  between  MSM  and  MSL 

The practitioner will weigh the pros and cons of MSL versus MSM. Given that MSM
is consistent for small $S$, and further given the difficulty of ensuring that one has set $S$
at a large enough value to ensure a good approximation to the MLE, why would MSL
ever be preferred to MSM?

First, observe that MSL is in principle straightforward and simple to implement.
Given the parametric assumptions, the optimal weighting of observations is inherent
to the MLE method. The MSM, analogous to the GMM, in contrast requires us to work
with products of weight (or instrumental variable) functions and residuals, and these
components may be correlated. The numerical instability of the GMM estimator (without
simulation) has been documented by, for example, Altonji and Segal (1996) (see
Section 6.3.5). Similarly, Geweke, Keane, and Runkle (1997) and McFadden and Ruud
(1994) have provided evidence of the instability of the MSM estimator. Nevertheless,
although simplicity favors MSL, the problems associated with ensuring that
a sufficient number of simulations is used should not be underestimated.


12.5.6.  Unobserved  Heterogeneity  Example 

We return to the example of Section 12.4.5. Then $y_i \sim \mathcal{N}[\theta + u_i, 1]$, where $u_i$ has density $g(u_i)$ given in (12.24). Since $E[y_i - \theta - u_i] = 0$, we can estimate $\theta$ by the method of moments estimator that solves

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \theta - E[u_i]\right) = 0, \qquad (12.39)$$

yielding $\hat\theta_{MM} = \bar y - E[u]$. Suppose that $E[u]$ is unknown. Then we can instead use the MSM estimator $\hat\theta_{MSM}$ that solves

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \theta - \frac{1}{S}\sum_{s=1}^{S}u_i^s\right) = 0, \qquad (12.40)$$

where $u_i^s$ are iid random draws from the extreme value distribution.

The estimating equation (12.40) can be solved, yielding

$$\hat\theta_{MSM} = \bar y - \bar u, \qquad (12.41)$$

where $\bar u = (NS)^{-1}\sum_{i}\sum_{s} u_i^s$ is an average over both $N$ and $S$. More generally, however, an iterative method may be needed to compute the MSM estimator.

The variance of $\hat\theta_{MSM}$ is easily obtained. By construction the simulated draws of $u$ are independent of each other and of the original data $y$, so that $V[\hat\theta_{MSM}] = V[\bar y] + V[\bar u]$. Now $V[\bar y] = (\sigma_u^2 + 1)/N$. Since $\bar u$ is the average of $NS$ draws of $u$, $V[\bar u] = \sigma_u^2/NS$, and it follows that

$$V[\hat\theta_{MSM}] = V[\bar y] + V[\bar u] = \frac{\sigma_u^2 + 1}{N} + \frac{\sigma_u^2}{NS}. \qquad (12.42)$$

This can be consistently estimated using $\hat\sigma_u^2 = (NS)^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}(u_i^s - \bar u)^2$.

We consider a sample $\{y_1, \ldots, y_{100}\}$ of size $N = 100$ generated from the model (12.24) with $\theta = 1$. Table 12.3 gives the MSM estimator as the number of draws $S \to \infty$. As the number of simulations $S$ increases the MSM estimator approaches the method of moments estimate, and the standard error falls.

Table 12.3.  Method of Simulated Moments Estimation: Example

Number of Simulations      S = 1      S = 10     S = 100    S = 1,000   S = ∞ (MM)
MSM estimate $\hat\theta$  1.0073     1.1096     1.2012     1.1887      1.1879
Standard error             (.2471)    (.1657)    (.1681)    (.1676)     (.1684)
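
The calculation behind (12.41) and (12.42) can be sketched in a few lines. This is a minimal illustration, not the computation that produced Table 12.3: it assumes that the extreme value density in (12.24) is the one for which exp(u) is unit exponential, and the seed, function names, and sample are hypothetical, so the numbers will differ from those in the table.

```python
import math
import random

def draw_u(rng):
    # Assumed form of (12.24): u = ln(e) with e unit exponential.
    r = rng.random()
    while r == 0.0:                 # avoid log(0) on the (very rare) zero draw
        r = rng.random()
    return math.log(-math.log(r))

def msm_estimate(y, S, rng):
    # MSM estimate (12.41) and its estimated standard error from (12.42).
    N = len(y)
    draws = [draw_u(rng) for _ in range(N * S)]
    u_bar = sum(draws) / (N * S)
    theta_hat = sum(y) / N - u_bar
    s2_u = sum((u - u_bar) ** 2 for u in draws) / (N * S)
    se = math.sqrt((s2_u + 1.0) / N + s2_u / (N * S))
    return theta_hat, se

rng = random.Random(12345)
theta0, N = 1.0, 100
# y_i ~ N[theta + u_i, 1]
y = [theta0 + draw_u(rng) + rng.gauss(0.0, 1.0) for _ in range(N)]
results = {S: msm_estimate(y, S, rng) for S in (1, 10, 100, 1000)}
for S, (theta_hat, se) in results.items():
    print(f"S={S:5d}  theta_hat={theta_hat:.4f}  se={se:.4f}")
```

The second term of (12.42) shrinks at rate 1/S, so beyond moderate S the standard error is dominated by the sampling variation in the data rather than by simulation noise.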

12.6.  Indirect  Inference 

In this section we outline another simulation-based approach to model estimation that is sometimes used when one wants to use or estimate a model that is relatively simple to estimate, even when the underlying dgp is thought to be more complex and harder to estimate. There are several variants and interpretations of the approach; see Gourieroux, Monfort, and Renault (1993), Smith (1993), and Gallant and Tauchen (1996). The approach has also been called the moment matching approach. Our exposition essentially follows the first of the aforementioned references.

Suppose that the parametrically specified dgp is denoted by the pdf $f(y; \theta)$, $\theta \in \mathcal{R}^q$, whose parameters are relatively difficult to estimate. Suppose that we can specify an auxiliary model with the dgp $f^a(y; \beta)$, $\beta \in \mathcal{R}^r$, which is easier to estimate by the quasi- (sometimes also called “pseudo-”) maximum likelihood method. For reasons of identification that are further discussed in the following, we assume that the dimension of $\beta$ is not smaller than the dimension of $\theta$, that is, $r \ge q$. For example, the auxiliary model may be an approximation to the exact likelihood, or it may be an exact likelihood of an approximate model. For a given sample, let $\hat\beta$ denote the QML estimates. Then, by the results covered in Section 5.7, we know that $\hat\beta$ is in general an inconsistent estimator of $\theta$, and under some regularity conditions it converges in probability to a value called the pseudo-true value, which is a function of $\theta$. The function that connects the parameters of the auxiliary model to those of the dgp is called the binding function, denoted as $h(\theta)$. The analytical form of this function may or may not be known. Therefore, it may not always be possible to obtain $\theta = h^{-1}(\beta)$ or $\hat\theta = h^{-1}(\hat\beta)$.

The method of indirect inference can be used to obtain an improved QML estimator with a smaller asymptotic bias than $\hat\beta$. The idea is to use the model under $f(y; \theta)$ to generate by simulation pseudo-observations $y^{(s)}$ and to use the auxiliary model under $f^a(y^{(s)}; \beta)$ to estimate $\hat\beta^{(s)}$, where $s$ refers to the $s$th simulation. The indirect estimator is defined by the solution of

$$\hat\theta = \arg\min_{\theta}\,(\hat\beta^{(s)} - \hat\beta)'\,\Omega\,(\hat\beta^{(s)} - \hat\beta), \qquad (12.43)$$

where $\Omega$ is a given symmetric positive definite matrix. This estimator is similar to the minimum distance estimator considered in Section 6.7. That is, we sequentially generate pseudo-observations and estimate the parameters of the auxiliary model based on the pseudo-observations. The iterations continue until the quadratic form in (12.43) is minimized. A very important point is that the seed that generates the pseudo-random observations $y^{(s)}$ is kept unchanged, so that variations in the pseudo-observations across simulations are due to the variation in $\hat\beta^{(s)}$.

Before further discussion, we consider a simple but specific example involving a nonlinear dgp and a linear auxiliary model. The motivation is that the auxiliary model should be easy to estimate, and the dgp should be easy to simulate.

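Let the dgp be as follows:

$$y_i = \exp(x_i'\gamma) + u_i, \qquad (12.44)$$
$$u_i \sim \mathcal{N}[0, \sigma^2].$$

Let the auxiliary model be the following:

$$y_i = x_i'\beta + \varepsilon_i, \qquad (12.45)$$
$$\varepsilon_i \sim \mathcal{N}[0, \sigma^2].$$

Note the following interpretations:

$$\frac{\partial E[y|x]}{\partial x} = \beta \quad \text{(under the auxiliary model)},$$

$$\frac{\partial \ln E[y|x]}{\partial x} = \frac{\partial E[y|x]}{\partial x}\,\frac{1}{E[y|x]} = \gamma \quad \text{(under the dgp)}.$$

Therefore, the binding function is $\gamma E[y|x] = \beta$, or $\gamma = (E[y|x])^{-1}\beta$. Note that dim[$\beta$] equals dim[$\gamma$].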

Given the data $(x_i, y_i, i = 1, \ldots, N)$ and the least-squares estimator $\hat\beta$, and given an $N$-dimensional pseudo-random draw, denoted $u^{(0)}$, we generate $y_i^{(1)}$ $(i = 1, \ldots, N)$ using

$$y_i^{(1)} = \exp(x_i'\hat\beta) + u_i^{(0)}$$

and obtain a revised estimator $\hat\beta^{(1)} = (\sum_i x_i x_i')^{-1}\sum_i x_i y_i^{(1)}$, which in turn is used to generate another set of pseudo-observations. The entire simulation cycle is repeated, holding $u^{(0)}$ fixed, until $(\hat\beta^{(s)} - \hat\beta)'\Omega(\hat\beta^{(s)} - \hat\beta)$ approaches a constant value to desired accuracy. In the present case it is reasonable to set $\Omega$ equal to either the identity matrix or $X'X$, the latter choice implying that prediction from the auxiliary model is a modeling objective. The resulting estimate of $\gamma$ is the indirect estimator.
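
A minimal sketch of the matching idea, in a scalar version of (12.44) and (12.45), may help fix ideas. It is not the iteration described above: it assumes a single positive regressor, no intercept, a known error standard deviation, and the common variant in which one searches directly over the dgp parameter gamma while the simulation draws are held fixed. All names and parameter values are hypothetical. Because the simulated auxiliary estimate is increasing in gamma when x > 0, the quadratic form in (12.43) is minimized where the two auxiliary estimates coincide, which is found here by bisection.

```python
import math
import random

def ols_slope(x, y):
    # Auxiliary model: least squares of y on x without intercept.
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def simulate(gamma, x, u):
    # Scalar form of the dgp (12.44), reusing the fixed draws u.
    return [math.exp(gamma * a) + e for a, e in zip(x, u)]

def indirect_estimate(x, y, u_fixed, lo=-5.0, hi=5.0, tol=1e-10):
    beta_hat = ols_slope(x, y)          # auxiliary estimate on the actual data

    def gap(gamma):
        # Simulated auxiliary estimate minus the actual auxiliary estimate.
        return ols_slope(x, simulate(gamma, x, u_fixed)) - beta_hat

    while hi - lo > tol:                # bisection on the monotone gap
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi), beta_hat

rng = random.Random(2024)
N, gamma0, sigma = 500, 0.8, 0.5
x = [rng.uniform(0.1, 2.0) for _ in range(N)]
y = [math.exp(gamma0 * a) + rng.gauss(0.0, sigma) for a in x]
u_fixed = [rng.gauss(0.0, sigma) for _ in range(N)]   # drawn once, never redrawn
gamma_hat, beta_hat = indirect_estimate(x, y, u_fixed)
print(f"auxiliary beta_hat = {beta_hat:.4f}, indirect gamma_hat = {gamma_hat:.4f}")
```

Holding `u_fixed` constant is what makes the simulated auxiliary estimate a smooth deterministic function of gamma; redrawing it inside `gap` would make the search chase noise.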

In other applications dim($\beta$) will exceed dim($\theta$), so a unique value of $\theta$ may not be available. Indeed, in the absence of an analytical binding function, we cannot recover $\theta$, even if the two dimensions are the same. Then one settles for the best indirect estimates of the auxiliary model parameters.

To see the connection between the indirect estimator and moment matching, set $\Omega = X'X$; then $(\hat\beta^{(s)} - \hat\beta)'X'X(\hat\beta^{(s)} - \hat\beta) = (X\hat\beta^{(s)} - X\hat\beta)'(X\hat\beta^{(s)} - X\hat\beta)$, which indicates that the indirect estimator is “matching” the first moment of the distribution. If one also wants to match the second moment, the vector $\beta$ can be augmented by additional parameters, such as the variance parameter. Thus one can match several moments if so desired.

Under regularity conditions the indirect estimator is consistent and asymptotically normal. The reader is referred to the previously cited works for additional detail.

12.7.  Simulators

As in Section 12.3.2 we consider computation of

$$I = E[h(x)] = \int h(x)g(x)\,dx, \qquad (12.46)$$

where  for  simplicity  x  is  often  a  scalar.  As  in  Section  12.3,  x  is  being  used  here  to 
denote  the  variable  being  integrated  out,  whereas  in  application  sections  u  denotes  the 
variable  being  integrated  out  as  x  denotes  the  regressors. 

A  simulator  is  a  method  to  compute  I.  There  are  many  ways  to  do  so,  aside  from 
direct  Monte  Carlo  integration  given  in  (12.14).  Ideally,  simulators  should  be  unbiased, 
because  many  of  the  results  in  Sections  12.4  and  12.5  assume  an  unbiased  simulator, 
and  smooth  so  that  standard  iterative  gradient  methods  can  be  used.  Even  then  the 
computing  time  for  estimation  of  empirically  interesting  models  can  be  a  formidable 
obstacle.  We  present  a  few  of  the  many  clever  procedures  that  have  been  developed 
to  speed  up  simulation  by  reducing,  for  any  given  number  of  simulation  draws,  the 
simulation  variance  relative  to  crude  methods  such  as  direct  Monte  Carlo  integration. 
A  more  complete  survey  is  given  in  Geweke  and  Keane  (2001). 


12.7.1.  Frequency  Simulator 

We  begin  with  an  example,  the  frequency  simulator,  that  can  be  used  for  some  discrete 
models.  This  highlights  well  some  of  the  complications  that  can  arise  in  simulation. 

Suppose the function $h(x)$ is an indicator function that takes value 1 if $x \in A$ and 0 otherwise. Then we wish to compute

$$I = \int \mathbf{1}(x \in A)\,g(x)\,dx.$$

Direct Monte Carlo integration yields the estimate

$$\hat I_{FREQ} = \frac{1}{S}\sum_{s=1}^{S}\mathbf{1}(x^s \in A),$$

where $x^s$, $s = 1, \ldots, S$, are $S$ draws from $g(x)$. This is called the frequency simulator as it estimates $I$ by the relative frequency with which the $S$ draws of $x^s$ fall in $A$.
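
As a small illustration of the frequency simulator, the sketch below estimates Pr[x > 1] for a standard normal x, a case chosen only because the exact answer is available for comparison. The set A, the number of draws, and the function names are assumptions made for this example.

```python
import random
from statistics import NormalDist

def frequency_simulator(in_A, draw, S, rng):
    # Share of the S draws from g(x) that fall in the set A.
    return sum(1 for _ in range(S) if in_A(draw(rng))) / S

rng = random.Random(7)
p_hat = frequency_simulator(
    in_A=lambda x: x > 1.0,               # indicator 1(x in A) with A = (1, infinity)
    draw=lambda r: r.gauss(0.0, 1.0),     # draws from g(x), here standard normal
    S=20000,
    rng=rng,
)
p_exact = 1.0 - NormalDist().cdf(1.0)
print(f"frequency simulator: {p_hat:.4f}, exact: {p_exact:.4f}")
```

The estimate can only take values on the grid 0, 1/S, 2/S, ..., 1, so it is a step function of any parameter that enters the indicator or the density.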

A leading potential application - one that has motivated much of the econometrics literature on simulation methods - is the multinomial discrete choice model introduced in Section 12.2.2. For a three-alternative model, the probability $p_1$ of choosing the first alternative is given by (12.3), an integral over the positive orthant of a bivariate normal distribution. The frequency simulator $\hat p_1$ is then the proportion of draws $(u_1^s, u_2^s)$ from the bivariate normal with $u_1^s > 0$ and $u_2^s > 0$.

The frequency simulator has several limitations. First, it is neither differentiable nor continuous in the parameters $\theta$, which appear in $\mathbf{1}(x \in A)$ and/or $g(x)$: a small change in $\theta$ typically leaves the number of draws falling in the positive orthant unchanged, so the simulator is a step function of $\theta$. For this reason McFadden (1989) and Pakes and Pollard (1989) presented a more general asymptotic theory that covers such nonsmooth simulators. In practice, however, it is best to use alternative smooth simulators that are differentiable in parameters as this permits computation using the usual gradient methods.

Second, this simulator is very inefficient if only a small fraction of the draws satisfy $x \in A$. For example, for a discrete choice model with $p_1 = 0.001$, even with $S = 10{,}000$ draws the estimate $\hat p_1$ will be very noisy. Similar problems arise more generally in direct Monte Carlo evaluation of (12.46) with continuous $h(x)$ if the probability of drawing $x$ is low in regions where $h(x)$ is relatively large.
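
To put a number on this noise: with independent draws the count of draws falling in $A$ is binomial, so the frequency simulator has variance $p_1(1 - p_1)/S$. For the values in the example,

$$\sqrt{\frac{p_1(1 - p_1)}{S}} = \sqrt{\frac{0.001 \times 0.999}{10{,}000}} \approx 0.00032,$$

which is roughly 32 percent of the probability $p_1 = 0.001$ being estimated.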

Third, this simulator may have problems at the boundary and give an estimate $\hat I = 0$ or $\hat I = 1$ even if the model imposes $0 < I < 1$ and this condition is necessary for model estimation.


12.7.2.  Importance  Sampling 


The importance sampling simulator reexpresses the integral (12.46) as

$$I = \int \left(\frac{h(x)g(x)}{p(x)}\right)p(x)\,dx = \int w(x)p(x)\,dx, \qquad (12.47)$$

where $p(x)$ is a density function chosen so that (a) it is easy to draw from $p(x)$, (b) $p(x)$ has the same support as the original domain of integration, and (c) $w(x) = h(x)g(x)/p(x)$ is easy to evaluate, is bounded, and has finite variance. We then use the direct Monte Carlo integral estimate based on (12.47) rather than (12.46),

$$\hat I_{IS} = \frac{1}{S}\sum_{s=1}^{S} w(x^s), \qquad (12.48)$$

where $x^s$, $s = 1, \ldots, S$, are draws from $p(x)$ rather than $g(x)$. The term importance sampling is used because $w(x)$ determines the weight or “importance” of different points in the sample space. Importance sampling has been employed in the Bayesian simulation literature for many years and was introduced into Bayesian econometrics by Kloek and van Dijk (1978) as a way of evaluating posterior distributions. This material is further discussed in Section 13.4.

The importance sampler $\hat I_{IS}$ has variance $S^{-1}V_p[w(x)]$, given independent draws from $p(x)$. This variance is clearly minimized if $w(x)$ is a constant over the entire range of integration, since then $V_p[w(x)]$ is zero. This is done by setting $w(x) = E_g[h(x)]$, as then $p(x) = h(x)g(x)/E_g[h(x)]$ is a density that integrates to 1. Unfortunately, this theoretically ideal importance sampling estimate is not practicable, as $E_g[h(x)]$ is unknown. However, it does indicate the potential gains to importance sampling, especially if $p(x)$ is chosen so that $w(x)$ is fairly flat.
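
The gain from a fairly flat w(x) can be seen in a rare-event example. The sketch below estimates I = Pr[x > 3] for standard normal x, so h(x) is the indicator of x > 3 and g(x) is the standard normal density. The proposal p(x), an exponential density with rate 3 shifted to start at 3, is an assumption chosen for this illustration because it is easy to draw from and keeps w(x) bounded and nearly constant near the boundary.

```python
import math
import random
from statistics import NormalDist

def importance_sampler(S, rng, c=3.0):
    # Estimate Pr[x > c] for x ~ N[0, 1] using draws from p(x) = c * exp(-c (x - c)), x > c.
    total = 0.0
    for _ in range(S):
        x = c + rng.expovariate(c)                              # draw from p(x)
        g = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # target density g(x)
        p = c * math.exp(-c * (x - c))                          # proposal density p(x)
        total += g / p                                          # w(x) = h(x) g(x) / p(x), with h(x) = 1 here
    return total / S

rng = random.Random(42)
estimate = importance_sampler(5000, rng)
exact = 1.0 - NormalDist().cdf(3.0)
print(f"importance sampling: {estimate:.6f}, exact: {exact:.6f}")
```

A frequency simulator with the same 5,000 draws from g(x) would see only about seven draws above 3 on average, whereas every draw from p(x) contributes to the importance sampling estimate.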

Even  if  importance  sampling  leads  to  an  increased  variance,  which  can  occur  in 
practice,  it  does  have  other  attractions.  It  produces  a  smooth  sampler  if  w(x)  is  smooth 
in  the  parameters  to  be  estimated.  Moreover,  it  is  useful  if  draws  from  g(x)  are  difficult, 
as  can  often  be  the  case  if  x  is  a  vector  of  correlated  random  variables. 

For the multinomial probit discrete choice model a popular importance sampler is the GHK simulator, due to Geweke (1992), Hajivassiliou and McFadden (1994), and Keane (1994). This recursively truncates the multivariate normal pdf so that draws are restricted to the positive orthant. Advantages of this simulator compared to the frequency simulator are that it is smooth, requires many fewer draws for alternatives with low probability of being chosen, and is unlikely to have boundary problems.
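
The recursive-truncation idea can be illustrated for the bivariate positive orthant probability. The following is a sketch of the principle under simplifying assumptions (unit variances, a single correlation rho, a Cholesky factorization of the covariance matrix), not the implementation in the cited papers; the function name and parameter values are hypothetical. The first error is drawn from a normal truncated to the region that keeps u1 positive, and the probability that u2 is also positive is then evaluated analytically given that draw.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()

def ghk_orthant(rho, S, rng):
    # Write u1 = e1 and u2 = rho * e1 + sqrt(1 - rho^2) * e2 with e1, e2 iid N[0, 1].
    l22 = math.sqrt(1.0 - rho * rho)
    p1 = 0.5                                  # Pr[u1 > 0] = Pr[e1 > 0]
    total = 0.0
    for _ in range(S):
        r = rng.random()
        e1 = STD.inv_cdf(0.5 + 0.5 * r)       # e1 drawn from N[0, 1] truncated to e1 > 0
        p2 = STD.cdf(rho * e1 / l22)          # Pr[u2 > 0 | e1] = Pr[e2 > -rho * e1 / l22]
        total += p1 * p2
    return total / S

rng = random.Random(2)
rho = 0.5
p_ghk = ghk_orthant(rho, 5000, rng)
p_exact = 0.25 + math.asin(rho) / (2.0 * math.pi)   # known orthant probability
print(f"GHK: {p_ghk:.4f}, exact: {p_exact:.4f}")
```

Each simulated term is a product of probabilities lying strictly between 0 and 1 and varies smoothly with rho, which is the source of the smoothness and boundary properties noted above.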


12.7.3.  Variance  Reduction  by  Antithetic  Acceleration 

The  preceding  methods  assume  independent  draws,  using  methods  to  be  detailed  in 
Section  12.8,  from  an  appropriate  distribution  such  as  g(x)  or,  if  importance  sampling 
is  used,  from  p(x). 

Variance reduction methods instead use dependent draws as these can reduce the variance of a simulator. A leading example is antithetic sampling that uses negatively correlated draws. Ripley (1987, pp. 129-132), Geweke (1988), and Hajivassiliou (2000) provide a discussion of this technique and Geweke (1995) surveys this and several other variance reduction techniques.

Suppose we wish to evaluate the integral $I$ in (12.46), where $x$ is assumed to have zero mean and symmetric density $g(x)$. The direct Monte Carlo integral, based on $2S$ simulated iid draws from $g(x)$, is

$$\hat I_{2S}(x) = \frac{1}{2S}\sum_{s=1}^{2S} h(x^s)$$

and, given independence of the $2S$ draws, has variance

$$V[\hat I_{2S}(x)] = \frac{1}{2S}V[h(x)].$$

Antithetic sampling uses an alternative estimate based on only $S$ iid draws,

$$\hat I_{A,S}(x) = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{2}\left(h(x^s) + h(-x^s)\right), \qquad (12.49)$$

which is an average of $h(x)$ evaluated at $x^s$ and $-x^s$. The pair $(x^s, -x^s)$ is said to be an antithetic pair and yields an unbiased estimate of $I$ since we assume $x$ is symmetrically distributed with zero mean. If the mean is instead $\mu$ then $(x^s, 2\mu - x^s)$ is an antithetic pair. Given $S$ independent draws of $x^s$ the variance of $\hat I_{A,S}(x)$ is

$$\begin{aligned} V[\hat I_{A,S}(x)] &= \frac{1}{S^2}\sum_{s=1}^{S}\frac{1}{4}\left(V[h(x^s)] + 2\,\mathrm{Cov}[h(x^s), h(-x^s)] + V[h(-x^s)]\right) \\ &= \frac{1}{2S}\left(V[h(x)] + \mathrm{Cov}[h(x), h(-x)]\right). \end{aligned}$$

Antithetic sampling will therefore be more efficient than regular iid sampling if the covariance term is negative, since then the variance of $\hat I_{A,S}(x)$ is smaller than that of $\hat I_{2S}(x)$. By switching the sign of the draw, and then reusing the draw, an attempt is made to induce negative correlation in the simulator. Negative correlation is assured when the function is linear, and also if the nonlinearity is not too severe. However, in general, one cannot be certain that efficiency gains will be realized. For example, if $h(\cdot)$ is symmetric about zero then $\mathrm{Cov}[h(x), h(-x)] = V[h(x)]$, in which case the antithetic estimate has twice the variance of the estimate based on $2S$ iid draws.

Antithetic sampling can be extended to asymmetric density $g(x)$. Suppose $x$ can be drawn using the inverse transformation method given later in Section 12.8.2. Then one can draw $u$, say, from the uniform [0, 1], generate the antithetic transform $(1 - u)$, and then use the inverse transformation method to draw from the distribution of choice, so $x_1 = G^{-1}(u)$ and $x_2 = G^{-1}(1 - u)$, where $G(\cdot)$ is the known cdf of $x$. Then $(x_1, x_2)$ form an antithetic pair and variance reduction occurs if

$$\mathrm{Cov}[h(G^{-1}(u)), h(G^{-1}(1 - u))] = \mathrm{Cov}[f(u), f(1 - u)] < 0,$$

where $f(u)$ is the composite function $h(G^{-1}(u))$. If $f(\cdot)$ is a monotonic function then the variance is reduced (Robert and Casella, 1999, p. 112). However, this property of the function may be difficult to verify. Further, the argument applies to the inverse transformation approach only, whereas in practice other methods are used in pseudo-random number generation (see Section 12.8). Therefore it is difficult to verify in advance whether the conditions for efficiency gains are attainable in a specific application.

Although  the  dramatic  gains  in  efficiency  possible  in  some  special  cases  may  not 
materialize  in  more  complex  settings,  worthwhile  efficiency  gains  are  realized  in 
many  cases.  Antithetic  sampling  can  also  be  used  to  accelerate  importance  sampling 
(Danielsson  and  Richard,  1993). 

Antithetic sampling extends to multivariate draws. Consider bivariate draws of $(x, y)$, where the density is symmetric about (0, 0). In this case sign reversal is done first element by element and then for the pair. Thus the antithetic quadruple consists of $((x^s, y^s), (-x^s, y^s), (x^s, -y^s), (-x^s, -y^s))$. For an $m$-dimensional draw the same idea is repeated for all tuples.
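
The variance comparison for the scalar symmetric case in this subsection can be checked numerically. The sketch below is an illustration under assumed choices: x is standard normal, h(x) = exp(x) is monotone so that the covariance of h(x) and h(-x) is negative, and the replication counts and seed are arbitrary. It compares the spread of the direct estimate based on 2S iid draws with that of the antithetic estimate (12.49) based on S draws.

```python
import math
import random
import statistics

def iid_estimate(h, S, rng):
    # Direct Monte Carlo with 2S independent N[0, 1] draws.
    return sum(h(rng.gauss(0.0, 1.0)) for _ in range(2 * S)) / (2 * S)

def antithetic_estimate(h, S, rng):
    # S independent draws, each used together with its mirror image -x, as in (12.49).
    total = 0.0
    for _ in range(S):
        x = rng.gauss(0.0, 1.0)
        total += 0.5 * (h(x) + h(-x))
    return total / S

rng = random.Random(99)
S, R = 200, 400          # draws per estimate, number of replications
var_iid = statistics.pvariance([iid_estimate(math.exp, S, rng) for _ in range(R)])
var_anti = statistics.pvariance([antithetic_estimate(math.exp, S, rng) for _ in range(R)])
print(f"variance with 2S iid draws: {var_iid:.5f}")
print(f"variance with S antithetic pairs: {var_anti:.5f}")
```

Both estimators evaluate h the same number of times, so the comparison is at equal function-evaluation cost. Replacing `math.exp` by a function that is symmetric about zero, such as `lambda x: x * x`, reverses the ranking, in line with the covariance result for symmetric h.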


12.7.4.  Computation  Using  Quasi-Random  Sequences 

A  second  method  of  variance  reduction  involves  replacing  pseudo-random  numbers  by 
quasi-random  numbers,  which  are  systematic  simulation  draws  designed  to  provide 
better  coverage  of  the  sample  space.  A  potential  limitation  of  the  approach  is  that 
randomness  is  required  to  apply  the  laws  of  large  numbers  and  central  limit  theorems 
that  justify  the  simulation-based  approach. 

Quasi-Monte  Carlo  methods  use  nonrandom  points  within  the  domain  of  integration 
instead  of  using  S  pseudo-random  points.  A  leading  example  is  Halton  sequences, 
summarized  in  Press  et  al.  (1993)  and  introduced  into  the  econometrics  literature  by 
Bhat  (2001)  and  Train  (2003). 

Halton sequences have two desirable properties. First, they are designed to give fairly even coverage over the domain of the sampling distribution. With more evenly spread draws for each observation, the simulated probabilities vary less over observations, relative to those calculated with random draws. This is similar to deterministic evaluation of an integral over a specified grid. Second, with Halton sequences, the draws for one observation tend to fill in the spaces left empty by the previous observations. The simulated probabilities are, therefore, negatively correlated over observations. As in the case of antithetic variates, this negative correlation reduces the variance of the simulated function. Under suitable regularity conditions it can be shown that the integration error using quasi-random sequences is of order $N^{-1}$, compared to pseudo-random sequences where the convergence rate is $N^{-1/2}$ (Bhat, 2001).

Halton sequences are best described by example. Suppose that the function to be simulated depends on one random variable. The starting point is a prime number. The Halton sequence based on the prime number 2 is constructed as follows. Divide the unit interval (0, 1) into two parts. The dividing point 1/2 becomes the first element of the Halton sequence. Next divide each part into two more parts. The dividing points, 1/4 and 3/4, become the next two elements of the sequence. Divide each of the four parts into two parts each, and continue to obtain the sequence {1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, ...}. Similarly, the sequence based on the prime number 3 is {1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9, ...}. Halton sequences on nonprime numbers are not unique because the Halton sequence for a nonprime number divides the unit space in the same way as each of the prime numbers that constitute the nonprime.

The length of each sequence is determined by the number of observations $N$ and the number of simulation draws $S$. One discards the first few (say 20) elements of the sequence as the early elements have a tendency to be correlated over Halton sequences with different primes (see Train, 2003, for an example). Consequently, one could begin by generating Halton sequences of length $N \times S + 20$ and discard the first 20 elements of each sequence. For each element of each sequence, calculate the inverse of the cumulative normal distribution. The resulting values are the Halton draws from the sampling distribution.
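
The construction and the conversion to normal draws can be sketched as follows. The code uses the standard radical-inverse definition of the Halton sequence, in which the i-th element is obtained by writing i in the chosen prime base and reflecting its digits about the decimal point; in that ordering the eighths in base 2 are visited as 1/8, 5/8, 3/8, 7/8. The function names and the default of 20 discarded elements are choices made for this illustration.

```python
from statistics import NormalDist

def halton(index, base):
    # Radical inverse of the positive integer index in the given prime base.
    value, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * f
        f /= base
    return value

def halton_normal_draws(n, base, discard=20):
    # Skip the first `discard` elements, then map each element through the inverse normal cdf.
    std = NormalDist()
    return [std.inv_cdf(halton(i, base)) for i in range(discard + 1, discard + n + 1)]

base2 = [halton(i, 2) for i in range(1, 8)]
base3 = [halton(i, 3) for i in range(1, 6)]
draws = halton_normal_draws(n=50, base=3)
print(base2)
print([round(v, 4) for v in base3])
print([round(v, 3) for v in draws[:5]])
```

For N observations with S draws each, `n` would be set to N times S and the resulting list split into N consecutive blocks of length S, with a different prime used for each dimension of integration.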

One  major  advantage  of  quasi-random  number  draws  is  that  the  draws  are  designed 
to  cover  the  sample  space  of  random  numbers  in  a  more  uniform  fashion  than  in 
the  case  of  pseudo-random  numbers.  This  can  be  seen  visually  in  Figure  12.1.  In  this 
figure,  Panel  2  shows  a  draw  from  a  bivariate  normal  distribution  constructed  using 
a  Halton  sequence.  The  remaining  three  panels  show  pseudo-random  number  draws 
from  the  same  distribution.  The  more  even  coverage  of  the  sample  space  is  evident  in 
the  former  case. 

For a more thorough discussion and examples of simulation-based estimation that use Halton draws, and for impressive evidence of the relative efficiency of the approach in one or more dimensions, see Train (2003, Chapter 9). The method works very well for the multinomial logit model with normally distributed random parameters (Section 15.7).


12.8.  Methods  of  Drawing  Random  Variates 

The preceding simulators require draws of random variates. In this section we summarize methods to take such draws from a density, denoted $g(x)$ or $p(x)$ in Section 12.7 and denoted $f(x)$ in this section. Usually it is sufficient to obtain draws from the uniform or the standard normal (which is possible in most popular software) since these can form the basis for making draws from distributions other than the uniform or normal.

Figure 12.1: Halton sequence draws (panel 2) compared to pseudo-random draws. Panel 1: pseudo-random draws; Panel 2: Halton sequence draws; Panel 3: pseudo-random draws; Panel 4: pseudo-random draws.

If the draws are to be used for simulation-based estimation then all draws from the uniform or standard normal should be made before any estimation, to prevent “chatter,” whereby iterative methods fail to converge owing to noise created by new draws at each iteration. For example, if $x \sim \mathcal{N}[\mu, \sigma^2]$ and estimates of $\mu$ and $\sigma$ change over iterations, then we make $NS$ initial draws of $z \sim \mathcal{N}[0, 1]$ and then over iterations recompute $x = \mu + \sigma z$ using the original draws of $z$.
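
A minimal sketch of this practice, with hypothetical sizes and parameter values: the standard normal draws are made once and stored, and each iteration only rescales them.

```python
import random

rng = random.Random(1)
N, S = 5, 3
# NS standard normal draws, made once before any estimation.
z = [[rng.gauss(0.0, 1.0) for _ in range(S)] for _ in range(N)]

def simulated_x(mu, sigma):
    # Recompute x = mu + sigma * z from the stored draws; no new random numbers are used.
    return [[mu + sigma * z_is for z_is in z_i] for z_i in z]

x_first = simulated_x(0.0, 1.0)     # parameter values at one iteration
x_next = simulated_x(0.1, 1.2)      # updated parameter values at the next iteration
x_again = simulated_x(0.0, 1.0)     # returning to the first values reproduces the same x
print(x_first == x_again)
```

Because the same `z` is reused, any simulated objective built from `simulated_x` is a deterministic function of the parameters, which is what an iterative optimizer requires.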

This section provides a basic discussion of some standard methods for generating random variates. For more advanced or extensive treatments, there are many good monographs and surveys, including those by Bradley, Fox, and Schrage (1983), Dagpunar (1988), Devroye (1986), and Ripley (1987).

Before presenting the methods, note that the term random number generation is an oxymoron. A more accurate description is given by the term pseudo-random numbers. The essential characteristic of these generators is that they use deterministic devices to produce long chains of numbers that mimic the properties of the realizations from some target distribution. The specific target distribution will depend on the context, but for the purposes of this book uniform, normal, exponential, gamma, logistic, and Poisson distributions are standard. The chain process is started up by supplying a seed. After some finite but large number of values have been generated the cycle of numbers repeats itself. That is, the computer algorithms will generate exactly the same numbers beginning with a given seed. Good random number generators are those that generate a long chain of numbers without recycling and without any built-in dependence. The key consideration in choosing generators is whether the generated distribution closely mimics the properties of the target distribution at a reasonable computational cost.

12.8.1.  Pseudo-Random  Uniform  Number  Generators 

Pseudo-random uniform numbers are constructed using a deterministic sequence that mimics the statistical properties of a sequence of uniform random numbers. A good generator has a long period, has a distribution close to uniform, and produces independent draws. It is important to have a good generator, as pseudo-random numbers from virtually any distribution can then be obtained by transforming uniform pseudo-random numbers (Bradley et al., 1983, p. 24).

A standard generator begins with the equation

$$X_j = (kX_{j-1} + c) \bmod m,$$

where the modulus operator $a \bmod b$ forms the remainder when $a$ is divided by $b$. This produces a sequence of integers between 0 and $m - 1$, and the uniform random variable is then obtained as $R_j = X_j/m$ (Ripley, 1987, p. 20). A value for $X_0$, referred to as the seed, is needed to initiate the generator. The uniform random sequences generated are deterministic, which permits replication as the same numbers should be drawn if analysis is repeated with the same value of the seed. The periodicity of the cycle depends on $X_0$, $k$, and $c$. If computation is done using 32-bit integer arithmetic the maximum periodicity is approximately $2^{31} \approx 2.1 \times 10^9$. However, it is easy to choose poor values of $X_0$, $k$, and $c$ so that the periodicity is much lower than this. Books such as that by Press et al. (1993) should be consulted for potential pitfalls.
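
The recursion can be sketched directly. The parameter values below (multiplier 16807, increment 0, modulus 2^31 - 1) are one widely cited multiplicative choice and are used here only as an assumption for illustration; they are not taken from the text, and the sketch is not a recommendation for serious work.

```python
def lcg(seed, k=16807, c=0, m=2**31 - 1):
    # Congruential generator X_j = (k * X_{j-1} + c) mod m, yielding R_j = X_j / m.
    x = seed
    while True:
        x = (k * x + c) % m
        yield x / m

def take(generator, n):
    # Collect the first n values from a generator.
    return [next(generator) for _ in range(n)]

first_run = take(lcg(seed=12345), 5)
second_run = take(lcg(seed=12345), 5)    # same seed, so the same numbers
other_seed = take(lcg(seed=54321), 5)
print(first_run)
print(first_run == second_run)
```

Restarting with the same seed reproduces the sequence exactly, which is the replication property described above.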

12.8.2.  Nonuniform  Variates 

Random variables from many other distributions, including the normal itself, are usually based on an initial draw of a uniform random number. Four commonly used methods are (1) inverse transformation, (2) transformation, (3) accept-reject, and (4) mixing and compounding.


Inverse  Transformation 

Let $F(x)$ denote the cdf of the continuous random variable $x$, that is,

$$F(x) = \Pr[X \le x].$$

Given a draw of a uniform variate $r$, $0 \le r \le 1$, the inverse transformation

$$x = F^{-1}(r)$$

gives a unique value of $x$ because $F$ is continuous and monotonically increasing.

For example, the cdf of the unit exponential is $F(x) = 1 - e^{-x}$. Solving $r = 1 - e^{-x}$ yields $x = -\ln(1 - r)$. If we make a draw from uniform [0, 1] and get 0.64, then $x = -\ln(1 - 0.64) = 1.0217$. Figure 12.2 plots the cdf of $X$ and shows graphically how this method works. An arbitrary point on the vertical axis at height $r$ is selected and the corresponding value on the horizontal axis is obtained by completing a rectangle. This is the inverse transformation.

Figure 12.2: Inverse transformation method for making draws from the unit exponential. A random uniform draw of 0.64 (so F(x) = 1 - exp(-x) = 0.64) yields x = 1.02.
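
The unit exponential case takes one line of code. The sketch below reproduces the worked value for a uniform draw of 0.64 and then checks, with an arbitrary seed and sample size, that many such draws have a mean close to 1; the function name is hypothetical.

```python
import math
import random

def unit_exponential_from_uniform(r):
    # Invert F(x) = 1 - exp(-x) at the uniform draw r.
    return -math.log(1.0 - r)

x_example = unit_exponential_from_uniform(0.64)
print(round(x_example, 4))   # 1.0217

rng = random.Random(8)
sample = [unit_exponential_from_uniform(rng.random()) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
print(round(sample_mean, 3))
```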

This method is particularly easy to use if the analytical form of $F(\cdot)$ is given and $x$ is a continuous random variable. If there is no closed-form expression available, then the method is still often feasible, albeit computationally more costly, as the inverse cdfs of standard distributions are often available as functions in programs.

The method can be extended to discrete random variables with a cdf that is a step function. For example, if $x$ takes integer values then a uniform draw $r = 0.312$ leads to a draw of $x = j$, where the integer $j$ is such that $F(j - 1) < 0.312$ and $F(j) \ge 0.312$.
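
For a concrete discrete case, the sketch below applies the rule to a Poisson variate. The choice of the Poisson distribution and of mean 2 is an assumption made for illustration; the search simply accumulates the cdf until it first reaches the uniform draw.

```python
import math

def poisson_inverse_cdf(r, lam):
    # Smallest integer j with F(j) >= r, where F is the Poisson(lam) cdf.
    j = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < r and pmf > 0.0:     # the pmf check guards against rounding when r is very close to 1
        j += 1
        pmf *= lam / j
        cdf += pmf
    return j

j_example = poisson_inverse_cdf(0.312, 2.0)
print(j_example)
```

With mean 2 the cdf is about 0.135 at 0 and about 0.406 at 1, so the draw r = 0.312 maps to j = 1.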

A standard method for generating normal random variates is the Box-Muller method. This uses the inverse transformation method, applied to the joint distribution of two independent normal variates rather than to a single variate. Specifically, if $r_1$ and $r_2$ are iid uniform then $x_1 = \sqrt{-2\ln r_1}\cos(2\pi r_2)$ and $x_2 = \sqrt{-2\ln r_1}\sin(2\pi r_2)$ are iid $\mathcal{N}[0, 1]$.
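
A short sketch of the Box-Muller transformation follows. The seed and the sample size are arbitrary, and the first uniform is taken as one minus the generator's output only so that the logarithm is never evaluated at zero.

```python
import math
import random

def box_muller(r1, r2):
    # Map two independent uniforms into two independent standard normal variates.
    radius = math.sqrt(-2.0 * math.log(r1))
    angle = 2.0 * math.pi * r2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(3)
pairs = [box_muller(1.0 - rng.random(), rng.random()) for _ in range(20000)]
x1 = [p[0] for p in pairs]
mean_x1 = sum(x1) / len(x1)
var_x1 = sum((v - mean_x1) ** 2 for v in x1) / len(x1)
print(f"mean = {mean_x1:.3f}, variance = {var_x1:.3f}")
```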


Transformation 

In  some  cases  a  random  variable  with  the  desired  density  can  be  obtained  by  suitable 
transformation  of  a  random  variable  whose  distribution  is  easy  to  draw  from.  Then 
random  variates  can  be  obtained  by  applying  this  same  transformation. 

This transformation method is an obvious way to make draws from distributions based on the normal. Examples include squaring a standard normal variate to obtain a chi-square variate with one degree of freedom, adding squared values of $r$ independent standard normal variates to yield chi-square variates with $r$ degrees of freedom, and computing the ratio of two independent chi-square variates, each divided by its degrees of freedom, to yield $F$-distributed random variables. Transformation methods are not restricted to distributions based on the normal.
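
The transformations from the normal can be sketched as follows. The degrees of freedom (3 and 10), the seed, and the helper name `chi2_draws` are assumptions made for illustration.

```python
import numpy as np

def chi2_draws(df, size, rng):
    """Chi-square(df) draws as sums of df squared standard normals."""
    z = rng.standard_normal(size=(size, df))
    return (z**2).sum(axis=1)

rng = np.random.default_rng(7)       # arbitrary seed
S, df1, df2 = 200_000, 3, 10

num = chi2_draws(df1, S, rng)
den = chi2_draws(df2, S, rng)

# F(df1, df2) draws: ratio of independent chi-squares, each divided by its df.
f_draws = (num / df1) / (den / df2)
print(num.mean(), f_draws.mean())
```

The chi-square draws should average about 3, and the $F$ draws about $10/(10-2) = 1.25$.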


Accept-Reject  Methods 

Suppose we want to draw from the density $f(x)$ but this is difficult. Suppose, however, that there is another density $g(x)$ that covers $f(x)$ in the sense that $f(x) \leq k g(x)$ for all $x$ for some finite constant $k$. This is depicted in Figure 12.3, where the thick line serves to mimic the envelope $k g(x)$.

Figure 12.3: Accept-reject method draws from density g(x) where kg(x) envelopes the desired density f(x).

The accept-reject method draws $x'$ from $g(x)$, rather than from $f(x)$. The draw is accepted, $x = x'$, if

$$r \leq \frac{f(x')}{k\,g(x')},$$

where $r$ is a draw from the uniform distribution. If the condition is not satisfied then the draw is rejected and further draws are made until the condition is satisfied. The appeal of the method depends on the ease of drawing from $g(x)$ rather than $f(x)$. The limitation is that on average a draw will be accepted with probability $1/k$, so that many draws are needed if $k$ is large.
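
A minimal sketch of the accept-reject method follows. The target density $f(x) = 6x(1-x)$ on $[0, 1]$, the uniform envelope density $g(x) = 1$, and the constant $k = 1.5$ are assumptions chosen for illustration, since $f(x) \leq 1.5$ on this interval.

```python
import numpy as np

def f(x):
    """Illustrative target density: Beta(2, 2), f(x) = 6 x (1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def accept_reject(n, rng, k=1.5):
    """Draw n values from f using candidates from g = Uniform[0, 1]."""
    draws, tried = [], 0
    while len(draws) < n:
        x_cand = rng.uniform()       # candidate x' from g(x) = 1
        r = rng.uniform()            # uniform draw used in the acceptance test
        tried += 1
        if r <= f(x_cand) / (k * 1.0):   # accept if r <= f(x') / (k g(x'))
            draws.append(x_cand)
    return np.array(draws), n / tried

rng = np.random.default_rng(99)      # arbitrary seed
sample, accept_rate = accept_reject(20_000, rng)
print(sample.mean(), sample.var(), accept_rate)
```

The observed acceptance rate should be close to $1/k = 2/3$, and the accepted draws should have mean near 0.5 and variance near 0.05.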

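To see how this method works, let $Y$ denote the random variable generated by the accept-reject method, $X$ denote a random variable with density $g(x)$, and $U$ denote a draw from the uniform. Then $Y$ has cdf

$$
\begin{aligned}
\Pr[Y \leq y] &= \Pr\left[X \leq y \mid U \leq f(X)/kg(X)\right] \\
&= \frac{\Pr\left[X \leq y,\; U \leq f(X)/kg(X)\right]}{\Pr\left[U \leq f(X)/kg(X)\right]} \\
&= \frac{\int_{-\infty}^{y} \int_{0}^{f(x)/kg(x)} du\, g(x)\,dx}{\int_{-\infty}^{\infty} \int_{0}^{f(x)/kg(x)} du\, g(x)\,dx} \\
&= \frac{\int_{-\infty}^{y} [f(x)/kg(x)]\, g(x)\,dx}{\int_{-\infty}^{\infty} [f(x)/kg(x)]\, g(x)\,dx} \\
&= \frac{\int_{-\infty}^{y} [f(x)/k]\,dx}{\int_{-\infty}^{\infty} [f(x)/k]\,dx} \\
&= \int_{-\infty}^{y} f(x)\,dx,
\end{aligned}
$$

which is the cdf corresponding to the density $f(x)$ as desired.


Composition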

Sometimes the density $f(x)$ can be expressed as being that from a mixture or a compound distribution, with

$$f(x) = \int g(x|\varepsilon)\,h(\varepsilon)\,d\varepsilon.$$

Then a draw from $f(x)$ can be obtained by first making a draw of $\varepsilon$ from density $h(\varepsilon)$ and then making a draw of $x$ from the conditional density $g(x|\varepsilon)$.

As an example, consider drawing from the negative binomial distribution with mean $\lambda$ and variance $\lambda(1 + \alpha\lambda)$, where both $\lambda$ and $\alpha$ are given constants. Here we may use the fact that the negative binomial distribution can be regarded as a Poisson-gamma mixture (see Chapter 20). First, one draws $\varepsilon$ from a gamma distribution with mean 1 and variance $\alpha$, which can be done by a transformation of the exponential. Second, one draws from the Poisson distribution with mean $\lambda\varepsilon$, given $\varepsilon$ from the previous step.
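
The two-step Poisson-gamma composition can be sketched as follows. For brevity the gamma draw uses NumPy's built-in gamma generator rather than a transformation of the exponential; a gamma with shape $1/\alpha$ and scale $\alpha$ has mean 1 and variance $\alpha$. The parameter values and the function name are illustrative assumptions.

```python
import numpy as np

def draw_negbin(lam, alpha, size, rng):
    """Negative binomial draws via a Poisson-gamma mixture."""
    # Step 1: eps ~ gamma with mean 1 and variance alpha.
    eps = rng.gamma(shape=1.0 / alpha, scale=alpha, size=size)
    # Step 2: y | eps ~ Poisson with mean lam * eps.
    return rng.poisson(lam * eps)

rng = np.random.default_rng(42)      # arbitrary seed
lam, alpha = 2.0, 0.5                # illustrative values
y = draw_negbin(lam, alpha, 400_000, rng)
print(y.mean(), y.var())
```

With these values the draws should have mean near $\lambda = 2$ and variance near $\lambda(1 + \alpha\lambda) = 4$.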

If $h(\varepsilon)$ is a discrete distribution with point mass $p_j$ at $C$ points, $j = 1, \ldots, C$, then the previous integration step is replaced by summation. Thus,

$$f(x) = \sum_{j=1}^{C} p_j\, g(x|\varepsilon = \varepsilon_j).$$

Then, to make $S$ draws from $f(x)$, we draw $S p_j$ observations each from $g(x|\varepsilon = \varepsilon_j)$, and “compose” the required sample of $S$ values by pooling the draws.
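
A short sketch of composition with a discrete mixing distribution follows. The two point masses, the choice of normal component densities with unit variance, and the component means are hypothetical values used only to illustrate pooling $S p_j$ draws from each component.

```python
import numpy as np

rng = np.random.default_rng(3)       # arbitrary seed
S = 10_000
p = np.array([0.3, 0.7])             # point masses p_j (hypothetical)
means = np.array([-2.0, 1.0])        # g(x | eps_j) taken to be N[means_j, 1]

# Draw S * p_j observations from each component, then pool them.
counts = np.rint(S * p).astype(int)
sample = np.concatenate(
    [rng.normal(m, 1.0, size=c) for m, c in zip(means, counts)]
)
rng.shuffle(sample)                  # remove the ordering by component
print(len(sample), sample.mean())
```

The pooled sample has $S$ values and a mean near $0.3 \times (-2) + 0.7 \times 1 = 0.1$.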


Some  Standard  Generators 

The tables in Appendix B describe pseudo-random number generation for several standard continuous and discrete cases. They are based on the assumption that $r, r_1, r_2, \ldots$ are values of independent uniform[0, 1] random variables $R, R_1, R_2, \ldots$. Note that there may exist different methods to generate the corresponding random variable; we list only one or two of these methods.


12.8.3.  Multivariate  Distributions 

Draws from multivariate distributions are generally much more complicated than draws from univariate distributions. For example, methods such as inverse transformation and transformation may no longer be applicable. For many multivariate distributions the method of mixing or composition can be used, as many multivariate distributions are mixture distributions.

Quite general methods are Gibbs sampling and other Markov chain Monte Carlo methods. These are deferred to Section 13.5, as they are extensively applied in Bayesian analysis, which uses complicated multivariate distributions. As will be explained, the draws made using the Gibbs sampler may show some tendency to be correlated, a fact that will reduce the efficiency of the simulator.

Here we restrict attention to the multivariate normal. Then draws are easily obtained by transformation of univariate standard normal draws. Specifically, suppose we wish to make draws from a $q$-dimensional normal distribution, so $x \sim \mathcal{N}[0, \Sigma]$. This can be done by transformation based on the result that a positive definite $\Sigma$ has Choleski decomposition

$$\Sigma = L L',$$

where $L$ is a lower triangular matrix. For example, for $q = 2$ the Choleski decomposition is

$$
\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
=
\begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}
\begin{bmatrix} l_{11} & l_{21} \\ 0 & l_{22} \end{bmatrix},
$$

yielding three equations $l_{11}^2 = \sigma_{11}$, $l_{11} l_{21} = \sigma_{12}$, and $l_{21}^2 + l_{22}^2 = \sigma_{22}$ that can be solved for $l_{11}$, $l_{21}$, and $l_{22}$. Given a $q$-dimensional vector $\varepsilon$ of independent standard normal draws, so that $\varepsilon \sim \mathcal{N}[0, I]$, it is easy to verify that $x = L\varepsilon$, a linear combination of normals, has distribution $\mathcal{N}[0, \Sigma]$. Specifically, $E[L\varepsilon] = 0$, and $V[L\varepsilon] = E[L\varepsilon\varepsilon' L'] = LL' = \Sigma$. The key to this method is that linear combinations of the normal are also normally distributed, a result that does not hold in general for nonnormal distributions.
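
The Choleski-based method can be sketched as follows, using `numpy.linalg.cholesky`, which returns the lower triangular factor. The three-dimensional covariance matrix with unit variances and covariances of 0.5, the seed, and the number of draws are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)      # arbitrary seed
Sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])

L = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = L @ L.T

# Each column of eps is a vector of independent standard normal draws.
eps = rng.standard_normal(size=(3, 100_000))
x = L @ eps                          # each column is a draw from N[0, Sigma]

Sigma_hat = np.cov(x)                # rows of x are treated as variables
print(x.mean(axis=1))
print(Sigma_hat)
```

The sample means should be near zero and the sample covariance matrix near the specified one.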

12.9.  Bibliographic  Notes 

Press  et  al.  (1993)  provide  a  good  starting  point  for  both  quadrature  and  Monte  Carlo  integration 
and  give  further  references,  including  some  given  elsewhere  in  this  chapter. 

The econometrics literature on simulation-based estimation emphasizes the multinomial probit model. The methods have much wider applicability, however, and can be more easily and successfully implemented in other models that are less challenging to fit than the multinomial probit. Lerman and Manski (1981) used simulated frequencies to estimate choice probabilities and found that many draws were needed. McFadden (1989) proposed MSM and demonstrated its consistency and asymptotic normality. Pakes and Pollard (1989) provide a quite general treatment of the asymptotic theory for both MSM and MSL. The relatively accessible survey of Stern (1997) is an excellent place to start. Gourieroux and Monfort (1996) provide a textbook treatment of the basic methods. Many other references are better read in the specific context of models that are discussed in later chapters. In particular, Hajivassiliou and Ruud (1994) emphasize truncated normal models including the multinomial probit and Train (2003) considers a range of discrete choice models including the random parameters logit.


- Exercises - 

12-1 To estimate the integral $I = \int t(x)g(x)\,dx$ by Monte Carlo, the sum $\hat{I} = N^{-1} \sum_{i} t(x_i)g(x_i)/p(x_i)$ is used, where $x_i$ are draws from the importance sampling distribution $p(x)$. Show that $\operatorname{plim} \hat{I} = I$.

12-2 For $f(\theta) = |\Sigma|^{-1/2} \left[1 + \frac{1}{\nu}(\theta - \mu)'\Sigma^{-1}(\theta - \mu)\right]^{-(\nu + d)/2}$, consider the $d$-dimensional integral $\int_{R^d} f(\theta)\,d\theta$. The integrand is the kernel of a multivariate-$t$ density, so the correct answer is the inverse of the normalizing constant.

(a) Evaluate this integral as a Monte Carlo average $S^{-1} \sum_{s=1}^{S} f(\theta^{(s)})/h(\theta^{(s)})$, $\theta^{(s)} \sim h(\theta)$, where the importance density $h(\theta)$ is multivariate-$t$ with the same location and scale as $f(\theta)$, but with a different degrees-of-freedom parameter.

(b) Explore the stability of this average as you vary the degrees of freedom of $h(\theta)$. Increase the mismatch between $f(\theta)$ and $h(\theta)$ by changing the location and scale of $h(\theta)$ and explore further.

12-3 For the MSM estimator in Section 12.5.3 suppose that the simulator is the frequency simulator.

(a) Show that $V_{y,u}[m(\theta_0)] = (1 + 1/S)\,V_{y}[m(\theta_0)]$.

(b) Hence show that the effect of simulation using the frequency simulator is to inflate the variance of the method of moments estimator by $(1 + (1/S))$.

(c) How large is the efficiency loss for the standard errors if $S = 10$?

12-4  For  the  example  in  Section  12.5.6  consider  the  estimator  a  that  solves  J^iLAYi  ~ 
i (“  +  erf)]  =  0.  Obtain  analytical  expressions  for  this  estimator  and  its 
variance. 

12-5 (a) Write an algorithm for drawing a pseudo-random sample from a three-dimensional multivariate normal distribution $\mathcal{N}[0, \Sigma]$ with $\sigma_{jj} = 1$, $j = 1, 2, 3$, and covariances $\sigma_{12} = \sigma_{13} = \sigma_{23} = 0.5$. Draw a sample of 1,000 realizations and compare the estimated means and variances with those of the dgp.

(b) Repeat part (a) with the trivariate normal being replaced by a Student's $t$-distribution with five degrees of freedom.

12-6 Write a computing procedure to make draws from a univariate truncated normal density $\mathcal{TN}_{[a,b]}[\mu, \sigma^2]$ using the inverse transform method given in Section 12.8.2. Here $[a, b]$ are lower and upper truncation points. Choose $\mu = 1$, $\sigma^2 = 4$, and $a = 3$, $b = 4$.

12-7  Consider  the  standard  binary  logit  regression  model  (see  Section  14.3). 

(a)  Write  down  the  log-likelihood  function. 

(b) Introduce a random intercept assumption in which the intercept is drawn from a suitable distribution with finite mean and variance. What justification can you offer for introducing an unobserved heterogeneity term in this way? If the logit model is derived from the random utility model with extreme value errors, how does the random intercept affect that interpretation and/or derivation? [See Revelt and Train, 1998.]

(c) Suggest a suitable distributional assumption for the random intercept; rewrite the likelihood function conditional on unobserved heterogeneity. Next write down the likelihood function with unobserved heterogeneity integrated out.

(d) Describe in a step-by-step manner how to use the maximum simulated likelihood estimation procedure to estimate this model. Explain, with details, how to calculate the variance matrix of unknown parameters. How would you decide how many simulations you will use?

(e) Consider the method of simulated moments as an alternative to the MSL procedure for the random parameter logit. Write down the moment condition(s) conditional on the unobserved heterogeneity term. Then outline an MSM estimation procedure for this model.

12-8 Some computing packages allow you to draw both Poisson and Gamma pseudo-random numbers directly. It is also known that the negative binomial distribution can be derived as a mixture of Poisson and gamma random variables (see Section 20.4).

(a) Write down a procedure for drawing negative binomial-distributed variables using the method of mixtures.

(b) Apply your method by drawing a sample of 10,000 on a Poisson-distributed variable with mean 0.25.

(c) Draw a corresponding sample from a Gamma distribution with mean 1 and variance $\alpha$, with $\alpha$ set to produce negative binomial random variables with variance 0.3125.


CHAPTER 13


Bayesian  Methods 


13.1.  Introduction 

This chapter serves as an introduction to Bayesian econometrics. Bayesian regression analysis has grown in a spectacular fashion since the publication of books by Zellner (1971) and Leamer (1978). Application to routine data analysis has also expanded enormously, greatly aided by revolutionary advances in computer hardware and software technology. In the light of such major developments, a single chapter can never do adequate justice to the many facets of this subject. This chapter therefore has the very modest goal of providing a rough road map to the major ideas and developments in Bayesian econometrics. Despite this modest objective some parts are still quite technical.

The Bayesian approach, unlike the likelihood or frequentist or classical approach presented in previous chapters, requires the specification of a probabilistic model of prior beliefs about the unknown parameters, given an initial specification of a model. Many researchers are uncomfortable about this step, both philosophically and practically. This has traditionally been the basis of the concern that the Bayesian approach is subjective rather than objective. It will be shown that in large samples the role of the prior may be negligible, that relatively uninformative priors can be specified, and that there are methods available for studying the sensitivity of inferences to priors. Therefore, the charge of subjectivity may not always be as serious as many claim.

Bayesian approaches play a potentially large role in applied microeconometrics, especially when dealing with complex models that lack analytically tractable likelihood functions. Chapter 12 introduced simulation-based methods for such situations. These methods, particularly simulated likelihood, are potentially problematic as they generally require maximization of a function using a sufficiently large number of simulation draws that increases at an appropriate rate as the sample size grows. Even with today's powerful computers, analysis of large samples and high-dimensional models can require a formidable amount of computation. Bayesian methods, in contrast, do not require maximization algorithms. Bayesian procedures are flexible enough to be adapted to produce estimates that are excellent (if not perfect) substitutes for maximum likelihood estimates, which are obtained in many cases more efficiently. Indeed, it is not necessary that one goes through a philosophical conversion to use these procedures; they can be adapted for pragmatic reasons.

The foregoing remarks do not mean that Bayesian procedures do not have a deeper rationale and justification. They do. Three features in particular deserve to be mentioned. First, Bayesian procedures can yield the entire posterior distribution of the parameters of interest, leaving the user to decide which moment or quantile of the distribution to report, potentially on the basis of decision-theoretic criteria. One does not need separate estimators for means, medians, quantiles, and so forth as the posterior distribution has them all! Second, Bayesian analysis, being conditional on the data, yields exact finite-sample results, obviating the need for finite-sample corrections or adjustments. This distribution approaches the normal distribution in large samples where the influence of the priors vanishes. Third, Bayesian methods provide a natural way to select models.

Section 13.2 introduces the basic concepts and components of Bayesian analysis and the key properties of Bayesian estimators. These ideas are illustrated in Section 13.3 for the relatively tractable linear regression model. More generally, no closed-form solution exists for the posterior distribution. Section 13.4 presents Monte Carlo integration methods, notably importance sampling, used to obtain numerical estimates of posterior moments. Section 13.5 details Markov chain Monte Carlo methods, notably Gibbs sampling and the Metropolis-Hastings algorithm, used to obtain draws from the (intractable) posterior distribution. An example of these methods is given in Section 13.6. The additional topics of data augmentation and Bayesian model selection are presented in Sections 13.7 and 13.8.


13.2.  Bayesian  Approach 

In the Bayesian approach uncertainty about the value of the parameters $\theta$ is explicitly modeled by introducing a density $\pi(\theta)$ for the prior distribution, so named because it is specified without considering the data currently in hand. It expresses subjective beliefs about the true unknown parameter in the language of probability. Specification of the prior is studied in detail in Section 13.2.4. As an example, suppose that $\theta$ is an income elasticity and on the basis of an economic model or previous studies it is felt that $\theta$ lies between 0.8 and 1.2 with probability 0.95. Then a prior for $\theta$ is $\theta \sim \mathcal{N}[1, 0.1^2]$.

The other ingredient of Bayesian inference is the sample joint density or likelihood $f(y|\theta)$, where in the single-equation case $y$ is an $N \times 1$ vector. Dependence on regressors is suppressed throughout this section, for notational simplicity. Exogenous regressors are introduced in Section 13.3, in which case $f(y|\theta)$ becomes $f(y|X, \theta)$ and Bayesian analysis is then conditional on regressors. Note also that in this chapter $f(\cdot)$ usually denotes the joint density of all observations, rather than the density of the $i$th observation.

If no data are available then all we have is the prior. After data are observed, the classical approach is to estimate the unknown parameter $\theta$ using the maximum likelihood principle. The Bayesian approach instead combines the likelihood of the sample with the prior, reflecting the view that any prior information should be exploited, even if it is in the form of a probability distribution. This process can be thought of as a revision of the prior given the data (likelihood). Indeed, we can derive a distribution of $\theta$ after combining the likelihood and the prior. The resulting distribution is called a posterior distribution, and it reflects the investigator's beliefs about $\theta$ a posteriori, that is, after observing the data.


13.2.1.  Bayes’  Theorem 


The basic result that delivers the posterior distribution is Bayes' Theorem, also referred to sometimes as Bayes' inverse law of probability, that

$$f(\theta|y) = \frac{f(y|\theta)\,\pi(\theta)}{f(y)}, \tag{13.1}$$

where $f(y)$ denotes the marginal probability distribution of $y$, formally defined as

$$f(y) = \int_{R(\theta)} f(y|\theta)\,\pi(\theta)\,d\theta, \tag{13.2}$$

where $R(\theta)$ denotes the support of $\pi(\theta)$. This result is obtained by noting that, for events $A$ and $B$, the conditional probability

$$\Pr[A|B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{\Pr[B|A]\Pr[A]}{\Pr[B]},$$

where the second equality follows because $\Pr[B|A] = \Pr[A \cap B]/\Pr[A]$.

Because the denominator $f(y)$ in (13.1) is free of $\theta$, we can more simply write $p(\theta|y)$ as proportional to the product of the pdf and the prior; thus

$$p(\theta|y) \propto L(y|\theta)\,\pi(\theta). \tag{13.3}$$

This simplifies derivation and representation of the posterior, by omitting inessential constants that can be recovered later, as will be illustrated in Section 13.2.2. When a density function is written without normalizing constants it is referred to as a density kernel.

In many cases (13.1) or (13.3) do not yield a closed-form expression for the posterior density. A closed-form expression is not needed, however, and later sections present recent simulation-based techniques for obtaining good numerical approximations to the posterior density. These techniques permit Bayesian analysis for almost any parametric microeconometrics application.

It is common to use a special symbol for the posterior density, so we will replace $f(\theta|y)$ by $p(\theta|y)$. Also, the original joint density $f(y|\theta)$ is the likelihood function $L(y|\theta)$. Henceforth we will write the posterior density as

$$p(\theta|y) \propto L(y|\theta)\,\pi(\theta). \tag{13.4}$$

This representation, the key one for the Bayesian approach, emphasizes an important difference between the frequentist and Bayesian approaches. In the frequentist approach, the true value of the parameter is constant but parameter estimates are treated as random variables. In contrast, in the Bayesian approach the parameter is treated as if it is random.


13.2.2.  Bayes’  Theorem  Example 

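Suppose $y \sim \mathcal{N}[\theta, \sigma^2]$, where $\sigma^2$ is known but the scalar parameter $\theta$ is unknown. Given a random sample $(y_1, \ldots, y_N)$, the joint density of $y$ is

$$
\begin{aligned}
L(y|\theta) &= \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{-(y_i - \theta)^2/2\sigma^2\right\} \\
&= (2\pi\sigma^2)^{-N/2} \exp\left\{-\sum_{i=1}^{N} (y_i - \theta)^2/2\sigma^2\right\} \\
&\propto \exp\left\{-\frac{N}{2\sigma^2}(\bar{y} - \theta)^2\right\},
\end{aligned}
$$

where $\bar{y} = N^{-1}\sum_i y_i$, and we use $\sum_i (y_i - \theta)^2 = \sum_i (y_i - \bar{y} + \bar{y} - \theta)^2 = \sum_i (y_i - \bar{y})^2 + N(\bar{y} - \theta)^2$. Multiplicative terms not involving $\theta$, which are absorbed in the constant of proportionality, are dropped. The frequentist approach maximizes the log-likelihood with respect to $\theta$, leading to the MLE $\hat{\theta} = \bar{y}$.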

The Bayesian approach additionally specifies a prior for $\theta$. An analytically convenient choice is the normal prior, with $\theta \sim \mathcal{N}[\mu, \tau^2]$, where we suppose that values of the prior mean $\mu$ and prior variance $\tau^2$ are specified. A large value of $\tau^2$ indicates greater prior uncertainty than a small value. Then the prior density is

$$
\begin{aligned}
\pi(\theta) &= (2\pi\tau^2)^{-1/2} \exp\left\{-(\theta - \mu)^2/2\tau^2\right\} \\
&\propto \exp\left\{-(\theta - \mu)^2/2\tau^2\right\},
\end{aligned}
$$

where $(2\pi\tau^2)^{-1/2}$, which is free of $\theta$, is absorbed into the factor of proportionality.
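Using (13.4), we obtain the posterior density

$$p(\theta|y) = \frac{L(y|\theta)\,\pi(\theta)}{\int_{-\infty}^{\infty} L(y|\theta)\,\pi(\theta)\,d\theta}, \qquad -\infty < \theta < \infty. \tag{13.5}$$

The denominator ensures that the posterior is proper (i.e., it integrates to 1). For some purposes the denominator can be ignored, in which case we work with $p(\theta|y) \propto L(y|\theta)\,\pi(\theta)$. The numerator can be expanded as follows:

$$
\begin{aligned}
L(y|\theta)\,\pi(\theta) &= (2\pi\sigma^2)^{-N/2} \exp\left\{-\sum_{i=1}^{N} \frac{(y_i - \theta)^2}{2\sigma^2}\right\} (2\pi\tau^2)^{-1/2} \exp\left\{-\frac{(\theta - \mu)^2}{2\tau^2}\right\} \\
&= (2\pi)^{-(N+1)/2} (\sigma^2)^{-N/2} (\tau^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - \theta)^2 - \frac{(\theta - \mu)^2}{2\tau^2}\right\}.
\end{aligned}
\tag{13.6}
$$

Because

$$\sum_{i=1}^{N} (y_i - \theta)^2 = \sum_{i=1}^{N} (y_i - \bar{y})^2 + N(\bar{y} - \theta)^2,$$

and noting that the constant of integration in (13.5) and other multiplicative constants independent of $\theta$ can be absorbed into the proportionality constant, we have

$$
\begin{aligned}
p(\theta|y) &\propto \exp\left\{-\frac{N}{2\sigma^2}(\theta - \bar{y})^2 - \frac{1}{2}\frac{(\theta - \mu)^2}{\tau^2}\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[\frac{(\theta - \mu)^2}{\tau^2} + \frac{(\bar{y} - \theta)^2}{N^{-1}\sigma^2}\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}\frac{(\theta - \mu_1)^2}{\tau_1^2}\right\}.
\end{aligned}
\tag{13.7}
$$

The last line is the kernel of a $\mathcal{N}[\mu_1, \tau_1^2]$ distribution, where

$$
\begin{aligned}
\mu_1 &= \tau_1^2\left(N\bar{y}/\sigma^2 + \mu/\tau^2\right), \\
\tau_1^2 &= \left(N/\sigma^2 + 1/\tau^2\right)^{-1}.
\end{aligned}
\tag{13.8}
$$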


The final line in (13.7) is obtained by completing the square, using the result that for arbitrary scalars $z$, $a_1$, $a_2$, $c_1$, and $c_2$, we have

$$c_1(z - a_1)^2 + c_2(z - a_2)^2 = (c_1 + c_2)\left(z - \frac{c_1 a_1 + c_2 a_2}{c_1 + c_2}\right)^2 + \frac{c_1 c_2}{c_1 + c_2}(a_1 - a_2)^2,$$

where $z = \theta$, $a_1 = \mu$, $a_2 = \bar{y}$, $c_1 = 1/\tau^2$, and $c_2 = 1/(N^{-1}\sigma^2)$. The terms free of $\theta$ are dropped.

In summary, we have the following:

Data: $y|\theta \sim \mathcal{N}[\theta, \sigma^2]$, $\sigma^2$ known.

Prior: $\theta \sim \mathcal{N}[\mu, \tau^2]$, $\mu$, $\tau^2$ specified.

Posterior: $\theta|y \sim \mathcal{N}[\mu_1, \tau_1^2]$, $\mu_1$, $\tau_1^2$ given in (13.8).

The posterior mean $\mu_1$ is a weighted average of the prior mean $\mu$ and the sample mean $\bar{y}$, with weights that reflect the precision of the likelihood via $\sigma^2/N$ and the prior via $\tau^2$. Bayesian practice is to summarize variability using the precision parameter, defined as the reciprocal of the variance. Here the posterior precision $\tau_1^{-2}$ is the sum of the sample precision of $\bar{y}$, $N/\sigma^2$, and the prior precision $1/\tau^2$, so precision is increased by pooling the sample and prior information.

If the prior information is imprecise, so that $1/\tau^2$ is small, then the weight assigned to the prior mean is also small relative to the sample information and the prior plays a minor role in generating the posterior. Similarly, the sample information also dominates as the sample size gets large, since then $N/\sigma^2$ gets large relative to $1/\tau^2$. The posterior distribution tends to the familiar asymptotically normal, except the Bayesian result is that $\theta \sim \mathcal{N}[\bar{y}, \sigma^2/N]$ rather than $\bar{y} \sim \mathcal{N}[\theta, \sigma^2/N]$.


Figure 13.1: Bayesian analysis for mean parameter of normal density: plot of normal likelihood (right), normal prior density (left), and resulting posterior density (center). Horizontal axis: evaluation point.

As a concrete example, suppose $\sigma^2 = 100$, the prior sets $\mu = 5$ and $\tau^2 = 3$, and a sample of size $N = 50$ has sample mean $\bar{y} = 10$. Then the likelihood is $\mathcal{N}[10, 2]$, the prior is $\mathcal{N}[5, 3]$, and from (13.7) and (13.8) the posterior is $\mathcal{N}[8, 1.2]$. These densities are plotted in Figure 13.1. The posterior mean lies between the prior mean and the sample mean, whereas the posterior has variance that is smaller than the variance of both the prior and the likelihood.
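
The arithmetic of this example can be checked with a short sketch. The first part applies the closed-form posterior mean and variance for the normal likelihood with normal prior; the second part normalizes the product of likelihood kernel and prior kernel numerically on a grid. The function name, grid limits, and grid size are illustrative assumptions.

```python
import numpy as np

def normal_posterior(ybar, n, sigma2, mu, tau2):
    """Posterior mean and variance for a normal mean with a normal prior."""
    post_precision = n / sigma2 + 1.0 / tau2     # sample plus prior precision
    tau1_sq = 1.0 / post_precision
    mu1 = tau1_sq * (n * ybar / sigma2 + mu / tau2)
    return mu1, tau1_sq

ybar, n, sigma2, mu, tau2 = 10.0, 50, 100.0, 5.0, 3.0
mu1, tau1_sq = normal_posterior(ybar, n, sigma2, mu, tau2)

# Numerical check: normalize the kernel L(y|theta) * pi(theta) on a grid.
theta = np.linspace(-10.0, 25.0, 35_001)
step = theta[1] - theta[0]
log_kernel = (-0.5 * n * (ybar - theta)**2 / sigma2
              - 0.5 * (theta - mu)**2 / tau2)
kernel = np.exp(log_kernel - log_kernel.max())   # rescaled to avoid underflow
posterior = kernel / (kernel.sum() * step)
grid_mean = (theta * posterior).sum() * step
grid_var = ((theta - grid_mean)**2 * posterior).sum() * step
print(mu1, tau1_sq, grid_mean, grid_var)
```

Both routes should give a posterior mean of 8 and a posterior variance of 1.2, which illustrates that the normalizing constant dropped from the kernel can be recovered afterward.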


13.2.3.  Bayesian  and  Non-Bayesian  Approaches  Compared 

It  is  useful  to  draw  parallels  and  contrasts  between  the  frequentist  and  Bayesian 
approaches. 

In a parametric frequentist formulation the likelihood function is the main basis of statistical inference. Under suitable regularity conditions the MLE is consistent and asymptotically normal. Sampling theory of estimators provides a basis for probability statements about the estimated magnitudes, or functions thereof, or conditional prediction. Prior information on parameters is incorporated by restricted ML estimation.

In a Bayesian analysis, summarized in Table 13.1, the data-generating process and the data are combined with a prior distribution on the parameters. Specification of this prior distribution is discussed in detail in Section 13.2.4. The prior embodies probabilistically specified information before the current data are analyzed and may be based on “received information.” The prior information and the data are combined using Bayes' Theorem.

The outcome of this exercise is the posterior distribution of the parameters $\theta$, which we may think of as the translated likelihood function. Alternatively, given the data, the posterior distribution reflects our “revised prior.” If the sample is small, and perhaps relatively uninformative, the posterior may look like the prior, but if the sample is large, the posterior distribution will reflect the features of the data.

Table 13.1. Bayesian Analysis: Essential Components

Component                   Formula
Sampling model              $(y_1, \ldots, y_N)$ iid from $f(y|\theta)$
Joint density/likelihood    $f(y|\theta)$, $L(y|\theta)$; $\theta \in \Theta$
Prior distribution          $\pi(\theta)$, $\theta \in \Theta$
Posterior density           $p(\theta|y) = f(y|\theta)\pi(\theta) / \int f(y|\theta)\pi(\theta)\,d\theta \propto f(y|\theta)\pi(\theta) \propto L(y|\theta)\pi(\theta)$
Posterior inference         Posterior pdf → parameter estimation, probability statements, prediction, model comparison


13.2.4.  Specification  of  the  Prior 

Bayesian analysis requires specification of the dgp $f(y|\theta)$ and of the prior $\pi(\theta)$. The dgp is usually specified to be the same as that used in a fully parametric likelihood-based analysis. For binary outcomes a logit or probit model might be specified, for count data the Poisson or negative binomial model would be specified, and so on.

The principal challenge introduced by Bayesian analysis, compared to classical analysis, is the need to additionally specify a prior distribution. Results can vary with the choice of prior, as different priors lead to different posterior distributions unless the sample is large enough that the sample information dominates.

One  approach  is  to  choose  a  prior  such  that  it  has  little  impact  on  the  posterior, 
so  that  results  essentially  are  based  on  the  sampled  data.  An  alternative  approach, 
warranted  when  strong  prior  information  is  available,  is  to  specify  a  prior  that  reflects 
this  information.  Both  approaches,  especially  the  latter,  were  historically  constrained 
by  issues  of  tractability  of  the  resulting  posterior,  but  this  has  now  become  much  less  of 
a  consideration  given  recent  computational  advances.  A  popular  intermediate  approach 
is  to  use  hierarchical  priors,  with  uncertainty  about  parameters  expressed  in  terms  of 
probability  functions  that  themselves  involve  other  parameters  about  which  we  are  also 
uncertain. 


Noninformative  Priors 

A  noninformative  prior  is  one  that  has  little  impact  on  the  resulting  posterior 
distribution. 

The obvious way to try to obtain a noninformative prior is to use a uniform prior with $\pi(\theta) = c$ for all $\theta$, where $c > 0$ is a constant, since this places equal weight on all possible values of $\theta$.

One disadvantage of the uniform prior is that if it is used in settings where the parameters $\theta$ are unbounded then the prior is an improper density because then necessarily $\int \pi(\theta)\,d\theta = \infty$. The resulting posterior distribution may then also be improper, though in several leading examples the posterior is nonetheless proper.

Another disadvantage of the uniform prior is that it is not invariant to
reparameterization. For example, for a scalar parameter $\theta > 0$ an alternative obvious
parameterization of the density of $y$ is in terms of the parameter $\gamma = \ln\theta$, as
then $-\infty < \gamma < \infty$. If $\theta$ has a uniform prior, $\pi(\theta) = c$, then the
corresponding prior $\pi^*(\gamma)$ for $\gamma$ is not uniform since
$\pi^*(\gamma) = \pi(\theta)\,|d\theta/d\gamma| = c e^{\gamma}$. Although seemingly
uninformative for one parameterization, the prior is informative in another parameterization.
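
The lack of invariance can be checked by simulation. The following is a minimal sketch, assuming the uniform prior is truncated to $(0, B]$ so that it is proper; the bound `B`, the seed, and the helper names are illustrative choices, not part of the text.

```python
import math
import random

rng = random.Random(0)
B = 5.0        # hypothetical upper bound that makes the uniform prior proper
c = 1.0 / B    # uniform prior density pi(theta) = c on (0, B]

# Draw theta from the uniform prior, then transform to gamma = ln(theta).
# 1 - rng.random() lies in (0, 1], which keeps log() away from zero.
gammas = [math.log(B * (1.0 - rng.random())) for _ in range(100_000)]

def implied_prob(lo, hi):
    """Probability of lo <= gamma < hi under pi*(gamma) = c * exp(gamma)."""
    return c * (math.exp(hi) - math.exp(lo))

def empirical_prob(lo, hi):
    """Share of the transformed draws that fall in [lo, hi)."""
    return sum(lo <= g < hi for g in gammas) / len(gammas)

for lo, hi in [(-1.0, 0.0), (0.0, 1.0)]:
    print(lo, hi, round(implied_prob(lo, hi), 4), round(empirical_prob(lo, hi), 4))
```

Two intervals of equal length in $\gamma$ receive different prior probability, so the prior that is flat in $\theta$ is not flat in $\gamma$.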

The uniform prior can be emulated by specifying a proper prior that has very large
variances. For example, suppose the scalar $\theta$ has a $\mathcal{N}[\mu, \tau^2]$ prior,
where $\tau^2$ is very large. Then for values of $\theta$ likely to be supported by the data
the prior $\pi(\theta) \simeq 1/\sqrt{2\pi\tau^2}$, a constant, because
$\exp[-(\theta - \mu)^2/2\tau^2] \simeq 1$. It is important to note that this obvious
approach, called a vague or diffuse or flat prior, has the same weakness as the uniform
prior. It is not invariant to reparameterization.

Instead, a widely used noninformative prior is Jeffreys' prior,

$$\pi(\theta) \propto |\mathcal{I}(\theta)|^{1/2}, \tag{13.9}$$

where for a vector $\theta$, $|\mathcal{I}(\theta)|$ is the determinant of the information
matrix $\mathcal{I}(\theta) = -E[\partial^2\mathcal{L}/\partial\theta\,\partial\theta']$ with
$\mathcal{L} = \ln L(\mathbf{y}|\theta)$. Jeffreys' prior, named after the pioneering
Bayesian Harold Jeffreys, has the property of invariance to reparameterization or
transformation of model parameters, so that the same prior information is being given
regardless of the particular parameterization chosen.

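To verify Jeffreys' rule, for simplicity consider the scalar parameter case. Given the
transformation $\gamma = h(\theta)$,
$\partial\mathcal{L}/\partial\gamma = \partial\mathcal{L}/\partial\theta \times \partial\theta/\partial\gamma$ and

$$\frac{\partial^2\mathcal{L}}{\partial\gamma^2} = \frac{\partial^2\mathcal{L}}{\partial\theta^2}\left(\frac{\partial\theta}{\partial\gamma}\right)^2 + \frac{\partial\mathcal{L}}{\partial\theta}\,\frac{\partial^2\theta}{\partial\gamma^2}.$$

Taking expectations with respect to the sample density and noting that
$E[\partial\mathcal{L}/\partial\theta] = 0$ by the property of likelihood scores yields

$$\mathcal{I}(\gamma) = \mathcal{I}(\theta)\left(\frac{\partial\theta}{\partial\gamma}\right)^2.$$

It follows that

$$|\mathcal{I}(\gamma)|^{1/2} = |\mathcal{I}(\theta)|^{1/2}\left|\frac{\partial\theta}{\partial\gamma}\right|.$$

In general the prior $\pi(\theta)$ for $\theta$ implies the prior for $\gamma$ is
$\pi^*(\gamma) = \pi(\theta) \times |\partial\theta/\partial\gamma|$. Specializing to prior
(13.9), we have $\pi^*(\gamma) \propto |\mathcal{I}(\theta)|^{1/2} \times |\partial\theta/\partial\gamma|$,
but this is $|\mathcal{I}(\gamma)|^{1/2}$ as desired.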

As an example, suppose $y \sim \mathcal{N}[\mu, \sigma^2]$, and consider three cases. First,
if $\mu$ is the unknown parameter and $\sigma^2$ is known, then the information measure for
$\mu$ is $\mathcal{I}(\mu) = N/\sigma^2$, and Jeffreys' prior $|\mathcal{I}(\mu)|^{1/2} \propto c$,
a constant since here $\sigma^2$ is known. Note that this prior is an improper prior. Second,
if $\sigma^2$ is unknown and $\mu$ is known, then the information measure for $\sigma^2$ is
$\mathcal{I}(\sigma^2) = N/(2\sigma^4)$, and Jeffreys' prior
$|\mathcal{I}(\sigma^2)|^{1/2} \propto \sigma^{-2}$. Third, if both $\mu$ and $\sigma^2$ are
unknown then the determinant of the information matrix is
$|\mathcal{I}(\mu, \sigma^2)| = (N/\sigma^2)(N/2\sigma^4) = N^2/2\sigma^6$. Therefore,
Jeffreys' rule implies that the joint prior $\pi(\mu, \sigma^2) \propto \sigma^{-3}$. Note
that this is different from what we get if we apply Jeffreys' rule to the separate priors
for $\mu$ and $\sigma^2$, as $\pi(\mu) \propto c$ and $\pi(\sigma^2) \propto \sigma^{-2}$
yields $\pi(\mu)\pi(\sigma^2) \propto \sigma^{-2}$.
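
The information measures quoted in this example follow from the normal log-likelihood. The derivation below is a sketch of the intermediate steps, written for an iid sample of size $N$; the steps themselves are not given in the text.

```latex
\begin{aligned}
\mathcal{L} &= -\tfrac{N}{2}\ln(2\pi) - \tfrac{N}{2}\ln\sigma^2
               - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\mu)^2, \\
\frac{\partial^2\mathcal{L}}{\partial\mu^2} &= -\frac{N}{\sigma^2}
  \;\Rightarrow\; \mathcal{I}(\mu) = \frac{N}{\sigma^2}, \\
\frac{\partial^2\mathcal{L}}{\partial(\sigma^2)^2}
  &= \frac{N}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{N}(y_i-\mu)^2
  \;\Rightarrow\; \mathcal{I}(\sigma^2)
   = -\frac{N}{2\sigma^4} + \frac{N\sigma^2}{\sigma^6} = \frac{N}{2\sigma^4}, \\
\frac{\partial^2\mathcal{L}}{\partial\mu\,\partial\sigma^2}
  &= -\frac{1}{\sigma^4}\sum_{i=1}^{N}(y_i-\mu)
  \;\Rightarrow\; E\!\left[\frac{\partial^2\mathcal{L}}{\partial\mu\,\partial\sigma^2}\right] = 0 .
\end{aligned}
```

Because the cross term has zero expectation, the information matrix is diagonal and its determinant is the product of the two diagonal entries.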

Jeffreys’  rule  can  serve  as  a  method  of  generating  a  prior  when  there  are  no  obvious 
candidate  priors  available.  However,  the  literature  does  not  seem  to  have  resolved  the 
issue  of  whether  the  rule  produces  a  noninformative  prior  and  if  so  in  what  sense. 
Further,  as  is  clear  from  the  preceding  example  Jeffreys’  prior  can  be  improper,  which 
may  lead  to  an  improper  posterior. 


Conjugate  Priors 

When  a  proper  prior  is  specified,  either  as  an  informative  prior  or  as  a  diffuse  prior,  it 
is  convenient  to  choose  a  functional  form  for  the  prior  that,  given  the  specified  sample 
density  for  the  data,  leads  to  a  “nice”  analytically  tractable  expression  for  the  posterior, 
such  as  (13.7). 

Such tractable results most often arise if the sample and prior densities form a natural
conjugate pair, defined as having the property that the sample density and the prior and
posterior distributions all lie in the same class of densities. Then the prior is called
a natural conjugate prior. Section 13.2.2 gave an example, where for normally distributed
data a normal prior for the mean leads to a posterior that is also normal.

The  exponential  family  is  essentially  the  only  class  of  densities  to  have  natural 
conjugate  priors.  A  one-parameter  member  of  the  exponential  family  has  a  density 
that  for  a  single  observation  can  be  expressed  as 

$$\begin{aligned} f(y|\theta) &= \exp\{a(\theta) + b(y) + c(\theta)u(y)\} \\ &\propto \exp\{a(\theta) + c(\theta)u(y)\}, \end{aligned} \tag{13.10}$$

where different functions $a(\cdot)$, $c(\cdot)$, and $u(\cdot)$ lead to different densities
in the family, and $b(\cdot)$ is a normalizing constant. For example, setting
$c(\theta) = \mu/\sigma^2$, $a(\theta) = -\mu^2/2\sigma^2$, and $u(y) = y$ yields the kernel
of the $\mathcal{N}[\mu, \sigma^2]$ distribution (for $\sigma^2$ known). Note that setting
$u(y) = y$ yields the linear exponential family, presented in some detail in Section 5.7.3.
More generally, if $\theta$ is a vector then $c(\theta)u(y)$ is replaced by
$c(\theta)'u(y)$, where usually $u(\cdot)$ has the same dimension as $\theta$.

For a random sample of size $N$ the exponential family leads to sample density

$$L(\mathbf{y}|\theta) \propto \exp\{N a(\theta) + c(\theta) t(\mathbf{y})\}, \tag{13.11}$$

where $t(\mathbf{y}) = \sum_{i=1}^{N} u(y_i)$. Consider the following prior on $\theta$:

$$\pi(\theta|\beta, \alpha) \propto \exp\{\beta a(\theta) + \alpha c(\theta)\}, \tag{13.12}$$

where $\alpha$ and $\beta$ are specified parameters of the prior and the functions
$a(\cdot)$ and $c(\cdot)$ are the same as those in (13.10). This density is an exponential
family density for $\theta$ once $\alpha$ is viewed as fixed. Applying Bayes' Theorem and
simplifying, we get

$$\begin{aligned} p(\theta|\mathbf{y}) &\propto L(\mathbf{y}|\theta)\,\pi(\theta|\beta, \alpha) \\ &\propto \exp\{(\beta + N)a(\theta) + (\alpha + t(\mathbf{y}))c(\theta)\}, \end{aligned} \tag{13.13}$$

which is readily verified to have the same kernel as the original prior in (13.12).
Comparison of the posterior with the sample density reveals that the prior is treated as
providing an additional $\beta$ observations $\mathbf{y}_p$, say, with $t(\mathbf{y}_p) = \alpha$.

Table 13.2. Conjugate Families: Leading Examples

Distribution   Sample Density                               Conjugate Prior Density
Normal         $\mathcal{N}[\theta, \sigma^2]$              $\theta \sim \mathcal{N}[\mu, \tau^2]$
Normal         $\mathcal{N}[\mu, 1/\theta^2]$               $\theta \sim \mathcal{G}[\alpha, \beta]$
Binomial       $\mathcal{B}[N, \theta]$                     $\theta \sim \text{Beta}[\alpha, \beta]$
Poisson        $\mathcal{P}[\theta]$                        $\theta \sim \mathcal{G}[\alpha, \beta]$
Gamma          $\mathcal{G}[\nu, \theta]$                   $\theta \sim \mathcal{G}[\alpha, \beta]$
Multinomial    $\mathcal{MN}[\theta_1, \ldots, \theta_k]$   $\theta_1, \ldots, \theta_k \sim \text{Dirichlet}[\alpha_1, \ldots, \alpha_k]$

Table 13.2 presents some standard conjugate families, where the relevant densities
are provided in Appendix B. The gamma includes exponential and chi-square as
special cases. Negative binomial, uniform, and Pareto likelihoods also have conjugate
prior densities.

An attraction of a conjugate prior is the resulting computational and analytical
simplicity. Nevertheless, using a conjugate prior is a restriction and the justification
for imposing it is less compelling now than it was in the past when computational
resources available to a typical researcher were rather limited.

Another  advantage  of  having  a  posterior  that  is  in  the  same  class  as  the  prior  is  that 
the  posterior  can  easily  replace  the  prior  as  a  new  (data-based)  prior  for  a  later  analysis. 
If  a  prior  is  to  be  interpreted  as  “received  information,”  then  one  may  take  the  posterior 
from  one  investigation  as  a  prior  for  the  next. 
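
Conjugate updating, and the reuse of a posterior as the next prior, can be sketched for the Poisson and gamma pair of Table 13.2. This is an illustrative sketch only: it assumes the gamma prior is parameterized by shape and rate, and the function name and the data are hypothetical.

```python
def gamma_poisson_update(shape, rate, counts):
    """Posterior (shape, rate) for a Poisson mean with a Gamma(shape, rate) prior.

    Assumes the shape/rate parameterization, so the prior mean is shape / rate.
    """
    return shape + sum(counts), rate + len(counts)

prior = (2.0, 1.0)      # hypothetical prior parameters
batch1 = [2, 0, 3, 1]   # hypothetical count data from a first study
batch2 = [4, 1, 2]      # hypothetical count data from a later study

# Analyze all data at once ...
post_all = gamma_poisson_update(*prior, batch1 + batch2)
# ... or use the posterior from the first study as the prior for the second.
post_seq = gamma_poisson_update(*gamma_poisson_update(*prior, batch1), batch2)

post_mean = post_all[0] / post_all[1]
print(post_all, post_seq, post_mean)
```

Both routes give the same gamma posterior, which is what makes the posterior a convenient data-based prior for a later analysis.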


Hierarchical  Priors 

Hierarchical  priors  are  those  that  arise  when  the  parameters  in  a  prior  are  themselves 
modeled  as  having  a  distribution.  The  parameters  that  appear  in  such  a  “prior  on  a 
prior”  are  called  hyperparameters. 

The data have joint density $L(\mathbf{y}|\theta)$, as in Section 13.2.1, but now the prior
on $\theta$ depends on parameters $\tau$, say, that are random rather than fixed. Thus the
prior on $\theta$ is $\pi(\theta|\tau)$, where the parameters $\tau$ in turn have a prior
$\pi(\tau)$. The joint prior is $\pi(\theta, \tau) = \pi(\theta|\tau)\pi(\tau)$ and Bayes'
rule yields the joint posterior

$$p(\theta, \tau|\mathbf{y}) \propto L(\mathbf{y}|\theta)\,\pi(\theta|\tau)\,\pi(\tau).$$

Interest will usually lie in the marginal posterior for $\theta$, which is obtained by
integrating the joint posterior with respect to $\tau$. The specified parameters of the
prior $\pi(\tau)$ are called hyperparameters. Alternatively, these parameters in turn can
be given a prior, in which case another hierarchical level is introduced, leading to a
joint prior that is the product of three densities, and so on. Recent advances in
computational methods for Bayesian analysis, particularly the Gibbs sampler, are well
suited to hierarchical priors because of their recursive structure.

Hierarchical priors can be viewed as a Bayesian analogue of random coefficient
models in a classical setting. For example, for iid count data we might suppose that
$y_i \sim \mathcal{P}[\theta_i]$, where the Poisson parameter is now random. A convenient
distribution for $\theta_i$ is the conjugate gamma distribution, so
$\theta_i \sim \mathcal{G}[\alpha, \beta]$. The classical approach estimates $\alpha$ and
$\beta$ by maximum likelihood. A nonhierarchical Bayesian model specifies values for
$\alpha$ and $\beta$ and obtains the posterior for $\theta_i$. A hierarchical Bayesian model
specifies priors for $\alpha$ and $\beta$, such as the gamma that is conjugate, and first
obtains the joint posterior for $\theta_i$, $\alpha$, and $\beta$ before finding the
marginal posterior for $\theta_i$.

Hierarchical priors arise naturally in the context of hierarchical models, also
known as multilevel models. Such models are widely applied in classical settings
using special purpose software (Bryk and Raudenbush, 1992, 2002). An early contribution
by Lindley and Smith (1972) analyzed hierarchical regression models in a Bayesian
setting. Hierarchical modeling has a natural appeal when the data to be analyzed
fall into strata, groups, or layers, and one may further expect to see groupwise
parameter variation in the relationship of interest. For example, observations on
test scores could come from students in specific grades and schools. Modeling
of test scores could involve individual characteristics that by definition vary across
individuals, class characteristics that vary across grades, and school characteristics
that vary only across schools. Because such data will involve clustering of observations,
this topic is also discussed in Chapter 24. Such models also have a close relationship
with the random effects formulation for panel data.

As an example, suppose that data naturally fall into $J$ groups, and that the population
mean of $y$ differs across the groups. For individual $i$ in group $j$ suppose
$y_{ij} \sim \mathcal{N}[\theta_j, \sigma^2]$, where for simplicity we assume $\sigma^2$ is
known. Then the sample mean in the $j$th group
$\bar{y}_j \sim \mathcal{N}[\theta_j, \sigma^2/N_j]$, where $N_j$ denotes the number of
individuals in the group and independence is assumed. A hierarchical model specifies the
means $\theta_j$ to have prior $\theta_j \sim \mathcal{N}[\mu, \tau^2]$, for example, where
additional priors are specified for the parameters $\mu$ and $\tau^2$ of the higher level
prior.
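
To see what the group-level prior does, it helps to hold the higher-level parameters fixed for a moment. The sketch below is an illustration under that simplifying assumption: with $\mu$ and $\tau^2$ treated as known, the posterior of each $\theta_j$ is the usual normal-normal result, a precision-weighted average of the group mean and $\mu$. All numbers and the function name are hypothetical; a full hierarchical analysis would also average over the priors on $\mu$ and $\tau^2$.

```python
def conditional_posterior_theta(ybar_j, n_j, sigma2, mu, tau2):
    """Posterior mean and variance of theta_j given mu and tau2 (treated as known)."""
    data_precision = n_j / sigma2     # precision of the group sample mean
    prior_precision = 1.0 / tau2      # precision of the N[mu, tau2] prior
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * ybar_j + prior_precision * mu)
    return post_mean, post_var

sigma2, mu, tau2 = 4.0, 0.0, 1.0      # hypothetical values
small_group = conditional_posterior_theta(ybar_j=10.0, n_j=2, sigma2=sigma2, mu=mu, tau2=tau2)
large_group = conditional_posterior_theta(ybar_j=10.0, n_j=200, sigma2=sigma2, mu=mu, tau2=tau2)
print(small_group, large_group)
```

The small group is pulled strongly toward $\mu$, whereas the large group stays close to its own sample mean.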


Sensitivity  Analysis 

In a frequentist analysis one may entertain a variety of exact prior restrictions in
formulating a model for estimation. For example, a model may be estimated under one
or more sets of restrictions, and the results can be compared to form an idea of the
sensitivity of the estimates to prior assumptions.

The same logic and approach applies in Bayesian analysis. One need not take the
prior to be literally true, and one can perform a sensitivity analysis that studies how the
posterior changes with different choices of prior. Similarly, one can vary assumptions
about the dgp and see how posterior beliefs change in response.
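
A sensitivity analysis can be as simple as recomputing a posterior summary under several priors. The sketch below uses the binomial and beta pair of Table 13.2; the three priors, the data, and the helper names are hypothetical choices made for illustration.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of theta for a Beta[a, b] prior and binomial data."""
    return (a + successes) / (a + b + trials)

# Hypothetical priors spanning quite different beliefs about theta.
priors = {
    "uniform": (1.0, 1.0),
    "concentrated near 0.5": (10.0, 10.0),
    "favoring small theta": (1.0, 5.0),
}

def spread(successes, trials):
    """Range of posterior means across the candidate priors."""
    means = [posterior_mean(a, b, successes, trials) for a, b in priors.values()]
    return max(means) - min(means)

small_sample_spread = spread(14, 20)
large_sample_spread = spread(1400, 2000)
print(small_sample_spread, large_sample_spread)
```

With 20 observations the posterior mean moves noticeably with the prior; with 2,000 observations the choice of prior barely matters.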

13.2.5.  Densities  and  Measures  Related  to  the  Posterior 

Bayesian analysis is based on the posterior distribution. For convenience Bayesian
regression results usually report only summary measures, such as posterior moments,
quantiles, or marginal distributions of components of $\theta$. However, the posterior
distribution is also used for prediction and probability statements, detailed in this
section, and for model comparison, presented in Section 13.8.

Several  quantities  play  an  important  role  in  a  Bayesian  analysis. 

Marginal  Posterior 

In general $\theta$ is multidimensional, denoted by $\theta = (\theta_1, \ldots, \theta_q)$,
and interest may lie in the posterior distribution of individual components of $\theta$.
The marginal posterior density of the $k$th parameter, $\theta_k$, is obtained by
integrating out of the joint posterior all the remaining $(q - 1)$ elements of $\theta$.
Formally, this is denoted as $p(\theta_k|\mathbf{y})$ and is obtained by calculating the
$(q - 1)$-fold integral

$$\begin{aligned} p(\theta_k|\mathbf{y}) &= \int p(\theta_1, \ldots, \theta_q|\mathbf{y})\,d\theta_1 \cdots d\theta_{k-1}\,d\theta_{k+1} \cdots d\theta_q \\ &= \int p(\theta|\mathbf{y})\,d\theta_{-k}, \end{aligned} \tag{13.14}$$

where the more compact notation in the second line contains $\theta_{-k}$, which means all
elements of $\theta$ other than $\theta_k$. The marginal posterior density is usually
asymmetric and need not be unimodal, whereas the asymptotic normal distribution for
classical estimators is symmetric and unimodal. It can be useful to graph the posterior,
especially if it departs considerably from a symmetric unimodal distribution.
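
For a low-dimensional problem the integral in (13.14) can be approximated on a grid. The sketch below is illustrative only: it assumes normal data with unknown mean and variance, a prior proportional to $1/\sigma^2$, hypothetical data, and grid limits chosen by hand.

```python
import math

y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]   # hypothetical sample
n = len(y)

def log_joint(mu, s2):
    """Log joint posterior kernel of (mu, sigma^2) under the prior 1/sigma^2."""
    ss = sum((yi - mu) ** 2 for yi in y)
    return -(n / 2 + 1) * math.log(s2) - ss / (2 * s2)

d_mu, d_s2 = 0.02, 0.05
mu_grid = [3.0 + d_mu * i for i in range(201)]     # 3.0 to 7.0
s2_grid = [0.05 + d_s2 * j for j in range(400)]    # 0.05 to 20.0

# Evaluate the joint kernel on the grid, then normalize it numerically.
joint = [[math.exp(log_joint(m, s)) for s in s2_grid] for m in mu_grid]
total = sum(sum(row) for row in joint) * d_mu * d_s2

# Marginal posterior of mu: integrate (sum) the joint posterior over sigma^2.
marginal_mu = [sum(row) * d_s2 / total for row in joint]
mode_mu = mu_grid[marginal_mu.index(max(marginal_mu))]
print(mode_mu, sum(marginal_mu) * d_mu)
```

Grid integration of this kind becomes impractical as $q$ grows, which is one motivation for the simulation methods discussed in later sections.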

Posterior  Moments 

Classical  regression  output  reports  the  parameter  estimate  and  standard  error.  For 
Bayesian  regression  one  can  similarly  report  the  mean  or  median  and  the  standard 
deviation  of  the  marginal  posterior  density  of  each  parameter. 

Point  Estimation 

In classical analysis there is an unknown true parameter value $\theta_0$ such that the dgp
is $f(\mathbf{y}|\theta_0)$, and we seek a point estimate that is a good estimate of
$\theta_0$. In Bayesian analysis, in contrast, interest lies in the entire distribution of
$\theta$, which is determined by both $\theta_0$ and prior beliefs about $\theta_0$.

Point estimation is therefore emphasized much less in Bayesian analysis. For convenience
the posterior mean or the posterior median is nonetheless commonly reported as a point
estimate. By specifying a loss function an optimal point estimate of a parameter can be
obtained; see Section 13.2.7.

Posterior Intervals

Once the posterior distribution has been obtained, it can be used to make probability
statements analogous to those in the frequentist analysis. In particular, we can consider
Bayesian confidence intervals and regions.

For the $k$th parameter, a $100(1 - \alpha)\%$ posterior density interval
$\mathcal{R}(\theta_k)$ is any interval that $\theta_k$ falls into with posterior
probability $1 - \alpha$, or formally

$$1 - \alpha = \Pr[\theta_k \in \mathcal{R}(\theta_k)|\mathbf{y}] = \int_{\mathcal{R}(\theta_k)} p(\theta_k|\mathbf{y})\,d\theta_k. \tag{13.15}$$

There are many regions that correspond to this probability. The simplest posterior
interval is one between the $\alpha/2$ and $(1 - \alpha/2)$ quantiles, such as between the
2.5 and 97.5 percentiles. More complicated is a highest posterior density (HPD) interval
that satisfies (13.15) and additionally the condition that no point in
$\mathcal{R}(\theta_k)$ has a smaller probability density than any point outside the
region. This interval need not be contiguous if the posterior is multimodal, and it
differs from the simpler interval unless the posterior is symmetric and unimodal.
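
Both kinds of interval are easy to approximate once draws from the posterior are available. The sketch below is an illustration, not a general-purpose routine: it assumes a unimodal posterior, so that the HPD interval is the shortest interval with the required coverage, and it uses gamma draws as a stand-in for a skewed posterior.

```python
import random

rng = random.Random(1)
# Stand-in for draws from a skewed, unimodal marginal posterior (hypothetical).
draws = sorted(rng.gammavariate(2.0, 1.0) for _ in range(20_000))
n = len(draws)
alpha = 0.05

# Equal-tailed interval: cut alpha/2 of the draws from each tail.
k = round(n * alpha / 2)
lo_eq, hi_eq = draws[k], draws[n - k - 1]

# HPD interval for a unimodal posterior: the shortest window of sorted draws
# that contains the same number of draws as the equal-tailed interval.
m = n - 2 * k
start = min(range(n - m + 1), key=lambda i: draws[i + m - 1] - draws[i])
lo_hpd, hi_hpd = draws[start], draws[start + m - 1]

print((lo_eq, hi_eq), (lo_hpd, hi_hpd))
```

For this right-skewed example the HPD interval is shorter and shifted toward zero relative to the equal-tailed interval. For a multimodal posterior the shortest-window shortcut is not appropriate, since the HPD region may consist of several disjoint pieces.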

These intervals can be extended to regions. A $100(1 - \alpha)\%$ highest posterior
density region $\mathcal{R}(\theta)$ is a region such that

$$1 - \alpha = \Pr[\theta \in \mathcal{R}(\theta)|\mathbf{y}] = \int_{\mathcal{R}(\theta)} p(\theta|\mathbf{y})\,d\theta. \tag{13.16}$$

An attraction of the Bayesian approach is that a posterior interval is much simpler to
interpret than a confidence interval in frequentist analysis. If a 95% posterior interval
for $\theta_k$ is $(1, 4)$, then $\theta_k$ lies between 1 and 4 with posterior probability
0.95. In contrast, for a frequentist 95% confidence interval for $\theta_k$ equal to
$(1, 4)$ we can only say that if it were possible to repeat the analysis with many
different samples yielding many different confidence intervals, then 95% of these
confidence intervals will include the true value of $\theta_k$.


Hypothesis  Testing 

Hypothesis testing receives little attention in the Bayesian context. As noted in the
discussion of point estimation, interest does not lie in determining the true parameter
value $\theta_0$. Instead, interest lies in the distribution of the range of values that
$\theta$ might take given the data and a prior. For model comparison see Section 13.8.


Conditional  Posterior  Density 


The conditional posterior density of $\theta_k$, given $\theta_j$, can be obtained from the
joint and marginal posterior densities as

$$p(\theta_k|\theta_j, \theta_j \in \theta_{-k}, \mathbf{y}) = \frac{p(\theta_k, \theta_j|\mathbf{y})}{p(\theta_j|\mathbf{y})}. \tag{13.17}$$

Of special interest and significance is the set of $q$ conditional distributions
$p(\theta_k|\theta_{-k})$, $k = 1, \ldots, q$, also known as the set of full conditional
distributions. These play an important role in the modern computational techniques for
obtaining the joint posterior distribution presented in later sections.

The definitions of marginal and conditional posteriors in (13.14) and (13.17) can be
extended from individual parameters to blocks of parameters.


Marginal  Likelihood 

The marginal probability or marginal likelihood is the denominator in Bayes' rule
and is defined as

$$f(\mathbf{y}) = \int L(\mathbf{y}|\theta)\,\pi(\theta)\,d\theta. \tag{13.18}$$

It is the expected value of the likelihood, $E[L(\mathbf{y}|\theta)]$, where the expectation
is with respect to the prior density. The marginal likelihood constitutes a basis for
Bayesian inference (see Section 13.8), as it contains information about the support in the
data for the prior.
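
Because the marginal likelihood is a prior expectation of the likelihood, it can be approximated by averaging the likelihood over draws from the prior. The sketch below is a hedged illustration using a binomial likelihood with a uniform prior on $\theta$, a case where the integral in (13.18) has the known closed form $1/(N + 1)$; the data and variable names are hypothetical.

```python
import math
import random

rng = random.Random(2)
N, k = 10, 7    # hypothetical data: 7 successes in 10 trials

def likelihood(theta):
    """Binomial probability of k successes in N trials."""
    return math.comb(N, k) * theta**k * (1.0 - theta) ** (N - k)

# Average the likelihood over draws from the uniform prior on (0, 1).
prior_draws = [rng.random() for _ in range(100_000)]
mc_estimate = sum(likelihood(t) for t in prior_draws) / len(prior_draws)

exact = 1.0 / (N + 1)   # closed form under the uniform prior
print(mc_estimate, exact)
```

This simple averaging works here because the prior is proper and not too diffuse relative to the likelihood; in less favorable problems most prior draws contribute almost nothing and the estimate can be very noisy.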


Posterior  Predictive  Density 

Consider out-of-sample prediction of a single observation $y_p$. This has density
$f(y_p|\theta)$, where $\theta$ is unknown. The posterior predictive density of $y_p$
weights this density by the posterior probability distribution of $\theta$, yielding

$$f_p(y_p) = \int f(y_p|\theta)\,p(\theta|\mathbf{y})\,d\theta. \tag{13.19}$$

If covariates appear in the likelihood function as in a regression model, then these
densities are conditioned on them also.
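
The integral in (13.19) can be mimicked by simulation: draw $\theta$ from the posterior, then draw $y_p$ from its density given that $\theta$. The sketch below assumes, purely for illustration, normal data with known variance and a normal posterior for the mean; the numerical values are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)
post_mean, post_var = 1.5, 0.25   # hypothetical normal posterior for theta
sigma2 = 1.0                      # known variance of y given theta

y_pred = []
for _ in range(100_000):
    theta = rng.gauss(post_mean, math.sqrt(post_var))   # draw from the posterior
    y_pred.append(rng.gauss(theta, math.sqrt(sigma2)))  # draw y_p given theta

pred_mean = statistics.fmean(y_pred)
pred_var = statistics.pvariance(y_pred)
print(pred_mean, pred_var)
```

In this normal case the predictive variance is the sum of the sampling variance and the posterior variance of $\theta$, so the predictive density is more dispersed than $f(y_p|\theta)$ evaluated at any single value of $\theta$.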


13.2.6.  Large-Sample  Behavior  of  the  Posterior 

The  influence  of  even  informative  priors  on  the  posterior  diminishes  as  the  sample 
becomes  large,  as  illustrated  in  the  Section  13.2.2  example.  This  is  the  basis  of  the 
statement  that  asymptotically  the  likelihood  dominates  the  inference  or  that  the  weight 
assigned  to  the  prior  essentially  goes  to  zero  as  the  sample  size  grows. 

Because the posterior distribution can be awkward to manipulate, an asymptotic
approximation to the posterior is of interest as it can be used in place of the true
finite-sample posterior distribution. This approximation is easy to obtain since
asymptotically the posterior equals the likelihood. We follow Gelman et al. (1995), to
which the reader is referred for additional detail.

For simplicity assume that observations are iid. Then the log-posterior, up to an
additive constant that does not depend on $\theta$, is

$$\ln p(\theta|\mathbf{y}) = \ln \pi(\theta) + \sum_{i=1}^{N} \ln f(y_i|\theta). \tag{13.20}$$

This representation makes it clear that in a large sample the posterior is dominated by
the likelihood contribution, since the contribution of the prior to the posterior remains
fixed whereas the contribution of the sample to the posterior grows with $N$.

Assume that the posterior $p(\theta|\mathbf{y})$ is unimodal and approximately symmetric.
We consider the asymptotic properties of the posterior mode, denoted by $\hat{\theta}$,
which is then the local and global maximum of the posterior.

To establish consistency of $\hat{\theta}$, we note that the posterior mode converges to
the MLE as $N \to \infty$, since the second term in (13.20) dominates. The posterior mode
is therefore consistent if the MLE is consistent. So $\hat{\theta} \xrightarrow{p} \theta_0$
if the dgp for $\mathbf{y}$ has density $f(\mathbf{y}|\theta_0)$ and the usual regularity
conditions for ML estimation are satisfied.

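To obtain the asymptotic distribution of $\hat{\theta}$, consider a second-order Taylor
series expansion of the log posterior density around the posterior mode $\hat{\theta}$. Then

$$\ln p(\theta|\mathbf{y}) \simeq \ln p(\hat{\theta}|\mathbf{y}) + \frac{1}{2}(\theta - \hat{\theta})' \left[\frac{\partial^2 \ln p(\theta|\mathbf{y})}{\partial\theta\,\partial\theta'}\right]_{\theta = \hat{\theta}} (\theta - \hat{\theta}), \tag{13.21}$$

where simplification occurs because $\partial \ln p(\theta|\mathbf{y})/\partial\theta = 0$
when evaluated at the posterior mode, and we assume that third- and higher-order terms
can be ignored asymptotically. Define

$$I(\hat{\theta}) = -\left[\frac{\partial^2 \ln p(\theta|\mathbf{y})}{\partial\theta\,\partial\theta'}\right]_{\theta = \hat{\theta}}$$

to be the observed information based on the log posterior density
$\ln p(\theta|\mathbf{y})$, evaluated at the posterior mode. Then exponentiating (13.21)
yields

$$p(\theta|\mathbf{y}) \propto \exp\left\{-\frac{1}{2}(\theta - \hat{\theta})' I(\hat{\theta}) (\theta - \hat{\theta})\right\},$$

which is the kernel of a multivariate normal distribution with mean $\hat{\theta}$ and
variance matrix $I(\hat{\theta})^{-1}$. It follows that a posteriori

$$\theta|\mathbf{y} \sim \mathcal{N}[\hat{\theta}, I(\hat{\theta})^{-1}]. \tag{13.22}$$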
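
The quality of this normal approximation can be checked in a case where the exact posterior is known. The sketch below assumes, for illustration only, a Beta posterior for a scalar $\theta$, as would arise from a binomial sample with a beta prior; the parameter values are hypothetical.

```python
import math

a, b = 61.0, 41.0   # hypothetical Beta[a, b] posterior for theta

def log_post(theta):
    """Log posterior kernel of the Beta[a, b] density."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)

# Posterior mode and observed information (minus the second derivative at the mode).
mode = (a - 1) / (a + b - 2)
observed_info = (a - 1) / mode**2 + (b - 1) / (1.0 - mode) ** 2

# Numerical check of the second derivative by central differences.
h = 1e-4
numeric_info = -(log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h**2

approx_sd = math.sqrt(1.0 / observed_info)                        # from (13.22)
exact_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))        # exact Beta sd
print(mode, approx_sd, exact_sd)
```

With about one hundred observations behind the posterior, the approximate and exact posterior standard deviations already agree to roughly two decimal places.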


As the sample size $N$ grows large, the likelihood component of the posterior becomes
dominant and the influence of the prior becomes negligible. In this case we
may replace the mode $\hat{\theta}$ by the MLE, which is the mode of the likelihood density.
This yields a result that is sometimes called a Bayesian central limit theorem (Gamerman,
1997). Asymptotically, frequentist and Bayesian inferences will be based on the same
limiting multivariate normal distribution, and hence there should be no significant
inconsistency between them.

This  result  has  been  labeled  as  the  Bernstein-von  Mises  Theorem  in  the  literature; 
see  Train  (2003,  chapter  12)  for  an  accessible  discussion  of  the  three  components 
of  this  theorem.  These  components  comprise  (1)  the  result  that  the  posterior  mean 
converges  in  probability  to  the  maximum  likelihood  estimator,  (2)  that  it  has  a  limiting 
normal  distribution,  and  (3)  that  the  limiting  distribution  of  the  posterior  mean  is  the 
same  as  that  of  the  maximum  likelihood  estimator.  These  results  are  all  implicit  in 
the  Bayesian  central  limit  theorem.  That  theorem  is  of  great  interest  and  relevance  to 
those  who  wish  to  apply  the  likelihood  principles  of  estimation  and  inference.  The  full 
force  of  its  implications  will  become  apparent  after  we  examine  numerical  methods 
for approximating the posterior distribution.

Do the preceding arguments imply that Bayesian and likelihood-based methods will
produce essentially similar results? Is the choice between the two approaches
largely a matter of computational efficiency? A definitive treatment of these issues
is not available. However, there are a number of examples in the literature that show
not only that the two approaches may produce similar results, but also that Bayesian
methods are frequently computationally more efficient.


13.2.7.  Bayesian  Decision  Analysis 


Given the full posterior distribution $p(\theta|\mathbf{y})$, which point estimate of
$\theta$ should be reported? This question was studied in Section 4.2 for best prediction
of $y$ using, for example, squared error loss. Here instead we consider best estimation of
$\theta$ using, for example, quadratic loss.

Let $L(\hat{\theta}, \theta)$ denote the specified loss function, where $\hat{\theta}$ is an
estimate of the unknown $\theta$. The loss is unknown, as it depends on $\theta$, which is
unknown. We can, however, find the expected value over $\theta$ of the loss since Bayesian
analysis, unlike classical analysis, provides the distribution of $\theta$. The optimal
estimator $\hat{\theta}_{\text{opt}}$ is the estimator $\hat{\theta}$ that minimizes
expected posterior loss, or

$$\min_{\hat{\theta}} E[L(\hat{\theta}, \theta)] = \min_{\hat{\theta}} \int L(\hat{\theta}, \theta)\,p(\theta|\mathbf{y})\,d\theta. \tag{13.23}$$

Losses associated with different $(\hat{\theta}, \theta)$ are weighted by the posterior
probability $p(\theta|\mathbf{y})$.

It can be shown that the posterior mean is the optimal estimator under quadratic loss,
$L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)'(\hat{\theta} - \theta)$. If instead
absolute error loss is used, with $L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$,
then the posterior median is the optimal estimator. Once the posterior distribution
has been established these point estimates can be computed either analytically or
numerically.
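
The link between the loss function and the optimal point estimate can be verified numerically by minimizing the average loss over posterior draws. The sketch below is illustrative: the gamma draws stand in for a skewed posterior of a scalar $\theta$, and the candidate grid and helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(4)
# Stand-in for draws from a skewed posterior of a scalar theta (hypothetical).
draws = [rng.gammavariate(2.0, 1.0) for _ in range(2_000)]
candidates = [0.02 * i for i in range(301)]   # candidate estimates 0.00 to 6.00

def expected_loss(estimate, loss):
    """Posterior expected loss of an estimate, approximated by averaging over draws."""
    return sum(loss(estimate, t) for t in draws) / len(draws)

best_quadratic = min(candidates, key=lambda e: expected_loss(e, lambda a, t: (a - t) ** 2))
best_absolute = min(candidates, key=lambda e: expected_loss(e, lambda a, t: abs(a - t)))

post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)
print(best_quadratic, post_mean, best_absolute, post_median)
```

Up to the resolution of the grid, the quadratic-loss minimizer coincides with the mean of the draws and the absolute-loss minimizer with their median.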

Under some conditions minimizing expected posterior loss can be shown to be
equivalent to minimizing expected posterior risk. The risk function averages the
possible loss over hypothetical samples of $\mathbf{y}$ from the population, so

$$R(\hat{\theta}, \theta) = \int L(\hat{\theta}, \theta)\,f(\mathbf{y}|\theta)\,d\mathbf{y}.$$

To avoid the possible confusion between loss function and likelihood function, here
and in the next equation block, we have used $f(\mathbf{y}|\theta)$ as equivalent to the
likelihood $L(\mathbf{y}|\theta)$. Expected posterior risk averages this risk over
different values of the parameters $\theta \in \Theta$ by weighting with respect to the
posterior density, so

$$\begin{aligned} E[R(\hat{\theta}, \theta)] &= \int_{\Theta} \left\{\int L(\hat{\theta}, \theta)\,f(\mathbf{y}|\theta)\,d\mathbf{y}\right\} p(\theta|\mathbf{y})\,d\theta \\ &= \int \left\{\int_{\Theta} L(\hat{\theta}, \theta)\,p(\theta|\mathbf{y})\,d\theta\right\} f(\mathbf{y}|\theta)\,d\mathbf{y} \\ &= \int E[L(\hat{\theta}, \theta)]\,f(\mathbf{y}|\theta)\,d\mathbf{y}, \end{aligned} \tag{13.24}$$

where in the first equality the outer integral ranges over the domain of $\theta$, in the
second equality the order of integration is interchanged, and in the third line the
conclusion follows. These operations presume that appropriate restrictions on
$L(\hat{\theta}, \theta)$ and $p(\theta|\mathbf{y})$ are satisfied. For example,
$p(\theta|\mathbf{y})$ must be a proper density function and the loss function must be
integrable. Hence expected risk will remain bounded and minimizing it is a well-defined
operation.

The foregoing argument establishes a well-known and important result that the
Bayes estimator is admissible in the sense that it minimizes expected risk for a
specified loss function.


13.3.  Bayesian  Analysis  of  Linear  Regression 

Because the analysis of linear regression is a familiar topic, it provides a useful
portal to more general nonlinear models. The data are assumed to be generated by the
standard linear regression model

$$\mathbf{y} = \mathbf{X}\beta + \mathbf{u},$$

where $\mathbf{X}$ denotes the $N \times K$ full column rank matrix of weakly exogenous
regressors. The errors are assumed to be independent, homoskedastic, and normally
distributed, with $\mathbf{u} \sim \mathcal{N}[\mathbf{0}, \sigma^2\mathbf{I}_N]$. The sample
conditional density is therefore
$\mathbf{y}|\mathbf{X}, \beta, \sigma^2 \sim \mathcal{N}[\mathbf{X}\beta, \sigma^2\mathbf{I}_N]$.
Our exposition follows Zellner (1971).

We deal in turn with noninformative and informative priors. In both cases a closed-form
expression for the posterior can be obtained after some considerable algebra. For a
noninformative prior it will be seen that the OLS estimator has a Bayesian interpretation
as the mean of the posterior distribution. In the informative prior case it will be
seen that the posterior moments are weighted functions of the sample and prior means.

Subsequent sections present methods for less tractable models, but even then analysis
is simplified if results similar to those given in this section can be applied to some
subcomponents of the model.


13.3.1.  Noninformative  Priors 

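For noninformative priors we use Jeffreys' priors. From Section 13.2.4, for
$y \sim \mathcal{N}[\mu, \sigma^2]$ this prior for $\mu$ (given $\sigma^2$ known) is a
constant, whereas the prior for $\sigma^2$ (given $\mu$ known) is proportional to
$\sigma^{-2}$. For the regression case this extends to a constant prior for $\beta_j$,
$j = 1, \ldots, K$, so $\pi(\beta_j) \propto c$, and the prior for $\sigma^2$ is
$\pi(\sigma^2) \propto 1/\sigma^2$. The prior views all values of $\beta_j$ as equally
likely, whereas smaller values of $\sigma^2$ are viewed as being more likely. Assuming
independence of $\beta$ and $\sigma^2$ the joint prior is

$$\pi(\beta, \sigma^2) \propto 1/\sigma^2.$$

The likelihood function can be reexpressed as

$$\begin{aligned} L(\beta, \sigma^2|\mathbf{y}, \mathbf{X}) &= (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta)\right\} \\ &\propto (\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\left[\hat{\mathbf{u}}'\hat{\mathbf{u}} + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})\right]\right\} \\ &\propto (\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\left[(N - K)s^2 + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})\right]\right\}, \end{aligned} \tag{13.25}$$

where $\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and
$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\beta}$; the second line uses
$\mathbf{y} - \mathbf{X}\beta = \hat{\mathbf{u}} - \mathbf{X}(\beta - \hat{\beta})$ and
$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$; and the third line uses
$s^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)$.

Combining the likelihood in (13.25) and the prior, we obtain the posterior density

$$\begin{aligned} p(\beta, \sigma^2|\mathbf{y}, \mathbf{X}) &\propto \left(\frac{1}{\sigma^2}\right)^{N/2} \exp\left\{-\frac{1}{2\sigma^2}\left[(N - K)s^2 + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})\right]\right\} \frac{1}{\sigma^2} \\ &\propto \left(\frac{1}{\sigma^2}\right)^{N/2 + 1} \exp\left\{-\frac{1}{2\sigma^2}\left[(N - K)s^2 + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})\right]\right\} \\ &\propto \left(\frac{1}{\sigma^2}\right)^{K/2} \exp\left\{-\frac{1}{2}(\beta - \hat{\beta})'\left(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)^{-1}(\beta - \hat{\beta})\right\} \\ &\quad \times \left(\frac{1}{\sigma^2}\right)^{(N - K)/2 + 1} \exp\left\{-\frac{(N - K)s^2}{2\sigma^2}\right\}. \end{aligned} \tag{13.26}$$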

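The conditional posterior distribution $p(\beta|\sigma^2, \mathbf{y}, \mathbf{X})$ of
$\beta$, given $\sigma^2$ and the data $\mathbf{y}$, $\mathbf{X}$, is clearly the
$K$-dimensional multivariate normal with mean $\hat{\beta}$ and variance
$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, since $\beta$ appears only in the first line of the
final expression. The conditional posterior of $\sigma^2$ given $\beta$ is more difficult
to obtain as $\sigma^2$ appears in both lines.

The marginal posterior of $\beta$, obtained by integrating out $\sigma^2$, is much more
useful for posterior inference about $\beta$. We integrate the second line of (13.26),
change variables to $z = 1/\sigma^2$ and use the result that
$\int_0^\infty z^c \exp(-az)\,dz = \Gamma(c + 1)/a^{c+1}$ for given constants $a > 0$,
$c > -1$, where here $c = N/2 - 1$ once the Jacobian $|d\sigma^2/dz| = z^{-2}$ of the
change of variables is included, and $a$ is one-half of the term
$(N - K)s^2 + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})$. This yields
the kernel of the marginal posterior distribution

$$\begin{aligned} p(\beta|\mathbf{y}, \mathbf{X}) &\propto \left\{(N - K)s^2 + (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta})\right\}^{-N/2} \\ &\propto \left\{1 + (\beta - \hat{\beta})'\left(s^2(N - K)(\mathbf{X}'\mathbf{X})^{-1}\right)^{-1}(\beta - \hat{\beta})\right\}^{-N/2}, \end{aligned} \tag{13.27}$$

which from Section 13.3.5 is the kernel of a multivariate Student $t$-distribution
centered at $\hat{\beta}$ with $N - K$ degrees of freedom and covariance matrix
$s^2(\mathbf{X}'\mathbf{X})^{-1}$ multiplied by $(N - K)/(N - K - 2)$. Thus

$$\beta \sim t_K\left(\hat{\beta}, s^2(\mathbf{X}'\mathbf{X})^{-1}\right). \tag{13.28}$$

An individual element of $\beta$ has a univariate Student $t$-distribution.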

The marginal posterior for $\sigma^2$ is more easily obtained, by integrating the final expression in (13.26) with respect to $\boldsymbol{\beta}$ and noting that $\boldsymbol{\beta}$ appears in only the first line, which is the kernel of the $\mathcal{N}[\hat{\boldsymbol{\beta}}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}]$ density and integrates to one. It follows that the marginal posterior for $\sigma^2$ is

$$p(\sigma^2 | \mathbf{y}, \mathbf{X}) \propto (\sigma^2)^{-(N-K+1)/2} \exp\left( -\frac{(N-K)s^2}{2\sigma^2} \right). \tag{13.29}$$


This  expression  is  known  to  be  the  kernel  of  an  inverted  square-root  gamma  density. 
That  is,  it  is  the  density  of  a  random  variable  that  is  the  reciprocal  of  the  square-root 
of  a  gamma-distributed  random  variable  with  degrees-of-freedom  parameter  N  —  K. 
This result is identical to that obtained under the frequentist analysis of the distribution of $\hat{\boldsymbol{\beta}}$.

For normal linear regression, Bayesian analysis with noninformative priors therefore yields qualitatively similar conclusions to those from the standard frequentist analysis in finite samples. Conditional on $\sigma^2$ the posterior of $\boldsymbol{\beta}$ is the $\mathcal{N}[\hat{\boldsymbol{\beta}}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}]$ distribution, and unconditionally the posterior of $\boldsymbol{\beta}$ is the multivariate $t$-distribution.

The interpretation is quite different, however, as these distributions are of the unknown parameter $\boldsymbol{\beta}$ with mean $\hat{\boldsymbol{\beta}}$, rather than of an estimate $\hat{\boldsymbol{\beta}}$ with unknown mean $\boldsymbol{\beta}$. For example, the Bayesian 95% HPD interval for $\beta_j$ is $\hat{\beta}_j \pm t_{.025, N-K} \times \mathrm{se}[\hat{\beta}_j]$, where $\mathrm{se}[\hat{\beta}_j] = (s^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{jj})^{1/2}$. From Section 13.2.5 the interpretation is that $\beta_j$ lies in this interval with posterior probability 0.95.
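The normal-then-$t$ structure above can be checked by simulation. The following is a minimal sketch, not the authors' procedure: it assumes a one-regressor model without intercept ($K = 1$), simulated data, and the standard result that under the $1/\sigma^2$ prior $(N-K)s^2/\sigma^2$ is chi-square distributed with $N-K$ degrees of freedom. The function names are hypothetical. Drawing $\sigma^2$ first and then $\beta$ given $\sigma^2$ should give draws of $\beta$ centered at the OLS estimate with variance $s^2 (x'x)^{-1}$ inflated by $(N-K)/(N-K-2)$.

```python
import math
import random


def ols_through_origin(x, y):
    """Return (beta_hat, s2, xtx) for the one-regressor model y_i = beta * x_i + u_i."""
    xtx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / xtx
    rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (len(y) - 1)  # K = 1 regressor
    return beta_hat, s2, xtx


def posterior_draws(x, y, n_draws, rng):
    """Draw beta from its marginal posterior by composition: sigma2, then beta | sigma2."""
    beta_hat, s2, xtx = ols_through_origin(x, y)
    dof = len(y) - 1
    draws = []
    for _ in range(n_draws):
        chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square(dof) as Gamma(dof/2, scale 2)
        sigma2 = dof * s2 / chi2                 # assumed result for the 1/sigma^2 prior
        draws.append(rng.gauss(beta_hat, math.sqrt(sigma2 / xtx)))
    return draws


rng = random.Random(0)
x = [rng.uniform(0.0, 2.0) for _ in range(50)]     # simulated regressor (illustrative)
y = [1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]   # simulated outcome (illustrative)
beta_hat, s2, xtx = ols_through_origin(x, y)
draws = posterior_draws(x, y, 20000, rng)
post_mean = sum(draws) / len(draws)
post_var = sum((b - post_mean) ** 2 for b in draws) / (len(draws) - 1)
print(beta_hat, post_mean)            # posterior mean should be close to the OLS estimate
print(s2 / xtx * 49 / 47, post_var)   # Student-t variance versus simulated variance
```

The simulated quantiles of `draws` could be used in the same way to approximate the HPD interval without reference to $t$ tables.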


13.3.2.  Informative  Priors 

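Bayesian analysis of the normal linear regression model under informative priors is especially insightful if we use independent conjugate priors for $\boldsymbol{\beta}$ and $\sigma$. From Section 13.2.4, the conjugate prior for $\boldsymbol{\beta}$ is the normal, and the conjugate prior for $1/\sigma^2$ is the gamma. This leads to the normal-gamma prior

$$\pi(\boldsymbol{\beta}, 1/\sigma^2) = \pi_N(\boldsymbol{\beta} | 1/\sigma^2)\, \pi_\gamma(1/\sigma^2),$$

where $\pi_N(\boldsymbol{\beta} | 1/\sigma^2)$ is the $\mathcal{N}[\boldsymbol{\beta}_0, \sigma^2 \boldsymbol{\Omega}_0^{-1}]$ density, with $\boldsymbol{\beta}_0$ and $\boldsymbol{\Omega}_0$ known, and the kernel is

$$\pi_N(\boldsymbol{\beta} | 1/\sigma^2) \propto \sigma^{-K} \exp\left\{ -\frac{(\boldsymbol{\beta} - \boldsymbol{\beta}_0)' \boldsymbol{\Omega}_0 (\boldsymbol{\beta} - \boldsymbol{\beta}_0)}{2\sigma^2} \right\}, \tag{13.30}$$

and $\pi_\gamma(1/\sigma^2)$ is the $\mathcal{G}[\nu_0, s_0^2]$ density where $\nu_0$ and $s_0^2$ are known constants, and

$$\pi_\gamma(1/\sigma^2) \propto \sigma^{-(\nu_0+1)} \exp\left\{ -\frac{\nu_0 s_0^2}{2\sigma^2} \right\}. \tag{13.31}$$

Note that the prior for the (location) parameter $\boldsymbol{\beta}$ depends on the (scale) parameter $\sigma$. This makes sense as $\sigma$ reflects the scale on which $\mathbf{y}$ is measured and hence should affect $\boldsymbol{\beta}$. Given this prior and the likelihood in (13.25), the posterior density is of a normal-gamma type. After some algebra it is as follows: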


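$$
\begin{aligned}
p(\boldsymbol{\beta}, 1/\sigma^2 | \mathbf{y}, \mathbf{X})
&\propto (\sigma^2)^{-N/2} \exp\left\{ -\frac{s^2 (N-K)}{2\sigma^2} \right\} \exp\left\{ -\frac{(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})}{2\sigma^2} \right\} \\
&\quad \times (\sigma^2)^{-K/2} \exp\left\{ -\frac{(\boldsymbol{\beta} - \boldsymbol{\beta}_0)' \boldsymbol{\Omega}_0 (\boldsymbol{\beta} - \boldsymbol{\beta}_0)}{2\sigma^2} \right\} \\
&\quad \times (\sigma^2)^{-(\nu_0/2)-1} \exp\left\{ -\frac{\nu_0 s_0^2}{2\sigma^2} \right\} \\
&\propto (\sigma^2)^{-(\nu_0+N)/2-1} \exp\left\{ -\frac{\nu_1 s_1^2}{2\sigma^2} \right\} \\
&\quad \times (\sigma^2)^{-K/2} \exp\left\{ -\frac{(\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})' \boldsymbol{\Omega}_1 (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})}{2\sigma^2} \right\},
\end{aligned}
\tag{13.32}
$$

where $\bar{\boldsymbol{\beta}}$ and $\boldsymbol{\Omega}_1^{-1}$ denote the posterior mean and variance of $\boldsymbol{\beta}$ and $s_1^2$ denotes the posterior mean of $\sigma^2$, defined as

$$
\begin{aligned}
\bar{\boldsymbol{\beta}} &= (\boldsymbol{\Omega}_0 + \mathbf{X}'\mathbf{X})^{-1} (\boldsymbol{\Omega}_0 \boldsymbol{\beta}_0 + \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}), \\
\boldsymbol{\Omega}_1 &= (\boldsymbol{\Omega}_0 + \mathbf{X}'\mathbf{X}), \\
\nu_1 s_1^2 &= \nu_0 s_0^2 + \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\boldsymbol{\beta}_0 - \hat{\boldsymbol{\beta}})' \left[ \boldsymbol{\Omega}_0^{-1} + (\mathbf{X}'\mathbf{X})^{-1} \right]^{-1} (\boldsymbol{\beta}_0 - \hat{\boldsymbol{\beta}}),
\end{aligned}
\tag{13.33}
$$

with $\nu_1 = \nu_0 + N$.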


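The posterior mean $\bar{\boldsymbol{\beta}}$ is obtained by using the matrix version of the "completing the square" operation. Specifically, given the $K \times 1$ vectors $\boldsymbol{\beta}$, $\hat{\boldsymbol{\beta}}$, $\boldsymbol{\beta}_0$, and $\bar{\boldsymbol{\beta}}$, and $K \times K$ symmetric nonsingular matrices $\mathbf{A}$ and $\mathbf{B}$, it can be shown that

$$
\begin{aligned}
(\boldsymbol{\beta} - \boldsymbol{\beta}_0)' \mathbf{A} (\boldsymbol{\beta} - \boldsymbol{\beta}_0) &+ (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})' \mathbf{B} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \\
&= (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})' (\mathbf{A} + \mathbf{B}) (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}}) + (\boldsymbol{\beta}_0 - \hat{\boldsymbol{\beta}})' \mathbf{A} (\mathbf{A} + \mathbf{B})^{-1} \mathbf{B} (\boldsymbol{\beta}_0 - \hat{\boldsymbol{\beta}}),
\end{aligned}
$$

where $\bar{\boldsymbol{\beta}} = (\mathbf{A} + \mathbf{B})^{-1} (\mathbf{A}\boldsymbol{\beta}_0 + \mathbf{B}\hat{\boldsymbol{\beta}})$.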

The joint posterior of $\boldsymbol{\beta}$ and $\sigma^2$ is of the same normal-gamma form as the prior.

The conditional posterior of $\boldsymbol{\beta}$ given $\sigma^2$ has mean $\bar{\boldsymbol{\beta}}$, a matrix-weighted average of the prior mean $\boldsymbol{\beta}_0$ and the sample estimate $\hat{\boldsymbol{\beta}}$.

In general using a conjugate prior is algebraically equivalent to augmenting the data with a sample from the same distribution. In this case the normal-gamma prior is equivalent to an additional sample of the same process with regression parameter estimate of $\boldsymbol{\beta}_0$, $\mathbf{X}'\mathbf{X}$ matrix equal to $\boldsymbol{\Omega}_0$, degrees-of-freedom parameter equal to $\nu_0$, and error sum of squares equal to $\nu_0 s_0^2$. Since $\boldsymbol{\Omega}_0$ is a fixed matrix, $\boldsymbol{\Omega}_0/N \to \mathbf{0}$ as $N \to \infty$, whereas $\mathbf{X}'\mathbf{X}/N$ converges to a matrix of constants. Hence $\bar{\boldsymbol{\beta}} \stackrel{p}{\to} \hat{\boldsymbol{\beta}}$, verifying that in large samples the ML estimator and the posterior mean are equivalent. The posterior variance $\boldsymbol{\Omega}_1^{-1}$ is proportional to $(\boldsymbol{\Omega}_0 + \mathbf{X}'\mathbf{X})^{-1}$. See Leamer (1978) for a more detailed exposition.
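The matrix-weighted average and its large-sample behavior can be illustrated numerically. This is a sketch under stated assumptions: $K = 2$, hypothetical values for the prior precision `omega0`, the prior mean `beta0`, and the OLS estimate `beta_hat`, and a hypothetical matrix `q` standing in for the limit of $\mathbf{X}'\mathbf{X}/N$. The small helper functions are written out only to keep the example self-contained.

```python
def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_add(a, b):
    """Elementwise sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def mat_vec(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [m[i][0] * v[0] + m[i][1] * v[1] for i in range(2)]


def posterior_mean(omega0, beta0, xtx, beta_hat):
    """beta_bar = (Omega0 + X'X)^{-1} (Omega0 beta0 + X'X beta_hat)."""
    rhs = [p + s for p, s in zip(mat_vec(omega0, beta0), mat_vec(xtx, beta_hat))]
    return mat_vec(inv2(mat_add(omega0, xtx)), rhs)


omega0 = [[4.0, 0.0], [0.0, 4.0]]   # hypothetical prior precision matrix
beta0 = [0.0, 0.0]                  # hypothetical prior mean
beta_hat = [1.0, 2.0]               # hypothetical OLS estimate
q = [[1.0, 0.3], [0.3, 1.0]]        # hypothetical limit of X'X / N
for n in (10, 100, 10000):
    xtx = [[n * q[i][j] for j in range(2)] for i in range(2)]
    print(n, posterior_mean(omega0, beta0, xtx, beta_hat))
```

As `n` grows the printed posterior mean moves from a compromise between `beta0` and `beta_hat` toward `beta_hat`, which is the sense in which the prior is dominated by the sample.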




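The marginal posterior of $\boldsymbol{\beta}$ is obtained by integrating $\sigma^2$ out of the joint posterior. This yields

$$p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \left[ \nu_1 s_1^2 + (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})' (\boldsymbol{\Omega}_0 + \mathbf{X}'\mathbf{X}) (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}}) \right]^{-(\nu_1+K)/2}, \tag{13.34}$$

with $\nu_1 = \nu_0 + N$; hence the marginal posterior is a multivariate Student $t$-distribution, one that is centered around $\bar{\boldsymbol{\beta}}$ rather than around $\hat{\boldsymbol{\beta}}$ as in the case of an uninformative prior.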

Because the conjugate prior treats the prior information like a previous sample from the same process, the sample and prior information are handled symmetrically even though the information from the two sources may be in conflict. Thus the mathematical convenience of using conjugate priors comes at a price. If the prior information and the sample information are apparently in conflict, the posterior distribution can be expected to be bimodal with the modes corresponding to sample and prior means. A prior distribution that allows one to capture such a feature is a prior that specifies that $\boldsymbol{\beta}$ has a multivariate Student $t$-density independent of $1/\sigma^2$ and $1/\sigma^2$ has a gamma prior distribution independent of $\boldsymbol{\beta}$. This has been called "Dickey's prior" (Leamer, 1978, p. 79). Under this assumption the marginal posterior is a product of two multivariate Student $t$-densities; this product can also be expressed as a mixture of two $t$-distributions. Such a distribution can potentially exhibit bimodality. Leamer (1978) has provided a more extensive analysis of this case.


13.3.3.  Mixed  Estimation 

We  seek  to  place  Bayesian  analysis  of  linear  regression  in  a  frequentist  setting. 

Frequentist  analysis  usually  incorporates  prior  information  as  equality  constraints, 
which  is  a  limiting  case  of  Bayesian  analysis  where  the  variance  parameters  in  the 
prior  go  to  zero.  Prior  information  that  is  instead  stochastic  can  also  be  incorporated 
into  frequentist  analysis,  by  using  mixed  estimation.  The  algebra  is  simple,  and  the 
approach  also  provides  an  intuitive  understanding  of  how  Bayesian  procedures  pool 
prior  and  sample  information. 

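We continue with the linear regression model under normality. Assume prior information for the regression parameters that $\boldsymbol{\beta} \sim \mathcal{N}[\mathbf{0}, \sigma_v^2 \mathbf{I}_K]$, where extension to nonzero mean is relatively easy. The prior information can be written as

$$\boldsymbol{\beta} = \mathbf{0} + \mathbf{v},$$

where $\mathbf{v}$ is a $K \times 1$ error with $\mathbf{v} \sim \mathcal{N}[\mathbf{0}, \sigma_v^2 \mathbf{I}_K]$. Now augment the sample information $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ by this prior, and write the full model as an augmented regression model

$$
\begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix}
= \begin{bmatrix} \mathbf{X} \\ \mathbf{I}_K \end{bmatrix} \boldsymbol{\beta}
+ \begin{bmatrix} \mathbf{u} \\ -\mathbf{v} \end{bmatrix}.
$$

This can be reparameterized as

$$
\begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix}
= \begin{bmatrix} \mathbf{X} \\ \dfrac{\sigma}{\sigma_v} \mathbf{I}_K \end{bmatrix} \boldsymbol{\beta}
+ \begin{bmatrix} \mathbf{u} \\ -\dfrac{\sigma}{\sigma_v} \mathbf{v} \end{bmatrix}
= \begin{bmatrix} \mathbf{X} \\ \lambda \mathbf{I}_K \end{bmatrix} \boldsymbol{\beta}
+ \begin{bmatrix} \mathbf{u} \\ \mathbf{v}^* \end{bmatrix},
\tag{13.35}
$$

where $\lambda = \sigma/\sigma_v$ and the transformation $\mathbf{v}^* = -\lambda\mathbf{v}$ has been used so that all errors have common variance $\sigma^2$.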

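The estimator based on this augmented data set is a pooled estimator or a mixed estimator. Conditional on $\lambda$, the mixed estimator is

$$
\begin{aligned}
\hat{\boldsymbol{\beta}}_\lambda &= [\mathbf{X}'\mathbf{X} + \lambda^2 \mathbf{I}_K]^{-1} \mathbf{X}'\mathbf{y} \\
&= [\mathbf{X}'\mathbf{X} (\mathbf{I}_K + \lambda^2 (\mathbf{X}'\mathbf{X})^{-1})]^{-1} \mathbf{X}'\mathbf{y} \\
&= [\mathbf{I}_K + \lambda^2 (\mathbf{X}'\mathbf{X})^{-1}]^{-1} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y} \\
&= \mathbf{A}_\lambda \hat{\boldsymbol{\beta}},
\end{aligned}
\tag{13.36}
$$

where $\mathbf{A}_\lambda = [\mathbf{I}_K + \lambda^2 (\mathbf{X}'\mathbf{X})^{-1}]^{-1}$, and $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$ is the unrestricted OLS estimator.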

This estimator is the so-called ridge-regression estimator introduced without a Bayesian justification by Hoerl and Kennard (1970) to combat the problem of multicollinearity in small samples. This estimator also belongs to a class of shrinkage estimators, in which the estimator is shrunk toward (or pulled toward) a prior mean, in this case the zero vector. This sometimes makes some sense in a finite sample with highly multicollinear data where the "$t$-ratios" tend to zero, making it difficult to distinguish between variables whose coefficients are truly close to zero and those that only appear to be that way. In the limit shrinkage leads to variable exclusion.

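Several features of $\hat{\boldsymbol{\beta}}_\lambda$ are noteworthy: (1) Conditional on $\lambda$, $\hat{\boldsymbol{\beta}}_\lambda$ is the mean of a posterior distribution of $\boldsymbol{\beta}$. (2) The estimator is a matrix-weighted average of the zero vector and $\hat{\boldsymbol{\beta}}$. (3) The algebra changes very little if we choose to shrink the estimator toward some nonzero $\boldsymbol{\beta}$, say $\boldsymbol{\beta}_0$. Then the resulting estimator is a matrix-weighted average of vectors $\boldsymbol{\beta}_0$ and $\hat{\boldsymbol{\beta}}$.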
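The equivalence between OLS on the augmented data in (13.35) and the ridge formula in (13.36) can be verified directly. The sketch below assumes $K = 2$ (an intercept and one regressor), a small made-up data set, and a hypothetical value of $\lambda$; the function names are illustrative.

```python
def normal_equations(rows, y, ridge_penalty=0.0):
    """Solve (X'X + penalty * I) b = X'y for two regressors."""
    a = sum(r[0] * r[0] for r in rows) + ridge_penalty
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows) + ridge_penalty
    s0 = sum(r[0] * yi for r, yi in zip(rows, y))
    s1 = sum(r[1] * yi for r, yi in zip(rows, y))
    det = a * d - b * b
    return [(d * s0 - b * s1) / det, (a * s1 - b * s0) / det]


def mixed_estimator(rows, y, lam):
    """OLS on the sample augmented with the K = 2 pseudo-observations [lam * I_K, 0]."""
    aug_rows = [list(r) for r in rows] + [[lam, 0.0], [0.0, lam]]
    aug_y = list(y) + [0.0, 0.0]
    return normal_equations(aug_rows, aug_y)


rows = [[1.0, x] for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]  # made-up regressors
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]                         # made-up outcomes
lam = 1.5                                                  # hypothetical sigma / sigma_v
ols = normal_equations(rows, y)
ridge = normal_equations(rows, y, ridge_penalty=lam ** 2)
mixed = mixed_estimator(rows, y, lam)
print(ols)     # unrestricted OLS
print(ridge)   # (X'X + lam^2 I)^{-1} X'y
print(mixed)   # OLS on augmented data; should match the ridge line
```

Treating the prior as $K$ extra observations means any OLS routine can compute the mixed estimator, which is why the approach fits naturally into a frequentist workflow.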

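The symmetric weighting matrix $\mathbf{A}_\lambda = [\mathbf{I}_K + (\lambda^2/N)(N^{-1}\mathbf{X}'\mathbf{X})^{-1}]^{-1} \to \mathbf{I}_K$ as $N \to \infty$, since $\lambda^2/N \to 0$. Therefore,

$$\hat{\boldsymbol{\beta}}_\lambda \to \hat{\boldsymbol{\beta}} \quad \text{as } N \to \infty,$$

so the effect of the prior on the posterior mean vanishes as the sample becomes large. Similarly, the conditional posterior variance of $\hat{\boldsymbol{\beta}}_\lambda$ is given by

$$
\begin{aligned}
\mathrm{V}[\hat{\boldsymbol{\beta}}_\lambda] &= \mathbf{A}_\lambda \mathrm{V}[\hat{\boldsymbol{\beta}}] \mathbf{A}_\lambda' \\
&= \sigma^2 \mathbf{A}_\lambda (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}_\lambda',
\end{aligned}
$$

so $\mathrm{V}[\hat{\boldsymbol{\beta}}_\lambda] \to \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$ as the sample size $N \to \infty$.

For finite samples, conditional on $\lambda$ and $\sigma^2$, the conditional posterior distribution of $\hat{\boldsymbol{\beta}}_\lambda$ is

$$\hat{\boldsymbol{\beta}}_\lambda | \lambda, \sigma^2 \sim \mathcal{N}[\mathbf{A}_\lambda \hat{\boldsymbol{\beta}}, \sigma^2 \mathbf{A}_\lambda (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}_\lambda']. \tag{13.37}$$

The marginal posterior distribution of $\hat{\boldsymbol{\beta}}_\lambda$ is obtained by integrating out $\lambda$ and $\sigma^2$. Treating $\lambda$ as given, and assuming a vague or uninformative prior on $\sigma^2$, we can integrate out $\sigma^2$ as was shown in Section 13.3.1. This integration operation is analytically feasible and yields a marginal posterior of $\hat{\boldsymbol{\beta}}_\lambda$ that is the multivariate Student $t$-distribution. Finally, we can specify a prior distribution on $\lambda$, possibly a gamma prior since $\lambda > 0$, and proceed to integrate it out. However, $\lambda$ enters the conditional posterior in an awkward fashion and cannot be integrated out analytically. At this stage we would need to resort to a numerical technique. Assuming that this is accomplished, we then have a Bayesian treatment of this model.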


13.3.4.  Hierarchical  Priors 

We  consider  a  three-stage  linear  regression  model  that  is  hierarchical  in  regression 
parameters  but  not  in  variance  parameters. 

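The first stage is a linear regression model denoted $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{u}$, where the subscript 1 is added to distinguish between first- and second-stage parameters and regressors. The parameters $\boldsymbol{\beta}_1$ are random and are modeled to depend on both parameters and data, so $\boldsymbol{\beta}_1 = \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{v}$. For example, the first level models individual student test performance and the second level brings in school characteristics. The errors are assumed to be normally distributed. The second-level parameters $\boldsymbol{\beta}_2$ are treated as unknown and a prior is specified. A prior is also specified for the variance parameter $\sigma_1^2$ in the first-stage model.

Assuming normally distributed errors and using conjugate priors leads to the following model:

$$\mathbf{y} | \mathbf{X}_1, \boldsymbol{\beta}_1, \sigma_1^2 \sim \mathcal{N}[\mathbf{X}_1\boldsymbol{\beta}_1, \sigma_1^2 \mathbf{I}_N], \tag{13.38}$$

$$\boldsymbol{\beta}_1 | \mathbf{X}_2, \boldsymbol{\beta}_2, \boldsymbol{\Sigma}_2 \sim \mathcal{N}[\mathbf{X}_2\boldsymbol{\beta}_2, \boldsymbol{\Sigma}_2], \tag{13.39}$$

$$\boldsymbol{\beta}_2 \sim \mathcal{N}[\boldsymbol{\beta}^*, \boldsymbol{\Sigma}^*], \tag{13.40}$$

$$\sigma_1^{-2} | \nu^*, \sigma^{*2} \sim \mathcal{G}[\nu^*/2, \nu^*\sigma^{*2}/2], \tag{13.41}$$

where $\mathbf{X}_1$ is $N \times K$, $\mathbf{X}_2$ is $K \times M$, $\boldsymbol{\beta}_1$ is $K \times 1$, $\boldsymbol{\beta}_2$ is $M \times 1$, $\boldsymbol{\Sigma}_2$ is $K \times K$, $\boldsymbol{\beta}^*$ is $M \times 1$, and $\boldsymbol{\Sigma}^*$ is $M \times M$. For the regression parameter $\boldsymbol{\beta}_1$ the second line gives the prior, and the third line gives the subsequent second-stage prior, or a prior on a prior, for $\boldsymbol{\beta}_2$ (while $\boldsymbol{\Sigma}_2$ is assumed known). The parameters $(\boldsymbol{\beta}^*, \boldsymbol{\Sigma}^*)$ are often referred to as hyperparameters. For variance parameters, the fourth line gives a prior for the variance parameter $\sigma_1^2$ with $\nu^*$ and $\sigma^{*2}$ specified. The innovation is the addition of (13.40).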

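Note that we can collapse the stages and convert this into a two-level model. Specifically, we can write a two-stage model with an informative prior in one of two ways, either

$$
\begin{aligned}
\mathbf{y} | \mathbf{X}_1, \boldsymbol{\beta}_1, \sigma_1^2 &\sim \mathcal{N}[\mathbf{X}_1\boldsymbol{\beta}_1, \sigma_1^2 \mathbf{I}_N], \\
\boldsymbol{\beta}_1 | \mathbf{X}_2, \boldsymbol{\Sigma}_2 &\sim \mathcal{N}[\mathbf{X}_2\boldsymbol{\beta}^*, \boldsymbol{\Sigma}_2 + \mathbf{X}_2\boldsymbol{\Sigma}^*\mathbf{X}_2'],
\end{aligned}
$$

or

$$
\begin{aligned}
\mathbf{y} | \mathbf{X}_1, \mathbf{X}_2, \boldsymbol{\beta}_2, \boldsymbol{\Sigma}_2, \sigma_1^2 &\sim \mathcal{N}[\mathbf{X}_1\mathbf{X}_2\boldsymbol{\beta}_2, \sigma_1^2 \mathbf{I}_N + \mathbf{X}_1\boldsymbol{\Sigma}_2\mathbf{X}_1'], \\
\boldsymbol{\beta}_2 &\sim \mathcal{N}[\boldsymbol{\beta}^*, \boldsymbol{\Sigma}^*].
\end{aligned}
$$

If $\sigma_1^2$ were given, this setup corresponds to conditionally conjugate normal priors. Using results introduced earlier we can derive expressions for the posterior means of either $\boldsymbol{\beta}_1$ or $\boldsymbol{\beta}_2$ as matrix-weighted averages of either $\boldsymbol{\beta}^*$ and $\hat{\boldsymbol{\beta}}_1$ or of $\boldsymbol{\beta}^*$ and $\hat{\boldsymbol{\beta}}_2$.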

The use of the normal distribution is only illustrative. Hierarchical models for generalized linear models, members of the linear exponential family, have been widely used (Albert, 1988).

In hierarchical models it may not be possible to obtain the full posterior probability distribution of first-stage parameters such as $\boldsymbol{\beta}_1$ in an analytically tractable form. Fortunately, the advances in computational methods presented in the next section are especially well suited to models with a hierarchical structure.

Another approach, which is an application of the empirical Bayes method, involves estimation of parameters in the higher stage priors, similar to that in the likelihood approach. This approach avoids, for example, assuming that $\boldsymbol{\Sigma}_2$ and $\boldsymbol{\Sigma}^*$ are known matrices.


13.3.5.  Multivariate  t-  and  Wishart  Distributions 


Bayesian  analysis  makes  use  of  a  wider  range  of  distributions  than  classical  analysis. 
Here  we  present  details  on  two  multivariate  distributions  that  are  used  in  Bayesian 
analysis  of  linear  regression  under  normality. 

The multivariate $t$-distribution is a multivariate extension of the univariate Student $t$. It is similar to the multivariate normal, except that the tails of the distribution can be considerably fatter. In Bayesian analysis it arises as the marginal posterior for $\boldsymbol{\beta}$ given a conjugate normal prior (see Section 13.3.2) or can be used directly as the prior for $\boldsymbol{\beta}$ if tails fatter than the normal are desired. A $q \times 1$ random variable $\mathbf{t}$ that is multivariate Student-$t$ distributed with degrees-of-freedom parameter $\nu$, mean parameters $\boldsymbol{\mu}$, and dispersion parameters $\boldsymbol{\Sigma}$, has joint density

$$
f_t(\mathbf{t} | \nu, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\Gamma((\nu + q)/2)}{\Gamma(\nu/2) (\pi\nu)^{q/2} |\boldsymbol{\Sigma}|^{1/2}} \left[ 1 + \frac{1}{\nu} (\mathbf{t} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{t} - \boldsymbol{\mu}) \right]^{-(\nu+q)/2},
$$

where $\Gamma(\cdot)$ is the gamma function. This distribution is symmetric with mode $\boldsymbol{\mu}$, mean $\boldsymbol{\mu}$ if $\nu > 1$, and variance $[\nu/(\nu - 2)]\boldsymbol{\Sigma}$ if $\nu > 2$. The tails can be much fatter than the normal (e.g., the variance is $3\boldsymbol{\Sigma}$ if $\nu = 3$) and the normal is obtained as $\nu \to \infty$. If $\mathbf{z} \sim \mathcal{N}[\mathbf{0}, \mathbf{I}_q]$ and $s \sim \chi^2(\nu)$ then $\mathbf{t} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2}\mathbf{z}/\sqrt{s/\nu}$ has the multivariate $t$-distribution given here, providing an easy way to obtain draws.
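The draw construction just described can be sketched in a few lines. The example assumes $q = 2$, uses the lower-triangular Cholesky factor as the matrix square root (any factor $\mathbf{L}$ with $\mathbf{L}\mathbf{L}' = \boldsymbol{\Sigma}$ would serve), and picks illustrative values for $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\nu$. The function names are hypothetical.

```python
import math
import random


def cholesky_2x2(sigma):
    """Lower-triangular factor L with L L' = sigma, for a 2x2 positive definite matrix."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    return l11, l21, l22


def draw_multivariate_t(mu, sigma, nu, rng):
    """One draw t = mu + L z / sqrt(s / nu), with z ~ N[0, I_2] and s ~ chi-square(nu)."""
    l11, l21, l22 = cholesky_2x2(sigma)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    s = rng.gammavariate(nu / 2.0, 2.0)  # chi-square(nu) as Gamma(nu/2, scale 2)
    scale = math.sqrt(s / nu)
    return (mu[0] + l11 * z1 / scale, mu[1] + (l21 * z1 + l22 * z2) / scale)


rng = random.Random(1)
mu, sigma, nu = (1.0, -1.0), [[1.0, 0.5], [0.5, 2.0]], 10  # illustrative values
draws = [draw_multivariate_t(mu, sigma, nu, rng) for _ in range(40000)]
n = len(draws)
mean1 = sum(d[0] for d in draws) / n
mean2 = sum(d[1] for d in draws) / n
var1 = sum((d[0] - mean1) ** 2 for d in draws) / (n - 1)
cov12 = sum((d[0] - mean1) * (d[1] - mean2) for d in draws) / (n - 1)
print(mean1, mean2)   # should be close to mu
print(var1, cov12)    # compare with [nu / (nu - 2)] * sigma, i.e. 1.25 and 0.625
```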

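The Wishart distribution is a multivariate extension of the univariate chi-square distribution, or more generally the gamma distribution. In Bayesian analysis it is used as the conjugate prior for the inverse of the covariance matrix of a multivariate normal distribution. A $q \times q$ random positive definite matrix $\mathbf{W}$ that is Wishart distributed with degrees of freedom parameter $\nu > q$ and scale matrix $\mathbf{S}$ has joint density

$$
f_W(\mathbf{W} | \nu, \mathbf{S}) = \left[ 2^{\nu q/2} \pi^{q(q-1)/4} \prod_{j=1}^{q} \Gamma\left( \frac{\nu + 1 - j}{2} \right) \right]^{-1} |\mathbf{S}|^{-\nu/2} |\mathbf{W}|^{(\nu-q-1)/2} \exp\left( -\mathrm{tr}(\mathbf{S}^{-1}\mathbf{W})/2 \right),
$$

where $\Gamma(\cdot)$ is the gamma function and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. This distribution has mean $\nu\mathbf{S}$. The sample covariance matrix for iid multivariate normal data is Wishart distributed. More generally, given $\nu$ $(> q)$ independent $q \times 1$ vectors $\mathbf{x}_j \sim \mathcal{N}[\mathbf{0}, \mathbf{S}]$, $j = 1, \ldots, \nu$, then $\sum_{j=1}^{\nu} \mathbf{x}_j\mathbf{x}_j'$ is Wishart distributed. If $\mathbf{W}^{-1}$ is Wishart distributed with density $f_W(\mathbf{W}^{-1} | \nu, \mathbf{S})$ then $\mathbf{W}$ is inverse-Wishart distributed with density

$$
f_{IW}(\mathbf{W} | \nu, \mathbf{S}) = \left[ 2^{\nu q/2} \pi^{q(q-1)/4} \prod_{j=1}^{q} \Gamma\left( \frac{\nu + 1 - j}{2} \right) \right]^{-1} |\mathbf{S}|^{-\nu/2} |\mathbf{W}|^{-(\nu+q+1)/2} \exp\left( -\mathrm{tr}(\mathbf{S}^{-1}\mathbf{W}^{-1})/2 \right).
$$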


13.4.  Monte  Carlo  Integration 

In  many  modeling  situations  the  posterior  distribution  of  the  parameters  of  interest  is 
analytically  intractable.  In  such  cases  numerical  methods  are  needed  to  estimate  either 
the  full  posterior  distribution  or  some  key  moments  of  this  distribution  such  as  the 
posterior  mean. 

In this section we consider computation of key posterior moments, without explicitly obtaining the posterior distribution. The methods of Chapter 12 can be applied,
with  potentially  less  computational  burden  since  the  integral  needs  to  be  computed 
once  for  the  entire  sample  rather  than  for  every  individual  at  every  iteration.  In  the 
subsequent  section  we  present  methods  to  simulate  the  posterior  distribution. 


13.4.1.  Importance  Sampling 

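Suppose the problem is to evaluate the posterior moment function $\mathrm{E}[m(\boldsymbol{\theta}) | \mathbf{y}]$, where expectation is with respect to the posterior density $p(\boldsymbol{\theta} | \mathbf{y})$. We wish to compute

$$\mathrm{E}[m(\boldsymbol{\theta})] = \int_{R(\boldsymbol{\theta})} m(\boldsymbol{\theta}) p(\boldsymbol{\theta} | \mathbf{y})\, d\boldsymbol{\theta}. \tag{13.42}$$

For example, the posterior mean of the $k$th parameter is $\mathrm{E}[\theta_k] = \int \theta_k p(\boldsymbol{\theta} | \mathbf{y})\, d\boldsymbol{\theta}$. Other examples include posterior standard deviations, marginal posterior densities, posterior intervals, and posterior expectations of a given function of parameters.

From Chapter 12 a direct Monte Carlo integral estimate of $\mathrm{E}[m(\boldsymbol{\theta})]$ is $\hat{\mathrm{E}}[m(\boldsymbol{\theta})] = S^{-1} \sum_s m(\boldsymbol{\theta}^s)$, where $\boldsymbol{\theta}^s$, $s = 1, \ldots, S$, are $S$ draws of $\boldsymbol{\theta}$ from the posterior density $p(\boldsymbol{\theta} | \mathbf{y})$. However, this estimate is infeasible in the current Bayesian setting if there is no closed-form solution for the posterior density defined formally in (13.1), as then it is not possible to make draws from the posterior $p(\boldsymbol{\theta} | \mathbf{y})$. Instead, we use importance sampling, introduced in Section 12.7.2. The integral considered in (13.42) can be rewritten as

$$\mathrm{E}[m(\boldsymbol{\theta})] = \int_{R(\boldsymbol{\theta})} \left( \frac{m(\boldsymbol{\theta}) p(\boldsymbol{\theta} | \mathbf{y})}{g(\boldsymbol{\theta})} \right) g(\boldsymbol{\theta})\, d\boldsymbol{\theta}, \tag{13.43}$$

where $g(\boldsymbol{\theta}) > 0$ is a known density function, with the same support as $p(\boldsymbol{\theta} | \mathbf{y})$, that is easy to make draws from. The corresponding Monte Carlo integral estimate is

$$\hat{\mathrm{E}}[m(\boldsymbol{\theta})] = \frac{1}{S} \sum_{s=1}^{S} \frac{m(\boldsymbol{\theta}^s) p(\boldsymbol{\theta}^s | \mathbf{y})}{g(\boldsymbol{\theta}^s)},$$

where $\boldsymbol{\theta}^s$, $s = 1, \ldots, S$, are $S$ draws of $\boldsymbol{\theta}$ from the importance sampling density $g(\boldsymbol{\theta})$ rather than from the original target density $p(\boldsymbol{\theta} | \mathbf{y})$. Note that the requirement that $p(\boldsymbol{\theta} | \mathbf{y})$ and $g(\boldsymbol{\theta})$ should have the same support is potentially problematic if $p(\boldsymbol{\theta} | \mathbf{y})$ depends on additional parameters or if the functional form of the full conditional densities is known but that of the marginal posterior is not.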

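Application to the posterior density additionally needs to account for the constant of integration in the denominator of (13.1). Let $p^{\mathrm{ker}}(\boldsymbol{\theta} | \mathbf{y})$ denote the kernel of the posterior density, where $p^{\mathrm{ker}}(\boldsymbol{\theta} | \mathbf{y}) = L(\mathbf{y} | \boldsymbol{\theta}) \pi(\boldsymbol{\theta})$ or a multiple of this quantity. However, for notational simplicity the dependence on $\mathbf{y}$ is suppressed in what follows. The posterior density is then

$$p(\boldsymbol{\theta}) = \frac{p^{\mathrm{ker}}(\boldsymbol{\theta})}{\int p^{\mathrm{ker}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}},$$

with corresponding posterior moment

$$
\begin{aligned}
\mathrm{E}[m(\boldsymbol{\theta})] &= \int m(\boldsymbol{\theta}) \left( \frac{p^{\mathrm{ker}}(\boldsymbol{\theta})}{\int p^{\mathrm{ker}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \right) d\boldsymbol{\theta} \\
&= \frac{\int m(\boldsymbol{\theta}) p^{\mathrm{ker}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}}{\int p^{\mathrm{ker}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \\
&= \frac{\int \left( m(\boldsymbol{\theta}) p^{\mathrm{ker}}(\boldsymbol{\theta}) / g(\boldsymbol{\theta}) \right) g(\boldsymbol{\theta})\, d\boldsymbol{\theta}}{\int \left( p^{\mathrm{ker}}(\boldsymbol{\theta}) / g(\boldsymbol{\theta}) \right) g(\boldsymbol{\theta})\, d\boldsymbol{\theta}}.
\end{aligned}
$$

The importance sampling-based estimate of the posterior moment $\mathrm{E}[m(\boldsymbol{\theta})]$ is then

$$
\hat{\mathrm{E}}[m(\boldsymbol{\theta})] = \frac{\frac{1}{S} \sum_{s=1}^{S} m(\boldsymbol{\theta}^s) p^{\mathrm{ker}}(\boldsymbol{\theta}^s) / g(\boldsymbol{\theta}^s)}{\frac{1}{S} \sum_{s=1}^{S} p^{\mathrm{ker}}(\boldsymbol{\theta}^s) / g(\boldsymbol{\theta}^s)},
\tag{13.44}
$$

where $\boldsymbol{\theta}^s$, $s = 1, \ldots, S$, are $S$ draws of $\boldsymbol{\theta}$ from the importance sampling density $g(\boldsymbol{\theta})$.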
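The ratio estimate in (13.44) can be sketched for a scalar parameter. Everything specific here is an assumption made for illustration: the posterior kernel is taken to be that of a $\mathcal{N}[1, 1]$ density with its normalizing constant dropped, and the importance sampling density is a Student $t$ with 3 degrees of freedom, location 1, and scale 1.5, chosen so that its tails are thicker than those of the target. The function names are hypothetical.

```python
import math
import random


def target_kernel(theta):
    """Unnormalized posterior kernel; here the N(1, 1) kernel (illustrative choice)."""
    return math.exp(-0.5 * (theta - 1.0) ** 2)


def t_density(theta, loc, scale, nu):
    """Density of a location-scale Student t with nu degrees of freedom."""
    z = (theta - loc) / scale
    const = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * scale)
    return const * (1 + z * z / nu) ** (-(nu + 1) / 2)


def draw_t(loc, scale, nu, rng):
    """Draw from the location-scale Student t as normal / sqrt(chi-square / nu)."""
    return loc + scale * rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(nu / 2, 2.0) / nu)


def importance_estimate(m, n_draws, rng, loc=1.0, scale=1.5, nu=3.0):
    """Ratio estimate: sum of m * w over sum of w, with weights w = kernel / g."""
    num = den = 0.0
    for _ in range(n_draws):
        theta = draw_t(loc, scale, nu, rng)
        w = target_kernel(theta) / t_density(theta, loc, scale, nu)
        num += m(theta) * w
        den += w
    return num / den


rng = random.Random(2)
post_mean = importance_estimate(lambda th: th, 50000, rng)
post_second = importance_estimate(lambda th: th * th, 50000, rng)
print(post_mean, post_second)  # the kernel is that of N(1, 1), so roughly 1 and 2
```

Because the same weights appear in numerator and denominator, neither the constant of the kernel nor that of $g$ needs to be known for the ratio to be computed.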

This method was proposed by Kloek and van Dijk (1978). Geweke (1989) established consistency and asymptotic normality under some regularity conditions. These conditions include the assumptions that the importance sampling density $g(\boldsymbol{\theta}) > 0$ over the support $R(\boldsymbol{\theta})$ of $p(\boldsymbol{\theta})$; that $\mathrm{E}[m(\boldsymbol{\theta})] < \infty$, so the posterior moment exists; and that $\int p(\boldsymbol{\theta} | \mathbf{y})\, d\boldsymbol{\theta} = 1$, so the posterior density is proper. As previously noted, usually we work with the kernel $p^{\mathrm{ker}}(\boldsymbol{\theta} | \mathbf{y}) = L(\mathbf{y} | \boldsymbol{\theta}) \pi(\boldsymbol{\theta})$, which need not integrate to one. The prior $\pi(\boldsymbol{\theta})$ need not be proper, but to ensure that $\int p(\boldsymbol{\theta} | \mathbf{y})\, d\boldsymbol{\theta} = 1$ it is necessary that $\int \pi(\boldsymbol{\theta})\, d\boldsymbol{\theta} < \infty$.

The importance sampling approach is simple, but implementation entails subtleties well explained in Geweke (1989). A critical requirement is that $g(\boldsymbol{\theta})$ should have thicker tails than $p(\boldsymbol{\theta} | \mathbf{y})$, to ensure that the importance weight $w(\boldsymbol{\theta}) = p(\boldsymbol{\theta} | \mathbf{y}) / g(\boldsymbol{\theta})$ remains bounded. In view of the asymptotic normality of the log posterior, a good choice of $g(\boldsymbol{\theta})$ is a multivariate $t$-distribution, with the mean set to the posterior mode, the covariance matrix proportional to the inverse of the Hessian of the log of the posterior, and degrees of freedom set to a value sufficiently small to ensure thick tails. Geweke (1989) also provides a measure, called the relative numerical efficiency, that estimates the number of replications required to achieve a given level of precision of $\hat{\mathrm{E}}[m(\boldsymbol{\theta})]$ computed using draws from $g(\boldsymbol{\theta})$ relative to the number of replications needed if draws from $p(\boldsymbol{\theta} | \mathbf{y})$ were possible. From Chapter 12, for a higher dimensional integral more simulation draws are required to get a good approximation to the integral and one might additionally use simulation acceleration methods presented in Chapter 12, such as antithetic sampling.

The importance sampling method uses each draw $\boldsymbol{\theta}^s$ from the sampling density $g(\boldsymbol{\theta})$ with equal probability. A more efficient approximation would weight the draws according to how close $g(\boldsymbol{\theta}^s)$ is to the target $p(\boldsymbol{\theta}^s | \mathbf{y})$. This can be done by importance resampling (see Gelman et al., 1995).

The importance sampling method can be used to provide many useful summary measures of the posterior, as presented in Section 13.2.5. This includes estimates of the quantiles and percentiles of the posterior, permitting calculation of 95% posterior intervals and plots of the posterior density of $\theta_k$.
13.5. Markov Chain Monte Carlo Simulation

A modern idea in Bayesian analysis is that rather than concentrating on the estimation of key summary measures of the posterior distribution (see the previous section) it is desirable to obtain a large sample from the posterior distribution. Then the summary statistics of this sample from the posterior will provide desired information about the moment characteristics of the sample of estimates and about other interesting associated measures such as marginal distributions of parameters or functions of parameters. For example, given $S$ draws from the posterior distribution, $\mathrm{E}[\theta_k]$ can be estimated by $S^{-1} \sum_s \theta_k^s$.

The challenge is to make draws from the joint posterior distribution when there is no tractable closed-form expression for the posterior density. If a suitable density exists for computation of posterior moments using importance sampling, then it might also be suitable for making draws from the posterior using the accept-reject method presented in Section 12.8. However, this method can be very inefficient as a high percentage of draws may be rejected.

Instead, sequential draws are made yielding simulated values that, if the sequence is run long enough, converge to a stationary distribution that coincides with the target posterior density $p(\boldsymbol{\theta} | \mathbf{y})$. The method is called Markov chain Monte Carlo (MCMC), because it involves simulation (Monte Carlo) and the sequence is that of a Markov chain. After convergence of the chain, $S$ sequential draws can be used to compute summary measures for the posterior, such as estimating $\mathrm{E}[\theta_k]$ by $\hat{\mathrm{E}}[\theta_k] = S^{-1} \sum_s \theta_k^s$. The draws are positively correlated, however, so the precision of the estimate will be reduced for given $S$ because its estimated variance will exceed the usual $(S-1)^{-1} \sum_s (\theta_k^s - \hat{\mathrm{E}}[\theta_k])^2$.

The sequential method entails constructing a Markov chain. Two widely used algorithms are the Gibbs sampler and the Metropolis-Hastings algorithm, the former being a special case of the latter; see Hastings (1970). Excellent detailed treatments of the subject can be found in Gelman et al. (1995), Gamerman (1997), and Robert and Casella (1999). What follows is a bare-bones sketch.


13.5.1.  Markov  Chains 


Before presenting the Gibbs sampler and the Metropolis-Hastings algorithm we provide some key definitions and concepts used in the MCMC literature. These definitions are given in the context of a model with discrete states. They can be extended to the continuous state model, relevant to applications where the posterior is continuous in the parameters.

A Markov chain is defined as a sequence of random variables $x_n$ $(n = 0, 1, 2, \ldots)$, where $x_n$ takes values in a finite space $A$, together with a transition kernel $K(\cdot)$ that defines the probability that $x_n$ equals a particular value given previous values $x_{n-j}$. We consider a Markov chain with the property that

$$\Pr[x_{n+1} = x | x_n, x_{n-1}, \ldots, x_0] = \Pr[x_{n+1} = x | x_n], \tag{13.45}$$

so that the distribution of $x_{n+1}$ given the past is completely determined only by the preceding value $x_n$. The transition kernel is a transition matrix $\mathbf{T}$ with element

$$t_{xy} = \Pr[x_{n+1} = y | x_n = x], \tag{13.46}$$

which informally is the probability of transition from $x$ to $y$. For a finite-state Markov chain the set $A$ of values (or states) that $x_n$ may take is finite with, say, $m$ elements. Then

$$
\mathbf{T} = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & & \vdots \\ t_{m1} & \cdots & t_{mm} \end{bmatrix},
\tag{13.47}
$$

with $\sum_{j=1}^{m} t_{ij} = 1$, $i = 1, \ldots, m$.

Now consider the transition from $x$ to $y$ in $n$ steps (stages). The transition probability is given by $\mathbf{T}^n$, the $n$-times matrix product of $\mathbf{T}$. The rows of the matrix $\mathbf{T}^n$ give the marginal distribution across the $m$ states at the $n$th stage, and the $j$th row vector $\mathbf{t}_j^{(n)} = (t_{j1}^{(n)}, \ldots, t_{jm}^{(n)})$ gives the marginal distribution of transition probabilities from state $j$ to the other states at stage $n$. If the initial distribution of transition probabilities is denoted $\mathbf{t}^{(0)}$, then $\mathbf{t}^{(n)} = \mathbf{t}^{(0)}\mathbf{T}^n = \mathbf{t}^{(n-1)}\mathbf{T}$. So the marginal distribution of transition probabilities at the $n$th stage is determined solely by the initial distribution and the transition matrix.

In the Markov simulation context, the asymptotic behavior of the chain as $n \to \infty$ is of interest. The chain is said to yield a stationary distribution or invariant distribution $\{t_x\}$ under the transition probabilities $T_{x,y}$ if

$$\sum_{x \in A} t_x T_{x,y} = t_y \quad \forall y \in A, \tag{13.48}$$

where transition is from state $x$ to $y$. Then applying the transition matrix leads to no change in the marginal distribution of transition probabilities. The existence and uniqueness of a stationary distribution is an important issue.

If the stationary distribution exists, and if $\lim_{n \to \infty} t_x \mathbf{T}^n = t_y$, then the chain will asymptotically approach $t_y$ independently of the initial distribution. In this sense $t_y$ is a limiting distribution. Although here the stationary distribution is defined for a finite-state Markov chain, MCMC methods can handle Markov chains that are not finite state; see Gilks, Richardson, and Spiegelhalter (1996, pp. 60-61).
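A small numerical illustration of a limiting distribution may help. The three-state transition matrix below is a made-up example, not one taken from the text; its stationary distribution is $(0.25, 0.5, 0.25)$, as can be checked from (13.48). The sketch raises $\mathbf{T}$ to a high power and shows that every row approaches the same vector, which is then left unchanged by a further application of $\mathbf{T}$.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(t, n):
    """n-step transition matrix T^n by repeated multiplication."""
    size = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, t)
    return result


def step(dist, t):
    """One application of the transition matrix to a row vector: dist T."""
    return [sum(dist[x] * t[x][y] for x in range(len(dist))) for y in range(len(dist))]


# Hypothetical transition matrix; each row sums to one.
T = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
T50 = mat_power(T, 50)
stationary = T50[0]
for row in T50:
    print(row)                # every row is (numerically) the same distribution
print(step(stationary, T))    # applying T once more leaves it unchanged
```

Since the rows of `T50` coincide, the distribution reached after many steps does not depend on the starting state, which is the property MCMC methods rely on.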

A  state  y  may  be  recurrent  or  transient.  A  recurrent  state  is  one  that  will  be  revis¬ 
ited  with  probability  one,  and  a  transient  state  is  one  that  will  not  be  revisited  with 
some  positive  probability. 

For  Bayesian  applications  the  goal  is  to  obtain  draws  from  the  posterior  p{9).  Ap¬ 
plying  a  Markov  chain  to  obtain  these  draws,  the  initial  value  of  a  parameter  vector, 
9{0)  (which  is  analogous  to  the  distribution  of  states),  is  assigned  or  sampled  from 
the  transition  kernel.  Using  a  suitable  method  of  drawing  pseudo-random  numbers,  a 
new  vector  of  values  9{l)  is  drawn  from  the  transition  kernel  evaluated  at  0>0) ,  that  is, 
K (8'ir>).  At  the  nth  stage  the  draws  are  from  a  transition  kernel  K(9<n  '  ’)  and  so  forth. 
The  Markov  chain  used  is  one  such  that  as  n  oo  the  limiting  distribution  is  the  pos¬ 
terior  p(6).  Once  convergence  to  the  limiting  distribution  occurs  all  subsequent  draws 
are  also  from  this  distribution,  though  they  will  be  correlated. 

These  ideas  provide  the  intuitive  basis  for  a  class  of  MCMC  procedures  that  can  be 
used  to  recover  Bayesian  posterior  distributions  for  many  different,  and  possibly  high¬ 
dimensional,  models  such  as,  for  example,  the  linear  hierarchical  models  discussed  in 
Section  13.3.4.  Provided  that  one  specifies  a  transition  kernel  K{9("~  ’,  •)  from  which 
draws  of  9  can  be  made  and  within  which  is  embedded  the  chain’s  limiting  distribution, 
the  target  posterior  distribution  can  be  recovered  in  the  sense  of  being  approached 
arbitrarily  closely. 

The current description is at a very general level. In practice, the choice of the
transition kernel is not unique and there are many possible chains one can construct. Some
choices may be better than others in terms of speed of convergence to the limiting
distribution. If convergence is found to be very slow and computationally expensive,
alternative chains may need to be substituted. Clearly, criteria are needed to determine
whether convergence has occurred and how close to the target distribution the chain is
at the $n$th stage.


13.5.2. Gibbs Sampler

We  begin  with  the  Gibbs  sampler,  a  member  of  the  MCMC  class  that  is  easy  to  describe 
and  implement. 

Let $\theta = [\theta_1 \; \theta_2]'$ have posterior density $p(\theta) = p(\theta_1, \theta_2)$, where for notational
simplicity we suppress dependence on $y$. If the conditional densities are known, which is
not guaranteed as knowledge of both $p(\theta_1|\theta_2)$ and $p(\theta_2|\theta_1)$ is necessary, then
alternating sequential draws from $p(\theta_1|\theta_2)$ and $p(\theta_2|\theta_1)$ in the limit converge to draws
from $p(\theta_1, \theta_2)$.


Example 

A simple illustration is to consider bivariate normal data with uniform prior for the
mean and known covariance matrix. Let $y = (y_1, y_2) \sim \mathcal{N}[\theta, \Sigma]$, where $\theta = [\theta_1 \; \theta_2]'$
and $\Sigma$ has diagonal entries 1 and off-diagonal entries $\rho$. Then given a uniform prior
for $\theta$ the posterior can be shown to be $\theta|y \sim \mathcal{N}[\bar{y}, N^{-1}\Sigma]$, a bivariate normal, where
$\bar{y} = (\bar{y}_1, \bar{y}_2)'$ is the sample mean of the $N$ observations. Since
the conditional posterior distributions are

$$\theta_1|\theta_2, y \sim \mathcal{N}[\bar{y}_1 + \rho(\theta_2 - \bar{y}_2), \; (1 - \rho^2)/N],$$

$$\theta_2|\theta_1, y \sim \mathcal{N}[\bar{y}_2 + \rho(\theta_1 - \bar{y}_1), \; (1 - \rho^2)/N],$$

we can iteratively sample from each conditional normal distribution using updated
values of $\theta_1$ and $\theta_2$. If the chain is run long enough then it will converge to the bivariate
normal. In this example it is easy to make direct draws from the joint posterior of $\theta|y$,
using Choleski's transformation given in Section 12.8, but in other examples it can be
possible to draw from the conditionals but not the joint posterior.
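
The alternating draws in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name, the simulated data (true means 1 and -1, correlation 0.6, 200 observations), the starting values, and the burn-in length are all assumptions made for the sketch.

```python
import math
import random
import statistics

def gibbs_bivariate_mean(ybar, rho, n_obs, n_draws=5000, burn_in=500, seed=0):
    """Alternate draws from theta1|theta2,y and theta2|theta1,y (rho known)."""
    rng = random.Random(seed)
    cond_sd = math.sqrt((1.0 - rho**2) / n_obs)   # same for both conditionals
    theta1, theta2 = 0.0, 0.0                     # arbitrary starting values
    draws = []
    for n in range(burn_in + n_draws):
        theta1 = rng.gauss(ybar[0] + rho * (theta2 - ybar[1]), cond_sd)
        theta2 = rng.gauss(ybar[1] + rho * (theta1 - ybar[0]), cond_sd)
        if n >= burn_in:                          # keep draws after the burn-in
            draws.append((theta1, theta2))
    return draws

# Simulated data (illustrative values only): unit variances, correlation rho.
data_rng = random.Random(42)
rho, n_obs = 0.6, 200
y = []
for _ in range(n_obs):
    z1, z2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    y.append((1.0 + z1, -1.0 + rho * z1 + math.sqrt(1 - rho**2) * z2))
ybar = (statistics.fmean(v[0] for v in y), statistics.fmean(v[1] for v in y))

draws = gibbs_bivariate_mean(ybar, rho, n_obs)
post_mean = (statistics.fmean(d[0] for d in draws),
             statistics.fmean(d[1] for d in draws))
post_var1 = statistics.pvariance([d[0] for d in draws])
```

The sample mean of the retained draws should be close to the sample mean of the data, and the sample variance of the draws of the first component close to 1/N, in line with the bivariate normal posterior stated above.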


Gibbs  Sampler 

More generally, consider a $q$-dimensional target distribution $p(\theta)$, where the notation
suppresses the dependence on data. Suppose that $\theta$ is partitioned into $d$ blocks. For
example, $\theta = [\beta' \; \sigma^2]'$ in a linear regression example. Let $\theta_k$ denote the $k$th block
and $\theta_{-k}$ denote all components of $\theta$ aside from $\theta_k$. Assume that the full conditional
distributions $p(\theta_k|\theta_{-k})$, $k = 1, \ldots, d$, are known. Then sequential sampling from the
full conditionals can be set up as follows:

1. Let the initial values of $\theta$ be $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_d^{(0)})$.

2. The next iteration involves sequentially revising all components of $\theta$ to yield
$\theta^{(1)} = (\theta_1^{(1)}, \ldots, \theta_d^{(1)})$ generated using $d$ draws from the $d$ conditional
distributions as follows:

$$p(\theta_1^{(1)} | \theta_2^{(0)}, \ldots, \theta_d^{(0)})$$

$$p(\theta_2^{(1)} | \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_d^{(0)})$$

$$\vdots$$

$$p(\theta_d^{(1)} | \theta_1^{(1)}, \theta_2^{(1)}, \ldots, \theta_{d-1}^{(1)}).$$

3. Return to step 1, reinitialize the vector $\theta$ at $\theta^{(1)}$, and cycle through step 2 again
to obtain the new draw $\theta^{(2)}$. Repeat the steps until convergence is achieved.

Gilks  et  al.  (1996,  p.  7)  provide  a  sketch  of  the  proof  of  the  statement  that  the 
stationary  distribution  is  the  posterior.  After  convergence  the  draws  are  from  the  target 
joint posterior. Geman and Geman (1984) showed that the stochastic sequence $\{\theta^{(n)}\}$
is  a  Markov  chain  with  the  correct  stationary  distribution.  Gelfand  and  Smith  (1990) 
showed  that,  under  some  conditions,  as  the  number  of  cycles  of  draws  from  the  full 
set  of  conditionals  tends  to  infinity,  the  chain  converges  to  the  stationary  posterior 
distribution.  See  also  Tanner  and  Wong  (1987).  Once  convergence  occurs,  numerous 
draws  can  be  made  and  used  to  calculate  sample  analogues  of  the  posterior  moments 
of  marginal  or  joint  distributions. 

The results mentioned here do not tell us how many cycles are needed for convergence,
which is model dependent. It is very important to ensure that a sufficient number
of cycles is executed for the chain to converge. A variety of diagnostic tests of
convergence are available. Because estimates of posterior moments should be based on
draws from the posterior distribution it is standard practice to discard the earlier results
from the chain, the so-called burn-in phase.

Sequential simulation algorithms can be modified so that each draw depends not
simply on the immediately preceding draw but also on earlier draws, the key requirement
being that the probability of improvement on the current approximation to the
posterior should be positive and (preferably) high. The attraction of the more restrictive
Markovian property is that it facilitates the proof that the transition distributions
converge to the target posterior.

For Bayesian analysis the Gibbs sampler is useful when the joint posterior is
intractable but the full conditional distributions are available in a convenient form. Many
applications use considerable ingenuity and knowledge of conjugate priors and related
Bayesian results, many from the earlier presimulation literature, to specify priors that
lead to known full conditional distributions.

We  consider  two  examples  that  apply  the  MCMC  methods. 


Linear  Regression  Example 

In Section 13.3.2 we analyzed the posterior distribution of the normal linear
homoskedastic regression model, given normal-gamma conjugate priors. The conditional
posterior of $\beta$ given $\sigma^2$ was shown to be multivariate normal, and the conditional
posterior of $\sigma^{-2}$ given $\beta$ is the gamma. Even though integration is feasible and we can
derive the posterior in an explicit form (see (13.32)) it is actually easier to use the
Gibbs sampler to draw a large sample from the joint posterior distribution. The chain
consists of recursive draws from the normal conditional on the precision parameter
$\sigma^{-2}$ and from the gamma distribution conditional on $\beta$.

The structure of the algorithm resembles that given later in Section 13.6 for a
slightly more complicated case of a two-equation seemingly unrelated regressions
model.

In many cases it would be natural to work with blocks of parameters. For example,
in a multiequation multivariate linear regression model with a nondiagonal
contemporaneous covariance matrix, the conditional mean parameters $(\beta_1, \beta_2, \ldots)$ form one
block of parameters, and $\Sigma$ forms a second. Then the full conditional distributions
will have the form $\beta_1, \beta_2, \ldots | \text{data}, \Sigma$ and $\Sigma | \text{data}, \beta_1, \beta_2, \ldots$. Chib and Greenberg
(1996, pp. 418-19) outline the Gibbs algorithm for this case.
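
A two-block chain of the kind described for the single-equation linear regression example can be sketched as follows. To keep the conditionals short, the sketch assumes independent priors, $\beta \sim \mathcal{N}[b_0, V_0]$ and $\sigma^{-2} \sim$ gamma with shape $a_0$ and rate $d_0$, rather than the dependent normal-gamma prior of Section 13.3.2. The function name, prior values, and simulated data are illustrative assumptions.

```python
import numpy as np

def gibbs_linear_regression(y, X, b0, V0, a0, d0,
                            n_draws=2000, burn_in=500, seed=0):
    """Two-block Gibbs sampler: beta | sigma^-2 (normal), sigma^-2 | beta (gamma).

    Assumes independent priors beta ~ N[b0, V0], sigma^-2 ~ Gamma(a0, rate d0).
    """
    rng = np.random.default_rng(seed)
    n_obs, k = X.shape
    V0_inv = np.linalg.inv(V0)
    XtX, Xty = X.T @ X, X.T @ y
    precision = 1.0                          # starting value for sigma^-2
    beta_draws = np.empty((n_draws, k))
    sigma2_draws = np.empty(n_draws)
    for n in range(burn_in + n_draws):
        # Block 1: beta | sigma^-2, y is multivariate normal.
        V1 = np.linalg.inv(V0_inv + precision * XtX)
        b1 = V1 @ (V0_inv @ b0 + precision * Xty)
        beta = rng.multivariate_normal(b1, V1)
        # Block 2: sigma^-2 | beta, y is gamma (numpy takes scale = 1/rate).
        resid = y - X @ beta
        precision = rng.gamma(a0 + 0.5 * n_obs,
                              1.0 / (d0 + 0.5 * resid @ resid))
        if n >= burn_in:
            beta_draws[n - burn_in] = beta
            sigma2_draws[n - burn_in] = 1.0 / precision
    return beta_draws, sigma2_draws

# Simulated data (illustrative): intercept 1, slope 2, unit error variance.
sim = np.random.default_rng(1)
n_obs = 500
X = np.column_stack([np.ones(n_obs), sim.standard_normal(n_obs)])
y = X @ np.array([1.0, 2.0]) + sim.standard_normal(n_obs)

beta_draws, sigma2_draws = gibbs_linear_regression(
    y, X, b0=np.zeros(2), V0=100.0 * np.eye(2), a0=0.01, d0=0.01)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
s2 = np.sum((y - X @ ols) ** 2) / (n_obs - 2)
```

With a diffuse prior such as this one, the posterior means of the draws should be close to the OLS coefficients and to the usual residual variance estimate.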


Hierarchical  Prior  Example 

The Gibbs sampler has been deployed with much success in the analysis of the
hierarchical prior model. From the structure of the linear hierarchical model given in
(13.39)-(13.41), it can be seen that formulating a Markov chain based on a full set of
conditionals is feasible in this case. The same general approach can be extended to a
nonlinear hierarchical prior model, although some additional steps are necessary if the
nonlinearity occurs in conjunction with a latent variable model (Albert, 1988).


13.5.3.  Metropolis  Algorithm 


The Gibbs sampler is the best-known MCMC algorithm. Its applicability is limited,
however, as it requires direct sampling from the full conditional distributions, which
may not be known. Two extensions that allow the MCMC to be applied more generally
are the Metropolis algorithm and the Metropolis-Hastings algorithm. Chib and
Greenberg (1995) provide a tutorial and references. The following summary is simpler
but avoids many details that are necessary if the reader seeks a more complete
understanding.

The Metropolis algorithm constructs a sequence $\{\theta^{(n)}, n = 1, 2, \ldots\}$ whose
distributions converge to the target posterior, assumed to be computable up to a normalizing
constant.

For notational simplicity we again suppress dependence of $p(\theta|y)$ on $y$. The
algorithm consists of the following steps:

1. Draw a starting point $\theta^{(0)}$ from an initial approximation to the posterior for which
$p(\theta^{(0)}) > 0$. For example, the draw may be from a multivariate $t$-distribution centered
on the mode of the marginal posterior distribution.

2. Next set $n = 1$. Draw $\theta^*$ from a symmetric jumping distribution $J_1(\theta^*|\theta^{(0)})$, with
the property that for any arbitrary pair $(\theta_a, \theta_b)$, $J_n(\theta_a|\theta_b) = J_n(\theta_b|\theta_a)$. An example is
$\theta^*|\theta^{(0)} \sim \mathcal{N}[\theta^{(0)}, V]$ for some fixed $V$. Symmetry of the jumping distribution leads to
simplicity but is not otherwise essential.

3. Calculate the ratio of densities $r = p(\theta^*)/p(\theta^{(0)})$.

4. Set

$$\theta^{(1)} = \begin{cases} \theta^* & \text{with probability } \min(r, 1), \\ \theta^{(0)} & \text{with probability } 1 - \min(r, 1), \end{cases}$$

which means that the draw $\theta^{(1)}$ is a draw from a mixture distribution with components
$\theta^*$ and $\theta^{(0)}$.

5. Return to step 2, increase the counter $n$ by one, and repeat steps 2 through 4 with
$\theta^{(n-1)}$ taking the place of $\theta^{(0)}$.

6. After a suitably large number of iterations apply the necessary checks for the
convergence of the distribution. If convergence has occurred the target posterior has been
recovered.
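
The steps above can be sketched for a scalar parameter as follows. This is an illustrative sketch under stated assumptions: the target is a hypothetical $\mathcal{N}[2, 0.5^2]$ density known only up to its normalizing constant, the jumping distribution is normal with a fixed standard deviation of 1, and the ratio $r$ is handled on the log scale to avoid numerical underflow.

```python
import math
import random

def metropolis(log_p, theta0, jump_sd, n_draws, seed=0):
    """Random-walk Metropolis for scalar theta; log_p is the log target
    density up to an additive constant."""
    rng = random.Random(seed)
    theta = theta0
    draws, accepted = [], 0
    for _ in range(n_draws):
        candidate = rng.gauss(theta, jump_sd)         # step 2: symmetric jump
        log_r = log_p(candidate) - log_p(theta)       # step 3: log of ratio r
        # Step 4: accept the candidate with probability min(r, 1).
        if log_r >= 0 or rng.random() < math.exp(log_r):
            theta = candidate
            accepted += 1
        draws.append(theta)        # a rejected move repeats the current value
    return draws, accepted / n_draws

def log_target(theta):
    """Hypothetical target: N[2, 0.5^2] without its normalizing constant."""
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

draws, accept_rate = metropolis(log_target, theta0=0.0, jump_sd=1.0,
                                n_draws=20000)
kept = draws[2000:]                                   # discard burn-in draws
post_mean = sum(kept) / len(kept)
post_sd = math.sqrt(sum((d - post_mean) ** 2 for d in kept) / len(kept))
```

Because only the difference of log densities enters, the normalizing constant of the target is never needed, which is the practical appeal of the algorithm.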

This algorithm can be viewed as an iterative method to maximize $p(\theta)$. If $\theta^*$
increases $p(\theta)$ then $\theta^{(n)} = \theta^*$ always, whereas if $\theta^*$ decreases $p(\theta)$ then $\theta^{(n)} = \theta^*$ with
probability $r < 1$.

The  algorithm  is  similar  in  spirit  to  accept-reject  sampling  (see  Section  12.8), 
though  there  is  no  requirement  here  that  a  fixed  multiple  of  the  jumping  distribution 
always  covers  the  posterior. 

The Metropolis algorithm generates a Markov chain that has properties of
reversibility, irreducibility, and Harris recurrence that ensure convergence to a stationary
distribution. Gelman et al. (1995) demonstrate that this stationary distribution is the
desired posterior $p(\theta)$ as follows. Suppose that $\theta^{(n-1)}$ is a draw from $p(\theta)$, and let $\theta_a$
and $\theta_b$ be two points such that $p(\theta_b) > p(\theta_a)$. If $\theta^{(n-1)} = \theta_a$ and $\theta^* = \theta_b$, then
$\theta^{(n)} = \theta_b$ with certainty and $\Pr[\theta^{(n)} = \theta_b, \theta^{(n-1)} = \theta_a] = J_n(\theta_b|\theta_a)p(\theta_a)$. If the order is
reversed and $\theta^{(n-1)} = \theta_b$ and $\theta^* = \theta_a$, then $\theta^{(n)} = \theta_a$ with probability
$r = p(\theta_a)/p(\theta_b)$ and $\Pr[\theta^{(n)} = \theta_a, \theta^{(n-1)} = \theta_b] = J_n(\theta_a|\theta_b)p(\theta_b)[p(\theta_a)/p(\theta_b)]
= J_n(\theta_a|\theta_b)p(\theta_a) = J_n(\theta_b|\theta_a)p(\theta_a)$ given the
assumption of symmetric jumping distribution. The marginal distributions of $\theta^{(n)}$ and
$\theta^{(n-1)}$ are therefore equal, since their joint distribution is symmetric, so $p(\theta)$ is the
stationary distribution of the Markov chain.


13.5.4.  The  Metropolis-Hastings  Algorithm 

The performance of the Metropolis algorithm varies with the choice of initial
approximating distribution and choice of jumping distribution. A potential problem is that the
Metropolis algorithm may be slow, as would be the case if the move from the current
to a new value is not made sufficiently often, causing the chain to move infrequently.
The algorithm can be sped up by permitting use of jumping distributions that are
not symmetric.

The Metropolis-Hastings (M-H) algorithm is the same as the Metropolis algorithm,
except that in step 2 the jumping distribution need not be symmetric, and in
step 3 the ratio $r$ that determines the acceptance probability becomes, for general $n$,

$$r_n = \frac{p(\theta^*)/J_n(\theta^*|\theta^{(n-1)})}{p(\theta^{(n-1)})/J_n(\theta^{(n-1)}|\theta^*)} = \frac{p(\theta^*)\,J_n(\theta^{(n-1)}|\theta^*)}{p(\theta^{(n-1)})\,J_n(\theta^*|\theta^{(n-1)})}.$$

The remaining steps are executed with this revised definition. Note that if any
normalizing constants are present in either $p(\cdot)$ or $J_n(\cdot)$, then they cancel in this definition
of $r_n$. So both posterior and jumping probabilities need only be computed up to this
constant. See Hastings (1970).


13.5.5. M-H Examples

Different jumping distributions lead to different M-H algorithms with different
efficiency in terms of the number of draws needed to obtain the desired draws from
the posterior. We give several examples, noting that there are few general guidelines
available for choice of jumping distribution, except to use the Gibbs sampler wherever
possible.

The Gibbs sampler is a special case of the M-H algorithm. If $\theta$ is partitioned into $d$
blocks, then there are $d$ Metropolis steps at the $n$th step of the algorithm. The jumping
distribution is the conditional distribution given in Section 13.5.2 and it can be shown
that the acceptance probability is always 1. Gibbs sampling is also called alternating
conditional sampling.

It  is  possible  to  use  mixed  strategies,  whereby  different  transition  kernels  are  used 
for  different  subsets  of  parameters.  For  example,  an  M-H  step  can  be  combined  with 
a  Gibbs  sampler,  the  latter  being  used  for  components  for  which  direct  sampling  is 
feasible. 

The independence chain makes all draws from a fixed density $g(\theta)$, say, in which
case the acceptance probability simplifies to the ratio $r_n = w(\theta^*)/w(\theta^{(n-1)})$ of
importance weights $w(\theta) = p(\theta)/g(\theta)$. A random walk chain sets the draw $\theta^* = \theta^{(n-1)} + \varepsilon$,
where $\varepsilon$ is a draw from $g(\varepsilon)$.
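
A minimal sketch of an M-H step with a jumping distribution that need not be symmetric, used here as an independence chain, is given below. The target (a gamma kernel with shape 3 and rate 1) and the exponential proposal with rate 0.5 are hypothetical choices made for illustration, as are all function names.

```python
import math
import random

def metropolis_hastings(log_p, log_jump, draw_jump, theta0, n_draws, seed=0):
    """M-H sampler for scalar theta.

    log_jump(to, frm) is the log jumping density of `to` given `frm`, and
    draw_jump(rng, frm) draws a candidate given the current value `frm`.
    Both densities need only be known up to additive constants.
    """
    rng = random.Random(seed)
    theta = theta0
    draws = []
    for _ in range(n_draws):
        cand = draw_jump(rng, theta)
        # log r_n = log[p(cand) J(theta|cand)] - log[p(theta) J(cand|theta)]
        log_r = (log_p(cand) + log_jump(theta, cand)
                 - log_p(theta) - log_jump(cand, theta))
        if log_r >= 0 or rng.random() < math.exp(log_r):
            theta = cand
        draws.append(theta)
    return draws

def log_gamma_kernel(theta):
    """Hypothetical target: gamma(shape 3, rate 1) up to a constant."""
    return 2.0 * math.log(theta) - theta if theta > 0 else -math.inf

def log_exp_kernel(to, frm):
    """Independence chain: the jumping density ignores the current value."""
    return -0.5 * to

def draw_exp(rng, frm):
    return rng.expovariate(0.5)

draws = metropolis_hastings(log_gamma_kernel, log_exp_kernel, draw_exp,
                            theta0=1.0, n_draws=21000)
kept = draws[1000:]                       # discard burn-in draws
post_mean = sum(kept) / len(kept)
```

With an independence chain the log ratio computed in the loop equals the difference of log importance weights, $\ln w(\theta^*) - \ln w(\theta^{(n-1)})$. The mean of the retained draws should be close to 3, the mean of the assumed gamma target.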

Gelman et al. (1995, p. 334) consider simulating the $q$-variate normal with variance
$\Sigma$. For a Metropolis algorithm with jumping distribution $\theta^*|\theta^{(n-1)} \sim \mathcal{N}[\theta^{(n-1)}, c^2\Sigma]$,
the choice $c = 2.4/\sqrt{q}$ leads to greatest efficiency relative to direct draws from the
$q$-variate normal. The efficiency is about 0.3, compared to $1/q$ for the Gibbs sampler
in the case that $\Sigma = \sigma^2 I_q$.


13.6.  MCMC  Example:  Gibbs  Sampler  for  SUR 


We illustrate the application of the Gibbs sampler to the analysis of the seemingly
unrelated regression model. This example is slightly more challenging than an
application to single-equation regression, because errors correlated across equations are
introduced.

We consider a two-equation example with $i$th observation

$$y_{1i} = x_{1i}'\beta_1 + \varepsilon_{1i},$$
$$y_{2i} = x_{2i}'\beta_2 + \varepsilon_{2i},$$

where $(\varepsilon_{1i}, \varepsilon_{2i})$ are bivariate normal with zero mean and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}.$$

Combining the two equations gives the $i$th observation

$$y_i = x_i'\beta + \varepsilon_i,$$

where $y_i = (y_{1i}, y_{2i})'$, $\beta = (\beta_1', \beta_2')'$, $x_i$ is the block-diagonal matrix with diagonal
blocks $x_{1i}$ and $x_{2i}$, and $\varepsilon_i \sim \mathcal{N}[0, \Sigma]$. In summary, the dgp is

$$y_i | x_i, \beta, \Sigma \sim \mathcal{N}[x_i'\beta, \Sigma]$$

and interest lies in estimating the posterior means of the regression parameters $\beta$ and
variance parameters $\Sigma$, given data $y$, $X$.

We consider independent informative priors, with

$$\beta \sim \mathcal{N}[\beta_0, B_0^{-1}],$$
$$\Sigma^{-1} \sim \text{Wishart}[\nu_0, D_0],$$

where $B_0$ is defined as precision, the inverse of the prior variance, and the inverse
Wishart, defined in Section 13.3.5, is a generalization of the inverse gamma. An
alternative approach, not taken here, uses dependent priors similar to those in Section
13.3.2, in which case $\beta|\Sigma \sim \mathcal{N}[\beta_0, \ldots]$ for specified $\omega_0$.

Performing some algebra yields the conditional posteriors

$$\beta | \Sigma, y, X \sim \mathcal{N}\left[ C_0 \left( B_0\beta_0 + \sum_{i=1}^{N} x_i \Sigma^{-1} y_i \right), \; C_0 \right],$$

$$\Sigma^{-1} | \beta, y, X \sim \text{Wishart}\left[ \nu_0 + N, \; \left( D_0^{-1} + \sum_{i=1}^{N} u_i u_i' \right)^{-1} \right],$$

where $C_0 = (B_0 + \sum_{i=1}^{N} x_i \Sigma^{-1} x_i')^{-1}$ and $u_i = y_i - x_i'\beta$. The Gibbs sampler can be
used since the conditional posteriors are known and sampling from both distributions
is straightforward.

For a simulation example we let the regressors in each equation be an intercept
plus a single scalar regressor, different in the two equations, generated from a
standard normal. Then $y_1$ and $y_2$ are generated with the four regression parameters
$\beta_{11} = \beta_{12} = \beta_{21} = \beta_{22} = 1$, the error variances $\sigma_{11} = \sigma_{22} = 1$, and the error
covariance $\sigma_{12} = \sigma_{21} = -0.5$. The sample size is either $N = 1{,}000$ or $N = 10{,}000$. Given
these data, we present Bayesian estimates of the parameters, where the prior
distributions set $\beta_0 = 0$, $B_0^{-1} = \tau I$, $D_0 = I$, and $\nu_0 = 5$. To check the impact of different
priors three values of $\tau$ are considered, $\tau = 10$, $\tau = 1$, and $\tau = 1/10$, with smaller
values of $\tau$ corresponding to tighter priors.

The Gibbs sampler makes draws recursively from the conditional posterior
distributions. We discard the first 5,000 replications that constitute the "burn-in" phase and
report results using the subsequent 50,000 or 100,000 replications.
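
The two-block sampler for this design can be sketched as below. It is a hedged illustration, not the code used for the reported results: the array layout (`X[i]` is the 2 x 4 matrix that maps $\beta$ into the two equation means for observation $i$), the construction of Wishart draws as a sum of outer products of normal vectors (valid for integer degrees of freedom), and the much shorter run length are all assumptions made to keep the example small and fast.

```python
import numpy as np

def draw_wishart(rng, df, scale):
    """Wishart(df, scale) draw for integer df: sum of df outer products."""
    z = rng.multivariate_normal(np.zeros(scale.shape[0]), scale, size=df)
    return z.T @ z

def gibbs_sur(Y, X, beta0, B0, nu0, D0, n_draws=1000, burn_in=200, seed=0):
    """Gibbs sampler for SUR. Y is (N, 2); X is (N, 2, K); B0 is a precision."""
    rng = np.random.default_rng(seed)
    n_obs = Y.shape[0]
    D0_inv = np.linalg.inv(D0)
    prec = np.eye(Y.shape[1])                 # starting value for Sigma^-1
    beta_draws = np.empty((n_draws, X.shape[2]))
    sigma_draws = np.empty((n_draws,) + prec.shape)
    for n in range(burn_in + n_draws):
        # beta | Sigma, y, X is normal with variance C0.
        C0 = np.linalg.inv(B0 + np.einsum('nji,jk,nkl->il', X, prec, X))
        mean = C0 @ (B0 @ beta0 + np.einsum('nji,jk,nk->i', X, prec, Y))
        beta = rng.multivariate_normal(mean, C0)
        # Sigma^-1 | beta, y, X is Wishart[nu0 + N, (D0^-1 + sum u_i u_i')^-1].
        U = Y - X @ beta                      # residuals, shape (N, 2)
        prec = draw_wishart(rng, nu0 + n_obs,
                            np.linalg.inv(D0_inv + U.T @ U))
        if n >= burn_in:
            beta_draws[n - burn_in] = beta
            sigma_draws[n - burn_in] = np.linalg.inv(prec)
    return beta_draws, sigma_draws

# Simulated data following the design in the text, with N = 1,000, tau = 10.
sim = np.random.default_rng(123)
N, tau = 1000, 10.0
X = np.zeros((N, 2, 4))
X[:, 0, 0], X[:, 0, 1] = 1.0, sim.standard_normal(N)
X[:, 1, 2], X[:, 1, 3] = 1.0, sim.standard_normal(N)
Sigma = np.array([[1.0, -0.5], [-0.5, 1.0]])
Y = X @ np.ones(4) + sim.multivariate_normal(np.zeros(2), Sigma, size=N)

beta_draws, sigma_draws = gibbs_sur(Y, X, beta0=np.zeros(4),
                                    B0=np.eye(4) / tau, nu0=5, D0=np.eye(2))
beta_mean = beta_draws.mean(axis=0)
sigma_mean = sigma_draws.mean(axis=0)
```

Posterior means computed this way should lie near the values used to generate the data, although they will not reproduce any particular table entry because the simulated sample differs.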

A selection of the results is given in Table 13.3, which reports the mean and standard
deviation of the marginal posterior distribution of each coefficient in five different
samples that themselves are independent draws. The first three columns present a
sensitivity analysis for different values of $\tau$, which shows that the results are not very
sensitive. The fourth column, compared to the first, shows that doubling the number of
replications has very little effect. The fifth column, compared to the first, shows that
increasing the sample size tenfold to 10,000 greatly increases the precision as expected,
reducing the standard deviation of the coefficient by a factor of more than 3, but with
relatively small impact on the point estimates.


Table 13.3. Gibbs Sampling: Seemingly Unrelated Regressions Example^a

Prior parameter $\tau$                  10         1          1/10       10         10
Sample size N                         1,000      1,000      1,000      1,000      10,000
Gibbs sample replications             50,000     50,000     50,000     100,000    100,000

$\beta_{11}$ (eq. 1 intercept)           0.971      1.013      0.983      1.020      1.010
                                      (0.0310)   (0.0312)   (0.0316)   (0.0324)   (0.0100)
$\beta_{12}$ (eq. 1 slope)               1.026      0.9835     1.006      1.006      1.015
                                      (0.0265)   (0.0271)   (0.0265)   (0.0268)   (0.0086)
$\beta_{21}$ (eq. 2 intercept)           1.016      0.972      0.993      1.017      0.991
                                      (0.0309)   (0.0325)   (0.0322)   (0.0326)   (0.0100)
$\beta_{22}$ (eq. 2 slope)               0.983      0.992      0.979      1.005      1.007
                                      (0.0256)   (0.0285)   (0.0272)   (0.0277)   (0.0085)
$\sigma_{11}$ (eq. 1 error variance)     0.960      0.969      1.012      1.043      1.010
                                      (0.0429)   (0.0434)   (0.0453)   (0.0466)   (0.0143)
$\sigma_{12}$ (error covariance)         -0.499     -0.507     -0.519     -0.576     -0.515
                                      (0.0340)   (0.0358)   (0.0368)   (0.0379)   (0.0113)
$\sigma_{22}$ (eq. 2 error variance)     0.950      1.066      1.049      1.062      1.002
                                      (0.425)    (0.0476)   (0.0467)   (0.0472)   (0.0141)

^a Model is a two-equation seemingly unrelated regression. Table gives the mean and, in
parentheses, the standard deviation of the posterior distribution for each parameter.
Smaller values of $\tau$ correspond to tighter priors.


One way to check for convergence is to look at the means and standard deviations
of the output and see whether they drift or stay at the same level. If the change is
small, say less than 0.1 for 10,000 replications, then convergence is presumed. One
also might look at several chains at a time. The draws will always be correlated but the
important question is how fast the autocorrelation function decays to zero. Sometimes
slow decay cannot be fixed and is simply inherent to the algorithm. One can also
take every tenth or hundredth observation to reduce serial correlation.
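
The autocorrelation check and the thinning device just described can be sketched as follows. Since no sampler output is available here, a hypothetical AR(1) series with coefficient 0.9 stands in for a correlated chain of draws of a scalar parameter; the function name and the lag limit of 20 are likewise illustrative choices.

```python
import random

def autocorrelations(draws, max_lag=20):
    """Sample autocorrelation coefficients of a scalar chain, lags 1..max_lag."""
    n = len(draws)
    mean = sum(draws) / n
    dev = [d - mean for d in draws]
    total = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t + lag] for t in range(n - lag)) / total
            for lag in range(1, max_lag + 1)]

# Hypothetical stand-in for MCMC output: an AR(1) chain with coefficient 0.9.
rng = random.Random(1)
chain, value = [], 0.0
for _ in range(50000):
    value = 0.9 * value + rng.gauss(0, 1)
    chain.append(value)

acf_full = autocorrelations(chain)
acf_thinned = autocorrelations(chain[::10])    # keep every tenth draw
```

For this stand-in chain the first-order autocorrelation falls from roughly 0.9 to roughly $0.9^{10} \approx 0.35$ after thinning, which shows that thinning lowers serial correlation but need not remove it.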

To check whether the Gibbs sampler has converged to the stationary posterior
distribution in the present case, we compute the first 20 autocorrelation coefficients of
draws from the posterior after convergence for each coefficient. Lack of convergence
would be indicated by the presence of serial correlation in the draws from the target
distribution. When the number of replications is small, say 1,000, the autocorrelation
coefficients are found to be as high as 0.06 in some cases. However, when the number
of replications is 50,000 and greater, there is virtually no evidence of serial correlation
up to order 20, and correlation disappears with the order. In most cases the estimates
are smaller than 0.005. It is easy to verify that for $N = 1{,}000$, the prior parameter $\tau$
has very little impact on the posterior. This computation is very simple and takes little
more than a few seconds.


13.7.  Data  Augmentation 

The Gibbs sampler can sometimes be applied to a wider range of models by introduction
of auxiliary variables. In particular, this is the case for models involving latent
variables, such as discrete choice models, truncated and censored models, and finite
mixture models introduced in later chapters.

In the scalar case the latent dependent variable $y^*$ is not observed; instead, we
observe only $y = g(y^*)$ for some specified function $g$. For example, in a logit or probit
model (see Chapter 14) we may observe only whether $y^*$ is positive or negative, in
which case $y = \mathbf{1}(y^* > 0)$ and we observe $y = 1$ if $y^* > 0$ and $y = 0$ if $y^* \leq 0$.

Bayesian  analysis  of  latent  variable  models,  and  especially  the  application  of  the 
Gibbs  sampler,  is  greatly  aided  by  the  replacement  of  the  latent  variable  by  imputed 
values.  This  step  is  feasible  if  we  can  write  down  the  predictive  density  of  the  latent 
variables  in  terms  of  the  observed  variables.  The  procedure  of  adding  imputed  values 
as  if  they  were  observed  data  is  called  data  augmentation.  (An  example  was  given 
in  Section  10.3.7  where  the  EM  algorithm  was  exposited.)  The  essential  insight,  due 
to  Tanner  and  Wong  (1987),  is  that  the  posterior  based  only  on  the  observed  data  is 
intractable,  but  that  obtained  after  data  augmentation  is  often  tractable  using  the  Gibbs 
sampler. 

Consider the posterior expressed in terms of both directly observed variables $y$ and
the latent variables $y^*$,

$$p(\theta|y) = \int_{y^*} p(\theta|y, y^*)\, f(y^*|y)\, dy^*, \tag{13.49}$$

where the right-hand-side integral may be interpreted as an averaging operation with
respect to $y^*$.

Analogous to the EM method, data augmentation involves cycling between an
imputation step, I-step, and a posterior step, P-step.

In the imputation step we make draws from the full conditional density of $y^*$. This
averages over the parameters $\psi$ that appear in the probability distribution that links $y^*$
and $y$. The predictive distribution is

$$f(y^*|y) = \int f(y^*|y, \psi)\, f(\psi|y)\, d\psi. \tag{13.50}$$

Given the current draw from $p(\theta|y)$ we can make a draw of $y^*$ from $f(y^*|y)$, repeating
both parts of the step $m$ times to generate $m$ multiple imputations $y_i^*$, $i = 1, \ldots, m$.
This completes the I-step.

Given the augmented data from the I-step, the P-step is implemented by updating
the current approximation to $p(\theta|y)$; thus,

$$\text{updated } p(\theta|y) = \frac{1}{m} \sum_{i=1}^{m} p(\theta|y, y_i^*). \tag{13.51}$$


Then  the  algorithm  returns  to  the  I-step. 
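
For a probit model the I-step and P-step can be sketched with a single imputation per cycle ($m = 1$). The sketch makes several assumptions that are not taken from the text: a flat prior on the coefficients (called `beta` here), unit error variance so that the latent variable given the coefficients is normal with unit variance and truncated by the sign implied by $y$, a simple rejection method for the truncated draws, and simulated data with illustrative coefficient values.

```python
import numpy as np

def draw_latent(rng, mean, y):
    """I-step: draw y*_i ~ N[mean_i, 1] truncated to y*_i > 0 when y_i = 1
    and to y*_i <= 0 when y_i = 0 (rejection; adequate for moderate means)."""
    y_star = np.empty_like(mean)
    todo = np.arange(mean.size)
    while todo.size > 0:
        cand = mean[todo] + rng.standard_normal(todo.size)
        ok = (cand > 0) == (y[todo] == 1)
        y_star[todo[ok]] = cand[ok]
        todo = todo[~ok]                     # redraw only the rejected cases
    return y_star

def probit_data_augmentation(y, X, n_draws=1500, burn_in=300, seed=0):
    """Cycle between imputing y* and drawing beta | y*, assuming a flat prior."""
    rng = np.random.default_rng(seed)
    XtX_inv = np.linalg.inv(X.T @ X)
    proj = XtX_inv @ X.T
    beta = np.zeros(X.shape[1])
    draws = np.empty((n_draws, X.shape[1]))
    for n in range(burn_in + n_draws):
        y_star = draw_latent(rng, X @ beta, y)                  # I-step
        beta = rng.multivariate_normal(proj @ y_star, XtX_inv)  # P-step
        if n >= burn_in:
            draws[n - burn_in] = beta
    return draws

# Simulated probit data (illustrative): intercept 0.5, slope 1.0.
sim = np.random.default_rng(7)
n_obs = 2000
X = np.column_stack([np.ones(n_obs), sim.uniform(-1.0, 1.0, n_obs)])
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true + sim.standard_normal(n_obs) > 0).astype(int)

draws = probit_data_augmentation(y, X)
post_mean = draws.mean(axis=0)
```

Once the latent values are imputed, the P-step is an ordinary normal linear regression draw, which is why augmentation makes the Gibbs sampler convenient for this class of models.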

If $m = 1$, the procedure amounts to performing integration in (13.49) by Gibbs
sampling. If $m$ is chosen to be sufficiently large, the posterior distribution is
approximated better. An extended example of the data augmentation method applied to the
missing data problem is given in Chapter 26.


13.8. Bayesian Model Selection


Chapters 7 and 8 dealt with issues of hypothesis testing, specification diagnostics, and
model comparison from a frequentist viewpoint. In this section we consider the
principal tool, Bayes factors, that is used in Bayesian analysis to evaluate the strength
of evidence in favor of the null hypothesis (model). It also serves as a criterion for
model selection, irrespective of whether nested or nonnested pairs of models are
under consideration. In the econometrics literature, Zellner (1971, 1978) provided an
early discussion in the context of model selection. Our treatment is based on Kass and
Raftery's (1995) review article.

Denote the data by $y$ and the two hypotheses under consideration, possibly
nonnested, by $H_1$ and $H_2$. Prior probabilities of the two hypotheses are $\Pr[H_1]$ and
$\Pr[H_2] = 1 - \Pr[H_1]$. The corresponding dgps are $\Pr[y|H_1]$ and $\Pr[y|H_2]$. The prior
probabilities of the models are transformed to posterior probabilities by the sample
evidence as reflected in the likelihood. By Bayes' Theorem

$$\Pr[H_k|y] = \frac{\Pr[y|H_k]\Pr[H_k]}{\Pr[y|H_1]\Pr[H_1] + \Pr[y|H_2]\Pr[H_2]}, \quad k = 1, 2, \tag{13.52}$$

and the posterior odds ratio is

$$\frac{\Pr[H_1|y]}{\Pr[H_2|y]} = \frac{\Pr[y|H_1]\Pr[H_1]}{\Pr[y|H_2]\Pr[H_2]} \equiv B_{12}\frac{\Pr[H_1]}{\Pr[H_2]}, \tag{13.53}$$

where $B_{12} = \Pr[y|H_1]/\Pr[y|H_2]$ is called the Bayes factor. Hypothesis 1 is preferred
if the posterior odds ratio exceeds one. The right-hand side of (13.53) expresses the
posterior odds ratio as the product of the Bayes factor and the prior odds. If a priori the
two models are equally probable, so $\Pr[H_1] = \Pr[H_2]$, then the Bayes factor equals
the posterior odds in favor of $H_1$. If several hypotheses are involved the Bayes factor
can be computed for all pairs of hypotheses. The Bayes factor is defined even if the
hypotheses are not nested.
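
The arithmetic of (13.52) and (13.53) can be sketched as a small helper. The function name and the log marginal likelihood values used in the example are hypothetical; working with logs is a choice made here because marginal likelihoods are often extremely small numbers.

```python
import math

def posterior_model_probs(log_marglik_1, log_marglik_2, prior_1=0.5):
    """Bayes factor B12, posterior odds, and posterior probabilities of H1, H2.

    Inputs are ln Pr[y|H1], ln Pr[y|H2], and the prior probability Pr[H1];
    Pr[H2] is taken to be 1 - Pr[H1].
    """
    bayes_factor = math.exp(log_marglik_1 - log_marglik_2)
    posterior_odds = bayes_factor * prior_1 / (1.0 - prior_1)
    prob_1 = posterior_odds / (1.0 + posterior_odds)
    return bayes_factor, posterior_odds, prob_1, 1.0 - prob_1

# Hypothetical log marginal likelihoods for two models, equal prior odds.
b12, odds, p1, p2 = posterior_model_probs(-100.0, -102.0)
```

With equal prior probabilities the posterior odds equal the Bayes factor, here $e^2 \approx 7.4$, so that the posterior probability of $H_1$ is about 0.88.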

The Bayes factor has the form of a likelihood ratio. It depends on unknown parameters,
denoted by vectors $\theta_1$ and $\theta_2$, that are eliminated by averaging or integrating over
the parameter space with respect to the prior, so

$$\Pr[y|H_k] = \int \Pr[y|\theta_k, H_k]\, \pi(\theta_k|H_k)\, d\theta_k, \quad k = 1, 2. \tag{13.54}$$

From Section 13.2.5, this equation gives the marginal and the predictive probability of
the data given the prior distribution.
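
The averaging over the prior in (13.54) can be illustrated with a case in which the integral is also known in closed form, so the simulation can be checked. The example is an assumption made for illustration and does not come from the text: binomial data (7 successes in 20 trials) with a Beta(1, 1) prior under $H_1$ and a Beta(20, 20) prior under $H_2$.

```python
import math
import random

def log_binom_pmf(k, n, theta):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1.0 - theta))

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood_mc(k, n, a, b, n_sim=50000, seed=0):
    """Average Pr[y|theta] over draws of theta from the Beta(a, b) prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        theta = rng.betavariate(a, b)
        if 0.0 < theta < 1.0:                # guard against boundary draws
            total += math.exp(log_binom_pmf(k, n, theta))
    return total / n_sim

def marginal_likelihood_exact(k, n, a, b):
    """Closed-form beta-binomial marginal likelihood, used as a check."""
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + log_beta_fn(k + a, n - k + b) - log_beta_fn(a, b))

k, n = 7, 20
exact_1 = marginal_likelihood_exact(k, n, 1.0, 1.0)
exact_2 = marginal_likelihood_exact(k, n, 20.0, 20.0)
mc_1 = marginal_likelihood_mc(k, n, 1.0, 1.0)
mc_2 = marginal_likelihood_mc(k, n, 20.0, 20.0, seed=1)
bayes_factor_12 = exact_1 / exact_2
```

Note that the binomial coefficient is retained throughout: constants that could be dropped when evaluating a posterior must be kept when computing a marginal likelihood.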

A  complication  is  that  this  expression  depends  on  all  the  constants  that  appear  in  the 
likelihood.  These  constants  can  be  neglected  when  evaluating  the  posterior,  but  they 
are  required  for  the  computation  of  the  Bayes  factor.  The  integral  in  (13.54)  may  need 
to  be  numerically  evaluated  if  it  does  not  have  an  explicit  solution  using,  for  example, 
importance  sampling.  There  is  a  substantial  literature,  reviewed  in  Kass  and  Raftery 
(1995),  on  the  computation  of  the  Bayes  factor  that  we  will  not  pursue  here.  We  note 
that  there  are  some  asymptotic  approximations  to  the  Bayes  factors  that  are  readily 
computable using output from packages that maximize likelihoods.


Table 13.4. Interpretation of Bayes Factors

Bayes Factor $B_{12}$     $2\ln(B_{12})$     Evidence against $H_2$
1 to 3                   0 to 2            weak
3 to 20                  2 to 6            positive
20 to 150                6 to 10           strong
>150                     >10               very strong


Interpretation of the Bayes factor $B_{12}$ is in terms of evidence against $H_2$. "The Bayes
factor is a summary of the evidence provided by the data in favor of one scientific
theory, represented by a statistical model, as opposed to another" (Kass and Raftery,
1995, p. 777). In the frequentist analysis twice the log-likelihood ratio is an often-used
quantity. Similarly, twice the log of the Bayes factor is a criterion used in evaluating
the evidence. Kass and Raftery present the following categorization of the strength of
evidence against the null model that they have found useful in their own work; see
Table 13.4.
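
The categories of Table 13.4 can be applied mechanically, as in the sketch below. The function name is hypothetical, the cutoffs are applied on the $2\ln B_{12}$ scale, and the label returned when $B_{12} < 1$ is an addition made here, since the table covers only Bayes factors of one or more.

```python
import math

def evidence_category(bayes_factor_12):
    """Classify B12 = Pr[y|H1] / Pr[y|H2] using the 2 ln(B12) cutoffs."""
    two_ln_b = 2.0 * math.log(bayes_factor_12)
    if two_ln_b < 0:
        return "favors H2"       # outside the table: the data favor H2
    if two_ln_b <= 2:
        return "weak"
    if two_ln_b <= 6:
        return "positive"
    if two_ln_b <= 10:
        return "strong"
    return "very strong"

labels = [evidence_category(b) for b in (0.5, 2.0, 10.0, 100.0, 1000.0)]
```

The two scales in the table agree only approximately (for example, $2\ln 3 \approx 2.2$), so a Bayes factor very close to a boundary may be classified differently depending on which column is used.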

Suppose that two models under comparison are nested. Denote by $H_0$ the
constrained model and $H_1$ the model that is unconstrained. A pairwise comparison of the
two models using the posterior odds ratio requires computation of the Bayes factor, as
shown earlier. The Bayes factor for the null hypothesis model is defined as

$$B_{01} = \frac{m(y|H_0)}{m(y|H_1)},$$

where $m(y|H_j)$ is the marginal likelihood of the model specification $H_j$. If the models
$H_0$ and $H_1$ are nested, then the Savage-Dickey density ratio approach (see Verdinelli
and Wasserman, 1995) can be taken to calculate the Bayes factors.

An important insight due to Chib (1995) has made the computation of Bayes factors
a great deal easier than suggested by the earlier literature, irrespective of whether the
models are nested or nonnested. His approach consists of two related ideas. The first
rewrites the marginal density $m(y)$, for a given model $H_k$, as a ratio

$$m(y) = \frac{f(y|\theta)\,\pi(\theta)}{\pi(\theta|y)}, \tag{13.54}$$

where the numerator is the product of the density (inclusive of constants) and the prior,
and the denominator is the posterior density of $\theta$. This result is a rearrangement of the
terms in equation (13.1), with the qualification that we have used the notation $m(y)$ in
place of $f(y)$ or $\Pr[y|H_k]$ used earlier; it merely states that the marginal density is the
normalizing constant. Second, after a successful application of an MCMC algorithm,
we will have available a Monte Carlo estimate of the posterior density $\pi(\theta|y)$
at a given point $\theta$. Then it follows that

$$\ln m(y) = \ln f(y|\theta) + \ln \pi(\theta) - \ln \pi(\theta|y). \tag{13.55}$$

Therefore, given estimates of the terms on the right-hand side, the marginal density can
be readily computed using the output from a Gibbs sampler. This approach has been
extended in Chib and Jeliazkov (2001) to the case where the output is instead from a
Metropolis-Hastings algorithm.
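
The identity for $\ln m(y)$ can be checked numerically in a case where every term is available in closed form. The example below is an illustrative assumption: binomial data with a beta prior, for which the posterior is also beta, so no MCMC estimate of the posterior ordinate is required. In applications the last term would instead be estimated from sampler output.

```python
import math

def log_binom_pmf(k, n, theta):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1.0 - theta))

def log_beta_pdf(theta, a, b):
    return ((a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

def log_marginal_from_identity(k, n, a, b, theta):
    """ln m(y) = ln f(y|theta) + ln prior(theta) - ln posterior(theta|y)."""
    return (log_binom_pmf(k, n, theta)
            + log_beta_pdf(theta, a, b)
            - log_beta_pdf(theta, k + a, n - k + b))

# 7 successes in 20 trials with a Beta(1, 1) prior: m(y) = 1/21 exactly.
values = [log_marginal_from_identity(7, 20, 1.0, 1.0, theta)
          for theta in (0.2, 0.35, 0.5)]
```

The three evaluation points give the same value, $\ln(1/21)$, which illustrates that the identity holds at any point $\theta$. When the posterior ordinate has to be estimated, a point of high posterior density is the natural choice because the estimate is most precise there.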

In complex and highly parameterized models, the computation of the Bayes factor
is a nontrivial matter. However, it can be shown that the Schwarz criterion, also known
as the Bayes information criterion (see Section 8.5), gives a rough approximation to
the log of the Bayes factor. Recall that $\mathrm{BIC} = -2\ln L(\hat{\theta}_{\mathrm{ML}}) + (\ln N)q$,
where $q$ is the number of parameters. This is easy to
compute if the value of the log-likelihood is available.
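
A minimal sketch of this calculation follows. The log-likelihood values, sample size, and parameter counts are hypothetical numbers chosen only for illustration, and the function names are not from the text. The approximation used is the standard rough one in which the log Bayes factor for model 0 against model 1 is minus one half of the difference in BIC.

```python
import math


def bic(log_lik, n_obs, n_params):
    """BIC = -2 ln L + (ln N) q."""
    return -2.0 * log_lik + math.log(n_obs) * n_params


def approx_log_bayes_factor(bic_0, bic_1):
    """Rough approximation: ln B01 is about -(BIC_0 - BIC_1) / 2."""
    return -0.5 * (bic_0 - bic_1)


if __name__ == "__main__":
    # Hypothetical fitted models: H0 has 2 parameters, H1 has 3.
    bic_0 = bic(log_lik=-210.0, n_obs=500, n_params=2)
    bic_1 = bic(log_lik=-205.0, n_obs=500, n_params=3)
    # A negative value is evidence against H0 relative to H1.
    print(approx_log_bayes_factor(bic_0, bic_1))
```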

From  (13.52)  it  is  obvious  that  the  ratio  of  prior  probabilities  of  the  model  plays  a 
role  in  evaluating  the  evidence  against  the  null.  In  many  situations,  the  investigator  may 
have  little  to  go  on  in  assigning  these  probabilities.  This  consideration  has  received 
some  attention  in  the  literature  that  deals  with  the  sensitivity  of  the  Bayes  factor  to  the 
prior  model  probabilities. 


13.9.  Practical  Considerations 

The use of Markov chain methods has now become dominant in the Bayesian literature.
Because the methods are computer intensive, good software is essential. At
the time of writing, the WinBUGS package, a later version of the BUGS (Bayesian
inference Using Gibbs Sampling) package (Gilks et al., 1996), has been widely recommended
and found to be especially useful for hierarchical models and missing data
problems. It is available at the BUGS Web site. More detailed information about other
Bayesian software can be found in Gamerman (1997, Section 5.6).

The issue of how long to run the chain continues to be an active area of research.
Diagnostic checks for convergence are available and have been mentioned, but they often
do not have universal applicability. Cappe and Robert (2000) provide a review of the
issues of implementation including stopping rules. The complexity of the conditional
distributions is clearly an important factor. Graphs of output for scalar parameters from
the Markov chain are a visually attractive way of checking convergence, but more formal
approaches are available (Geweke, 1992). Another suggestion, due to Gelman and
Rubin (1992), is to use multiple (parallel) Gibbs samplers, each beginning with different
starting values, to see if different chains converge to the same posterior distribution.
Zellner and Min (1995) propose several convergence criteria that can be used if the
posterior can be written explicitly.
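
The multiple-chain idea can be made concrete with the potential scale reduction factor that is commonly associated with Gelman and Rubin (1992). The text does not give the formula, so the sketch below should be read as one standard version of it, stated under that assumption: it compares the variance between chain means with the average variance within chains for a scalar parameter. The function names and the toy chains are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_var(values):
    """Unbiased sample variance (divisor len - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def potential_scale_reduction(chains):
    """Scale reduction factor for a scalar parameter from equal-length chains.

    Values well above 1 suggest the chains have not yet mixed.
    """
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    within = mean([sample_var(c) for c in chains])       # W
    between = n * sample_var([mean(c) for c in chains])  # B
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return math.sqrt(pooled / within)


if __name__ == "__main__":
    agree = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
    disagree = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
    print(potential_scale_reduction(agree))
    print(potential_scale_reduction(disagree))
```

With real sampler output the chains would be much longer, and an initial burn-in segment would normally be discarded before computing the statistic.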

13.10.  Bibliographic  Notes 

There  are  several  excellent  book-length  treatments  with  emphasis  on  modern  computational 
methods  for  Bayesian  analysis,  including  those  by  Gamerman  (1997)  and  Gelman  et  al.  (1995). 
Relatively  accessible  treatments  are  provided  by  Gill  (2002),  Koop  (2003),  and  Lancaster 
(2004).  Koop  presents  Bayesian  methods  for  many  standard  nonlinear  cross-section  models  and 
for panel data. The older texts by Zellner (1971) and Leamer (1978) are still valuable sources
of  results. 

13.2 Stigler (1986) provides a good exposition of the work of Bayes (1764). Bayes first
presented some properties of probability, notably $\Pr[A\mid B] = \Pr[A \cap B]/\Pr[B]$. Bayes then
applied this result to obtaining the posterior probability $\Pr[a < \theta < b\mid y]$, where $a$ and
$b$ are specified bounds, $y$ is the number of successes in $N$ binomial trials, and $\theta$ is the
unknown probability of success in each trial. Bayes chose a uniform prior, in which case
the posterior density $f(\theta\mid y) \propto f(y\mid\theta)$. Bayes' example was challenging as he could not
accurately calculate the posterior probability, which involved the incomplete beta function, not
tabulated until the 20th century. Bayes' paper was initially neglected. A more commonly
used approach due to Laplace and others was the method of inverse probability that also let
$f(\theta\mid y) \propto f(y\mid\theta)$. These methods were supplanted by maximum likelihood, introduced by
Fisher (1922), whose paper directly critiqued Bayesian and inverse-probability methods.

The  regularity  conditions  for  convergence  to  posterior  normality  are  discussed  in  Heyde 
and  Johnstone  (1979).  Train  (2003)  provides  an  excellent  but  less  formal  treatment  of  the 
so-called  Bernstein-von  Mises  Theorem. 

13.3 Zellner (1971) and Leamer (1978) are excellent sources for Bayesian analysis of linear
regression.

13.4  Geweke  (1989)  and  Geweke  and  Keane  (2001)  are  valuable  references  on  Monte  Carlo 
integration. 

13.5 Casella and George (1992) provide an expository treatment of the Gibbs sampler.
Numerous papers by Chib and his collaborators and Geweke and his collaborators cover
many topics of interest in microeconometrics. Chib and Greenberg (1996, Section 3)
provide a number of applications of MCMC, including the seemingly unrelated regression
model and the Tobit and probit models. In the latter case they show the computational
simplification that results from combining Gibbs sampling with the data augmentation
approach. Data augmentation is used to handle latent variables that are introduced to
deal with the underlying unobservables that arise naturally in many censored and
discrete choice models. Chib (2001) provides a detailed and up-to-date survey that includes
MCMC algorithms for many leading linear and nonlinear regression models. Geweke and
Keane (2000) concentrate on the methods of integration; they cover both Bayesian and
non-Bayesian topics.


- Exercises - 

21-1 Show that if $\beta\mid\lambda \sim \mathcal{N}[\mu, \lambda^{-1}\Sigma]$, and $\lambda \sim \text{Gamma}[\alpha/2, \alpha/2]$, then the unconditional
distribution of $\beta$ is a multivariate $t$-distribution with parameters $(\mu, \Sigma, \alpha)$.

21-2 (Adapted from Chib, 1992). Consider the censored regression or Tobit model
(see Section 16.3) where $y^* = \mathbf{x}'\beta + \varepsilon$, $\varepsilon \sim \text{iid } \mathcal{N}[0, \sigma^2]$, and $y$ is observed when
$y^* > 0$ but is not observed (censored) when $y^* < 0$. There are $N_0$ censored observations
on $y$, and $y_0$ refers to them. Introduce a latent variable $z$ that corresponds
to the censored observations such that $z_i < 0$ if the $i$th observation
belongs to the censored set. The data augmentation method can be used to
draw latent variables $z_i$, a set of independent random variables distributed as
truncated normal, with support $(-\infty, 0)$ and pdf $\phi(z_i\mid\mathbf{x}_i'\beta, \sigma^2)/(1 - \Phi(\mathbf{x}_i'\beta/\sigma))$,
$-\infty < z_i < 0$, where $\phi$ and $\Phi$ are, respectively, the normal pdf and cdf. Use a
normal prior for $\beta$ and a gamma prior for $\sigma^{-2}$.

(a) Show that it is possible to specify a full set of conditionals for $z_i$, $\beta$, and $\sigma^{-2}$.

(b) Use the results of part (a) to outline the Gibbs algorithm for simulating $z_i$, $\beta$,
and $\sigma^{-2}$.

(c) Explain how suitable initial values of $\beta$ and $\sigma^{-2}$ may be obtained.

Models for Cross-Section Data


Part  4,  consisting  of  chapters  14  to  20,  covers  the  core  nonlinear  limited  dependent 
variable  models  for  cross-section  data,  defined  by  the  range  of  values  taken  by  the 
dependent  variable.  Topics  covered  include  models  for  binary,  multinomial,  duration 
and  count  data.  The  complications  of  censoring,  truncation  and  sample  selection  are 
also  studied.  The  essential  base  for  Part  4  is  least  squares  and  maximum  likelihood 
estimation. 

Chapters 14-15 cover models for binary and multinomial data that are standard in
the analysis of discrete outcomes and discrete choice. Maximum likelihood methods
are dominant. Different parameterizations for the conditional probabilities in these
models lead to different models, notably logit and probit models, which are well-established.
Recent literature has focused on less restrictive modeling with more flexible
functional forms for conditional probabilities and on accommodating individual
unobserved heterogeneity. These objectives motivate the use of semiparametric methods
and simulation-based estimation methods.

Censoring,  truncation,  or  sample  selection  generate  several  important  classes  of 
models  that  are  analyzed  in  Chapter  16.  The  long-established  Tobit  model  is  central  to 
this  literature,  but  its  estimation  and  inference  rely  on  strong  distributional  assumptions 
to  permit  consistent  estimation.  We  also  examine  the  newer  semiparametric  methods 
that  rely  on  weaker  assumptions. 

Chapters 17-19 consider duration models in which the focus is on either the determinants
of spell lengths, such as length of an unemployment spell, or on modeling
the hazard rate of transitions from one initial state to another. The analysis covers
both discrete and continuous time models, and both parametric and semiparametric
formulations, including the standard models like the exponential, the Weibull, and
the proportional hazards model. Chapter 18 covers formulation and interpretation of
richer models that incorporate unobserved heterogeneity. The relative importance of
state dependence and unobserved heterogeneity as determinants of the average length
of spell is a central issue, whose resolution raises fundamental questions about alternative
modeling approaches. Chapter 19 deals with models with several types of events
using the competing risks formulation and models of multiple spells.

Chapter 20 covers the analysis of event counts of the kind very common in health
economics. There are many strong connections and parallels between count data models
and duration models because of their common foundation in stochastic processes.
We analyze the widely-used Poisson and negative binomial regression models, together
with important variants such as the two-part or hurdle model, zero-inflated
models, latent class models, and endogenous regressor models, all of which accommodate
different facets of the event processes.

CHAPTER 14


Binary  Outcome  Models 


14.1.  Introduction 

Discrete  outcome  or  qualitative  response  models  are  models  for  a  dependent  variable 
that  indicates  in  which  one  of  m  mutually  exclusive  categories  the  outcome  of  interest 
falls.  Often  there  is  no  natural  ordering  of  the  categories.  For  example,  categorization 
may  be  on  the  occupation  of  a  worker. 

This chapter considers the simplest case of binary outcomes, where there are
two possible outcomes. Examples include whether or not an individual is employed
and whether or not a consumer makes a purchase. Binary outcomes are simple
to model and estimation is usually by maximum likelihood because the distribution
of the data is necessarily defined by the Bernoulli model. If the probability
of one outcome equals $p$, then the probability of the other outcome must be
$(1 - p)$. For regression applications the probability $p$ will vary across individuals
as a function of regressors. The two standard binary outcome models, the logit and
the probit models, specify different functional forms for this probability as a function
of regressors. The difference between these estimators is qualitatively similar
to use of different functional forms for the conditional mean in least-squares
regression.

Section 14.2 provides a data example. Section 14.3 presents a summary of
statistical results for standard models including logit and probit models. In Section 14.4
binary outcome models are presented as arising from an underlying
latent variable. This formulation is useful as it extends readily to multinomial
models (see Chapter 15) and models for censored and selected samples (see
Chapter 16). Section 14.5 details necessary modifications to standard estimation
methods when one of the outcomes is deliberately oversampled. Aggregation issues
are considered in Section 14.6. Semiparametric methods for binary outcome
models that place less structure on the model for the probability $p$ are given in
Section 14.7.

14.2. Binary Outcome Example: Fishing Mode Choice


This  section  models  choice  between  fishing  from  a  charter  boat  and  fishing  from  a  pier. 
The dependent variable is binary with

$$y_i = \begin{cases} 1 & \text{if fishing from a charter boat,} \\ 0 & \text{if fishing from a pier,} \end{cases}$$

where the values 1 and 0 are chosen for simplicity. The single explanatory variable is
$x_i = \ln \text{relp}_i$, where relp denotes the price of charter fishing relative to the
price of fishing from the pier, so

$$x_i = \ln \text{relp}_i = \ln(\text{price}_{\text{charter},i}/\text{price}_{\text{pier},i}).$$


The  prices  of  charter  boat  and  pier  fishing  vary  across  individuals  owing  to  various 
factors,  for  example,  to  differences  in  access.  It  is  expected  that  the  probability  of 
charter  boat  fishing  decreases  as  its  relative  price  increases. 

The  data  are  summarized  in  Table  14.1.  The  sample  of  630  individuals  is  a  subset 
of  the  data  described  in  greater  detail  in  Section  15.2,  where  four  different  modes  of 
fishing  and  additional  regressors  are  considered.  Charter  boat  fishing  was  selected  by 
71.7%  of  the  sample.  For  people  choosing  to  fish  from  the  charter  boat,  the  charter  boat 
was  on  average  less  expensive  than  pier  fishing,  as  $75  <  $121.  For  people  choosing 
to  fish  from  the  pier  the  reverse  was  true.  So  it  appears  that  price  has  the  expected 
effect. 

An OLS regression of $y_i$ on $x_i$ ignores the discreteness of the dependent variable
and does not constrain predicted probabilities to be between zero and one.

A  more  appropriate  model  is  the  logit  model  (see  Section  14.3.4),  which  specifies 


$$p_i = \Pr[y_i = 1\mid x_i] = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)},$$

and clearly ensures that $0 < p_i < 1$. Maximum likelihood estimation (see Section 14.3.3)
leads to parameter estimates given in the first column of Table 14.2. The
implied marginal effect for the logit model equals

$$\frac{dp_i}{dx_i} = \frac{\exp(\beta_1 + \beta_2 x_i)}{(1 + \exp(\beta_1 + \beta_2 x_i))^2}\,\beta_2.$$


Table 14.1. Fishing Mode Choice: Data Summary

                            Subsample Averages
Variable               y = 1 (Charter)   y = 0 (Pier)   All y (Overall)
Price charter ($)             75              110              85
Price pier ($)               121               31              95
ln relp                    -0.264            1.643           0.275
Sample probability          0.717            0.283           1.000
Observations                  452              178             630

Table 14.2. Fishing Mode Choice: Logit and Probit Estimates (a)

                              Model
Regressor         Logit        Probit         OLS
Constant          2.053         1.194        0.784
                (12.15)       (13.34)      (65.58)
ln relp          -1.823        -1.056       -0.243
               (-12.61)      (-13.87)     (-28.15)
ln L            -206.83       -204.41          -
Pseudo R^2        0.449         0.455        0.463

(a) Dependent variable y = 1 if charter boat fishing and y = 0 if pier fishing. Regressor
x = ln relp, the natural logarithm of the price of charter boat fishing relative to pier
fishing. Intercept and slope parameter estimates with t-statistics in parentheses are
from ML estimation of logit and probit models and from OLS estimation.


Since $\hat{\beta}_{2,\text{logit}} < 0$ it follows that $dp_i/dx_i < 0$, as expected. The actual magnitude of
the marginal effect varies with the point of evaluation $x_i$ (see Section 14.3.2). An
approximation for the logit model, though not other models, is that
$dp_i/dx_i \simeq \bar{y}(1 - \bar{y})\hat{\beta}_2 = -0.370$. An OLS regression instead provides a direct estimate of $-0.243$.
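
The arithmetic behind these numbers can be reproduced directly from the tables. The sketch below uses the logit estimates from Table 14.2 and the sample share from Table 14.1; the function names are hypothetical, and the evaluation point 0.275 is simply the overall sample mean of ln relp from Table 14.1.

```python
import math

# Logit estimates from Table 14.2 and the sample share of y = 1 from Table 14.1.
B1_LOGIT, B2_LOGIT = 2.053, -1.823
Y_BAR = 0.717


def logit_prob(x):
    """Predicted probability of charter boat fishing at log relative price x."""
    return 1.0 / (1.0 + math.exp(-(B1_LOGIT + B2_LOGIT * x)))


def logit_marginal_effect(x):
    """dp/dx = p (1 - p) beta_2, evaluated at x."""
    p = logit_prob(x)
    return p * (1.0 - p) * B2_LOGIT


def approx_marginal_effect():
    """The approximation ybar (1 - ybar) beta_2; rounds to -0.370."""
    return Y_BAR * (1.0 - Y_BAR) * B2_LOGIT


if __name__ == "__main__":
    print(round(approx_marginal_effect(), 3))
    print(logit_marginal_effect(0.275))  # effect at the sample mean of ln relp
```

The effect at a particular value of the regressor generally differs from the approximation evaluated at the sample share, which is the point made about the marginal effect varying with the point of evaluation.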

An alternative model is the probit model (see Section 14.3.5), which specifies

$$p_i = \Pr[y_i = 1\mid x_i] = \Phi(\beta_1 + \beta_2 x_i),$$

where $\Phi(\cdot)$ is the cumulative distribution function for the standard normal, so
$p_i = \int_{-\infty}^{\beta_1 + \beta_2 x_i} (2\pi)^{-1/2} e^{-z^2/2}\,dz$. The ML coefficients are given in the second column of
Table 14.2 and differ appreciably from the logit coefficients. Since different specifications
are being estimated the coefficients are not comparable. This is similar to our inability
to compare coefficients in models with conditional mean $\mathbf{x}'\beta$ and $\exp(\mathbf{x}'\beta)$. For
the probit model $dp_i/dx_i = \phi(\beta_1 + \beta_2 x_i)\beta_2$, where $\phi(\cdot)$ is the density for the standard
normal. So again $dp_i/dx_i < 0$ since $\hat{\beta}_{2,\text{probit}} < 0$.

Although the slope coefficients necessarily differ across the models, from Table 14.2
the $t$-statistics are similar and are all very high. The log-likelihood for
the probit model is 2.42 higher than that for the logit, favoring the probit model
since both models use the same number of parameters. In many other examples there
is little difference in $\ln L$ across the models. The predicted probabilities from the
three models are plotted as a function of $x$ in Figure 14.1. In OLS we assume that
$\Pr[y_i = 1\mid x_i] = \beta_1 + \beta_2 x_i$ is linear in $x_i$, whereas the nonlinear functions for logit and
probit are essentially equivalent.


14.3.  Logit  and  Probit  Models 

We now provide more formal theory for these models. We present binary outcomes
as a direct extension of the coin-toss example of introductory statistics to situations
where the probability of success is modeled to depend on regressors. Two commonly
used parameterizations lead to the logit and probit models. Motivation for these
parameterizations, using latent variables, is deferred to Section 14.4.

Figure 14.1: Charter boat fishing: predicted probability from logit and probit models and
OLS prediction when the single regressor is the natural logarithm of relative price. Actual
outcomes of 1 or 0 are also plotted after jittering for readability. Data for 630 individuals.
[Plot title: Predicted Probabilities Across Models. Horizontal axis: log relative price (lnrelp).]


14.3.1.  General  Binary  Outcome  Model 
For binary outcome data the dependent variable $y$ takes one of two values. We let

$$y = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases}$$


There  is  no  loss  of  generality  in  setting  the  values  to  1  and  0  if  all  that  is  being  modeled 
is  p,  which  determines  the  probability  of  the  outcome.  In  introductory  statistics  this 
model  describes  the  outcome  of  a  coin  toss  where  heads  leads  to  y  =  1  and  occurs 
with  probability  p. 

A regression model is formed by parameterizing the probability $p$ to depend on a
regressor vector $\mathbf{x}$ and a $K \times 1$ parameter vector $\beta$. The commonly used models are
of single-index form with conditional probability given by

$$p_i = \Pr[y_i = 1\mid\mathbf{x}_i] = F(\mathbf{x}_i'\beta), \tag{14.1}$$

where $F(\cdot)$ is a specified function. To ensure that $0 \le p \le 1$ it is natural to specify
$F(\cdot)$ to be a cumulative distribution function.

Table 14.3 presents the most commonly used binary outcome models. The logit
model arises if $F(\cdot)$ is the cdf of the logistic distribution and the probit model arises
if $F(\cdot)$ is the standard normal cdf. Note that if $F(\cdot)$ is a cdf, then this cdf is only
being used to model the parameter $p$ and does not denote the cdf of $y$ itself. The
less-used complementary log-log model arises if $F(\cdot)$ is the cdf of the extreme value
distribution. It differs from the other models in being asymmetric around zero and is
used when one of the outcomes is rare. The linear probability model does not use a
cdf and instead lets $p_i = \mathbf{x}_i'\beta$.
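
Table 14.3. Binary Outcome Data: Commonly Used Models

Model                     Probability ($p = \Pr[y = 1\mid\mathbf{x}]$)                                        Marginal Effect ($\partial p/\partial x_j$)
Logit                     $\Lambda(\mathbf{x}'\beta) = e^{\mathbf{x}'\beta}/(1 + e^{\mathbf{x}'\beta})$       $\Lambda(\mathbf{x}'\beta)[1 - \Lambda(\mathbf{x}'\beta)]\beta_j$
Probit                    $\Phi(\mathbf{x}'\beta) = \int_{-\infty}^{\mathbf{x}'\beta} \phi(z)\,dz$            $\phi(\mathbf{x}'\beta)\beta_j$
Complementary log-log     $C(\mathbf{x}'\beta) = 1 - \exp(-\exp(\mathbf{x}'\beta))$                           $\exp(-\exp(\mathbf{x}'\beta))\exp(\mathbf{x}'\beta)\beta_j$
Linear probability        $\mathbf{x}'\beta$                                                                  $\beta_j$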

14.3.2.  Marginal  Effects 

Interest lies in determining the marginal effect of change in a regressor on the conditional
probability that $y = 1$. For general probability model (14.1) and change in the
$j$th regressor, assumed to be continuous, this is

$$\frac{\partial \Pr[y_i = 1\mid\mathbf{x}_i]}{\partial x_{ij}} = F'(\mathbf{x}_i'\beta)\beta_j, \tag{14.2}$$

where $F'(z) = \partial F(z)/\partial z$. The marginal effects differ with the point of evaluation $\mathbf{x}_i$,
as for any nonlinear model, and differ with different choices of $F(\cdot)$. The last column
of Table 14.3 gives the marginal effects for the common binary outcome models.
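
The entries of Table 14.3 translate directly into code. The following sketch defines each model's $F(\cdot)$ and $F'(\cdot)$ as functions of the index $\mathbf{x}'\beta$ and checks each derivative against a central finite difference. The dictionary layout and function names are illustrative choices, not part of the text.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Each entry maps a model name to (F, F'), both functions of the index z = x'beta.
MODELS = {
    "logit": (logistic_cdf, lambda z: logistic_cdf(z) * (1.0 - logistic_cdf(z))),
    "probit": (norm_cdf, norm_pdf),
    "cloglog": (lambda z: 1.0 - math.exp(-math.exp(z)),
                lambda z: math.exp(-math.exp(z)) * math.exp(z)),
    "linear": (lambda z: z, lambda z: 1.0),
}


def marginal_effect(model, index, beta_j):
    """Marginal effect F'(x'beta) * beta_j from (14.2)."""
    _, d_f = MODELS[model]
    return d_f(index) * beta_j


def numerical_derivative(f, z, h=1e-6):
    """Central finite difference, used only to check the analytic F'."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


if __name__ == "__main__":
    for name, (f, d_f) in MODELS.items():
        print(name, d_f(0.3), numerical_derivative(f, 0.3))
```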

Marginal effects for nonlinear models are discussed in Section 5.2.4. Given a specific
model there are several ways to compute an average marginal effect. It is best to
use $N^{-1}\sum_i F'(\mathbf{x}_i'\hat{\beta})\hat{\beta}_j$, the sample average of the marginal effects. Some programs
instead evaluate at the sample average of the regressors, $F'(\bar{\mathbf{x}}'\hat{\beta})\hat{\beta}_j$. An easily constructed
measure evaluates at $\bar{y}$, the sample average of $y$, so that $F(\mathbf{x}'\hat{\beta}) = \bar{y}$ and
$F'(\mathbf{x}'\hat{\beta}) = F'(F^{-1}(\bar{y}))$. This is especially simple for the logit model as then this yields
estimated marginal effect $\bar{y}(1 - \bar{y})\hat{\beta}_j$. Further discussion for specific models is given
in Sections 14.3.4-14.3.7.
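
The three summary measures can be compared side by side for a logit model with an intercept and one regressor. In the sketch below the regressor values are hypothetical, the coefficients are the Table 14.2 logit estimates used only as convenient inputs, and the average predicted probability stands in for $\bar{y}$ (the two coincide for a logit MLE with an intercept, which is assumed here rather than estimated).

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit_effect_measures(x, b1, b2):
    """Return (average of effects, effect at mean x, effect at ybar) for a logit."""
    n = len(x)
    probs = [logistic_cdf(b1 + b2 * xi) for xi in x]
    # 1. Sample average of the individual marginal effects.
    avg_of_effects = sum(p * (1.0 - p) for p in probs) / n * b2
    # 2. Marginal effect evaluated at the sample average of the regressor.
    p_at_mean_x = logistic_cdf(b1 + b2 * sum(x) / n)
    effect_at_mean_x = p_at_mean_x * (1.0 - p_at_mean_x) * b2
    # 3. Marginal effect evaluated at ybar; assumption: ybar is replaced by the
    #    average predicted probability.
    y_bar = sum(probs) / n
    effect_at_ybar = y_bar * (1.0 - y_bar) * b2
    return avg_of_effects, effect_at_mean_x, effect_at_ybar


if __name__ == "__main__":
    x_demo = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical regressor values
    print(logit_effect_measures(x_demo, 2.053, -1.823))
```

Because $p(1 - p)$ is concave in $p$, the average of the individual effects is never larger in absolute value than the effect evaluated at $\bar{y}$, so the easily constructed measure tends to overstate the average effect when predicted probabilities are dispersed.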

Many studies instead report only the regression coefficients. The standard binary
outcome models are single-index models, so the ratio of coefficients for two different
regressors equals the ratio of the marginal effects. The sign of the coefficient gives
the sign of the marginal effect, since $F'(\cdot) > 0$. The coefficients can be used to obtain
an upper bound on the marginal effects. For the logit model $\partial p/\partial x_j \le 0.25\beta_j$ since
$\Lambda(\mathbf{x}'\beta)(1 - \Lambda(\mathbf{x}'\beta)) \le 0.25$, with maximum when $\Lambda(\mathbf{x}'\beta) = 0.5$ and $\mathbf{x}'\beta = 0$. For the
probit model $\partial p/\partial x_j \le 0.4\beta_j$, since $\phi(\mathbf{x}'\beta) \le 1/\sqrt{2\pi} \simeq 0.4$, with maximum when
$\Phi(\mathbf{x}'\beta) = 0.5$ and $\mathbf{x}'\beta = 0$.


14.3.3. ML Estimation

We consider estimation given a sample $(y_i, \mathbf{x}_i)$, $i = 1, \ldots, N$, where we assume independence
over $i$. Results are given for $p_i$ defined in (14.1), with specialization to logit
and probit specifications given later.

MLE for General Binary Outcome Models

The outcome is Bernoulli distributed, the binomial distribution with just one trial. A
very convenient compact notation for the density of $y_i$, or more formally its probability
mass function, is

$$f(y_i\mid\mathbf{x}_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}, \quad y_i = 0, 1, \tag{14.3}$$

where $p_i = F(\mathbf{x}_i'\beta)$. This yields probabilities $p_i$ and $(1 - p_i)$ since
$f(1) = p^1(1 - p)^0 = p$ and $f(0) = p^0(1 - p)^1 = 1 - p$.

The density (14.3) implies log density $\ln f(y_i) = y_i \ln p_i + (1 - y_i)\ln(1 - p_i)$.
Given independence over $i$ and model (14.1) for $p_i$, the log-likelihood function is

$$\mathcal{L}_N(\beta) = \sum_{i=1}^{N} \left\{ y_i \ln F(\mathbf{x}_i'\beta) + (1 - y_i)\ln(1 - F(\mathbf{x}_i'\beta)) \right\}. \tag{14.4}$$

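Differentiating with respect to $\beta$, we have that the MLE $\hat{\beta}_{\mathrm{ML}}$ solves

$$\sum_{i=1}^{N} \left\{ \frac{y_i}{F_i} F_i' \mathbf{x}_i - \frac{1 - y_i}{1 - F_i} F_i' \mathbf{x}_i \right\} = \mathbf{0},$$

where $F_i = F(\mathbf{x}_i'\beta)$, $F_i' = F'(\mathbf{x}_i'\beta)$, and $F'(z) = \partial F(z)/\partial z$. Converting to fractions
with common denominator $F_i(1 - F_i)$ and simplifying yields the ML first-order conditions

$$\sum_{i=1}^{N} \frac{y_i - F(\mathbf{x}_i'\beta)}{F(\mathbf{x}_i'\beta)(1 - F(\mathbf{x}_i'\beta))}\, F'(\mathbf{x}_i'\beta)\,\mathbf{x}_i = \mathbf{0}. \tag{14.5}$$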

There is no explicit solution for $\hat{\beta}_{\mathrm{ML}}$, but the Newton-Raphson iterative procedure
usually converges very quickly since for the probit and logit models, at least, the
log-likelihood is globally concave.
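
A minimal sketch of the Newton-Raphson iteration is given below for the logit case with an intercept and a single regressor. For the logit, $F' = F(1 - F)$, so the weights in (14.5) cancel and the score is simply the sum of residuals times regressors. The data are hypothetical values invented for illustration, the function names are not from the text, and the 2 x 2 linear system is solved by hand to keep the example free of external libraries.

```python
import math

# Hypothetical data, invented for illustration (the outcomes overlap in x,
# so the MLE exists).
X_DEMO = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
Y_DEMO = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit_score(b0, b1, x, y):
    """Logit first-order conditions: sum of (y_i - p_i) * (1, x_i)."""
    resid = [yi - logistic_cdf(b0 + b1 * xi) for xi, yi in zip(x, y)]
    return sum(resid), sum(r * xi for r, xi in zip(resid, x))


def logit_newton(x, y, tol=1e-12, max_iter=50):
    """Newton-Raphson for a logit with intercept and one regressor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0, g1 = logit_score(b0, b1, x, y)
        # Minus the Hessian: sum of p_i (1 - p_i) * (1, x_i)(1, x_i)'.
        w = []
        for xi in x:
            p = logistic_cdf(b0 + b1 * xi)
            w.append(p * (1.0 - p))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (minus Hessian) * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    b0_hat, b1_hat = logit_newton(X_DEMO, Y_DEMO)
    print(b0_hat, b1_hat, logit_score(b0_hat, b1_hat, X_DEMO, Y_DEMO))
```

At the solution both elements of the score are zero up to rounding, which is exactly the statement of the first-order conditions.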


Consistency  of  the  MLE 

The MLE is consistent if the conditional density of $y$ given $\mathbf{x}$ is correctly specified.
Since the density here must be the Bernoulli, the only possible misspecification is that
the Bernoulli probability is misspecified. So the MLE is consistent if $p_i = F(\mathbf{x}_i'\beta)$ and
is inconsistent otherwise.

More formally, note that for binary data, $\mathrm{E}[y] = 1 \times p + 0 \times (1 - p) = p$. Given
(14.1) this implies

$$\mathrm{E}[y_i\mid\mathbf{x}_i] = F(\mathbf{x}_i'\beta), \tag{14.6}$$

which in turn implies that the left-hand side of the first-order equations (14.5) has
expected value zero, the essential condition for consistency. This special result of
consistency provided the conditional mean is correctly specified holds for LEF densities
(see Section 5.7.3) and the Bernoulli is an LEF density.
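
Distribution of the MLE

Given correct specification of the density,
$\hat{\beta}_{\mathrm{ML}} \stackrel{a}{\sim} \mathcal{N}[\beta, (-\mathrm{E}[\partial^2\mathcal{L}_N/\partial\beta\,\partial\beta'])^{-1}]$ (see
Section 5.6.4). Differentiating (14.4) with respect to $\beta$ and $\beta'$, and taking minus the expected
value yields the estimated asymptotic variance matrix

$$\hat{\mathrm{V}}[\hat{\beta}_{\mathrm{ML}}] = \left( \sum_{i=1}^{N} \frac{1}{F(\mathbf{x}_i'\hat{\beta})(1 - F(\mathbf{x}_i'\hat{\beta}))}\, F'(\mathbf{x}_i'\hat{\beta})^2\, \mathbf{x}_i\mathbf{x}_i' \right)^{-1}, \tag{14.7}$$

where simplification occurs because $\mathrm{E}[y_i - F(\mathbf{x}_i'\beta)] = 0$. This variance matrix is of
the simple form $(\sum_i \hat{w}_i\mathbf{x}_i\mathbf{x}_i')^{-1}$, where the weights $\hat{w}_i$ are given in (14.7).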

Since consistency requires only correct specification of the conditional mean or
probability, it is natural to consider the quasi-MLE (see Section 5.7) and base inference
on the sandwich form of the variance matrix $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$ rather than $-\mathbf{A}^{-1}$ used in
(14.7). Here

$$\mathrm{V}[y_i\mid\mathbf{x}_i] = F(\mathbf{x}_i'\beta)(1 - F(\mathbf{x}_i'\beta)), \tag{14.8}$$

since $\mathrm{V}[y] = (1 - p)^2 \times p + (0 - p)^2 \times (1 - p) = p(1 - p)$. Some algebra shows
that this implies that $\mathbf{A} = -\mathbf{B}$ and hence $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} = -\mathbf{A}^{-1}$, assuming independence
over $i$. The only way that (14.8) does not hold is if $p \neq F(\mathbf{x}'\beta)$, in which case the MLE
would suffer from the more fundamental problem of inconsistency.

Binary outcome models are unusual in that there is no advantage in using the sandwich
form if data are independent over $i$. The only reason for moving to a robust variance
matrix estimate is if observations are correlated over $i$ as the result of clustering.
Then the robust estimate needs to be one that is robust to clustering (see Section 24.5)
rather than to misspecification of the conditional variance.


14.3.4.  Logit  Model 

The logit model or logistic regression model specifies

$$p = \Lambda(\mathbf{x}'\beta) = \frac{e^{\mathbf{x}'\beta}}{1 + e^{\mathbf{x}'\beta}}, \tag{14.9}$$

where $\Lambda(\cdot)$ is the logistic cdf (see Section 14.4.1 for further details), with
$\Lambda(z) = e^z/(1 + e^z) = 1/(1 + e^{-z})$.

The logit MLE first-order conditions (14.5) simplify to

$$\sum_{i=1}^{N} (y_i - \Lambda(\mathbf{x}_i'\beta))\mathbf{x}_i = \mathbf{0}, \tag{14.10}$$

since $\Lambda'(z) = \Lambda(z)[1 - \Lambda(z)]$. So the raw residual $y_i - \Lambda(\mathbf{x}_i'\hat{\beta})$ is orthogonal to the
regressors, similar to OLS regression. This simple form arises because $\Lambda(\cdot)$ is the
canonical link function (see Section 5.7.4) for the Bernoulli density.

If the regressors $\mathbf{x}_i$ include an intercept, then (14.10) implies that
$\sum_i (y_i - \Lambda(\mathbf{x}_i'\hat{\beta})) = 0$, so the logit residuals sum to zero. This implies that the average in-sample
predicted probability $N^{-1}\sum_i \Lambda(\mathbf{x}_i'\hat{\beta})$ necessarily equals the sample frequency $\bar{y}$.

The marginal effects for the logit model can be fairly easily obtained from the
coefficients, since $\partial p_i/\partial x_{ij} = p_i(1 - p_i)\beta_j$, where $p_i = \Lambda_i = \Lambda(\mathbf{x}_i'\beta)$. Evaluating at
$p_i = \bar{y}$ yields a crude estimated marginal effect of $\bar{y}(1 - \bar{y})\hat{\beta}_j$. For $0.3 < p_i < 0.7$,
for example, $\partial p_i/\partial x_{ij}$ lies between $0.21\beta_j$ and $0.25\beta_j$. For data where $p_i \simeq 0$, in
which case most outcomes are zero, $\partial p_i/\partial x_{ij} \simeq p_i\beta_j$, so $\beta_j$ gives the proportionate
effect on the probability that $y_i = 1$ as $x_{ij}$ changes.

In the statistics literature a very common interpretation of the coefficients is in terms
of marginal effects on the odds ratio rather than on the probability. For the logit model

$$\begin{aligned} p &= \exp(\mathbf{x}'\beta)/(1 + \exp(\mathbf{x}'\beta)) \\ \Rightarrow\quad \frac{p}{1 - p} &= \exp(\mathbf{x}'\beta) \\ \Rightarrow\quad \ln\frac{p}{1 - p} &= \mathbf{x}'\beta. \end{aligned} \tag{14.11}$$

Here $p/(1 - p)$ measures the probability that $y = 1$ relative to the probability that
$y = 0$ and is called the odds ratio or relative risk. For example, consider a pharmaceutical
drug study where $y = 1$ denotes survival and $y = 0$ denotes death and
regressors include a measure of drug intake. An odds ratio of 2 means that the odds of
survival are twice those of death. For the logit model the log-odds ratio is linear in the
regressors.

Statistical analyses and packages use the second equality in (14.11). Suppose the
$j$th regressor increases by one unit. Then $\exp(\mathbf{x}'\beta)$ increases to
$\exp(\mathbf{x}'\beta + \beta_j) = \exp(\mathbf{x}'\beta) \times \exp(\beta_j)$. It follows from (14.11) that the odds ratio has increased by a
multiple $\exp(\beta_j)$. Thus a logit model slope parameter of 0.1, for example, means that a
one unit increase in the regressor multiplies the initial odds ratio by $\exp(0.1) \simeq 1.105$.
This is a proportionate increase of 0.105 times the initial odds ratio, so the relative
probability of survival increases by 10.5%. This interpretation of the logit model is
widely used in biostatistics applications.
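
The multiplicative effect on the odds can be verified numerically. The sketch below uses the slope of 0.1 from the example above; the starting value of the index $\mathbf{x}'\beta$ is hypothetical, and the result does not depend on it.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds p / (1 - p) of y = 1 against y = 0."""
    return p / (1.0 - p)


def odds_multiplier(index, beta_j):
    """Ratio of the odds after and before a one-unit increase in regressor j."""
    return odds(logistic_cdf(index + beta_j)) / odds(logistic_cdf(index))


if __name__ == "__main__":
    beta_j = 0.1
    for index in (-1.0, 0.4, 2.0):  # hypothetical starting values of x'beta
        print(index, odds_multiplier(index, beta_j), math.exp(beta_j))
```

The multiplier is the same at every starting index, whereas the change in the probability itself is not, which is one reason the odds interpretation is convenient for reporting.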

For economists it is more natural to interpret either the second or third equality in
(14.11) as implying that $\beta_j$ is a semi-elasticity. Then, taking a calculus approach, we
interpret a logit model slope parameter of 0.1 as meaning that a one-unit increase in
the regressor increases the odds ratio by a multiple 0.1. This approximately coincides with the
interpretation used in statistics for very small $\beta_j$, since then $\exp(\beta_j) - 1 \simeq \beta_j$.


14.3.5.  Probit  Model 
The  probit  model  specifies  the  conditional  probability 

rx'P 

p  =  q>(x73)=  /  0(z)dz,  (14.12) 

J  —  oo 

where  O(-)  is  the  standard  normal  cdf,  with  derivative  ([>(")  =  (l/x/2jr)exp(— z2/2), 
which  is  the  standard  normal  density  function. 

The probit MLE first-order conditions are that

$$ \sum_{i=1}^{N} w_i\,(y_i - \Phi(x_i'\beta))\,x_i = 0, $$

where, unlike the logit model, the weight $w_i = \phi(x_i'\beta)/[\Phi(x_i'\beta)(1 - \Phi(x_i'\beta))]$ varies
across observations.

The probit model marginal effects are $\partial p_i/\partial x_{ij} = \phi(x_i'\beta)\beta_j = \phi(\Phi^{-1}(p_i))\beta_j$,
where $p_i = \Phi(x_i'\beta)$. There are no further simplifications similar to those for the logit
model, though $\partial p_i/\partial x_{ij} \le 0.40\beta_j$ since $\phi(z) \le \phi(0) = 1/\sqrt{2\pi} \simeq 0.399$.
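
The bound on the probit marginal effect, and the corresponding bound of 0.25 for the logit
model, can be verified on a grid of index values. The sketch below is illustrative only; the
function names are hypothetical, and the slope is set to one so that each result is the
multiplier applied to $\beta_j$.

```python
import math


def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logit_cdf(z):
    """Logistic cdf Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_marginal_effect(index, beta_j):
    """dp/dx_j = phi(x'beta) * beta_j for the probit model."""
    return norm_pdf(index) * beta_j


def logit_marginal_effect(index, beta_j):
    """dp/dx_j = Lambda(x'beta) * (1 - Lambda(x'beta)) * beta_j for the logit model."""
    p = logit_cdf(index)
    return p * (1.0 - p) * beta_j


# Grid of index values from -4 to 4 in steps of 0.01 (includes 0 exactly).
grid = [i / 100 for i in range(-400, 401)]
max_probit = max(probit_marginal_effect(z, 1.0) for z in grid)
max_logit = max(logit_marginal_effect(z, 1.0) for z in grid)

print(round(max_probit, 4), round(max_logit, 4))
```

Both maxima occur at an index of zero, that is, at $p = 0.5$, and the marginal effects
shrink toward zero in the tails.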

The  probit  model  is  not  as  simple  as  the  logit  model.  It  is  nevertheless  widely  used 
as  it  is  the  natural  model  if  the  starting  point  is  a  latent  normal  regression  model  (see 
Section  14.4). 


14.3.6.  OLS  Estimation 

A simple alternative to either logit or probit is OLS regression of $y$ on $x$. This has
the obvious deficiency that it is possible to obtain predicted probabilities $x'\hat{\beta}$ that are
negative or that exceed one.

The OLS estimator is nonetheless useful as an exploratory tool. In practice it provides
a reasonable direct estimate of the sample-average marginal effect on the probability
that $y = 1$ as $x$ changes, even though it provides a poor model for individual
probabilities. It also provides a good guide to which variables are statistically
significant. In many applications it turns out that $0 < x_i'\hat{\beta} < 1$ for all sample
observations, in which case OLS is more reasonable.

If the OLS estimator is used then standard errors should correct for heteroskedasticity.
Linear regression is justified if the probability $p_i = x_i'\beta$. Then $y_i|x_i$ has mean
$x_i'\beta$ and heteroskedastic variance $x_i'\beta(1 - x_i'\beta)$ that varies with $x_i$.

In theory more efficient ML estimation is possible if $p_i = x_i'\beta$. From (14.5) the ML
first-order conditions are $\sum_i x_i(y_i - x_i'\beta)/[x_i'\beta(1 - x_i'\beta)] = 0$. The estimator can be
numerically unstable as it places very high weight on observations with $x_i'\beta$ close
to 0 or 1. Moreover, the efficiency gains compared to OLS are often small.

Although OLS estimation with heteroskedasticity-robust standard errors can be a useful
exploratory data analysis tool, it is best to use the logit or probit MLE for final data
analysis.
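
As an illustration of the exploratory use of OLS, the following sketch fits a linear
probability model with an intercept and a single regressor and computes a
heteroskedasticity-robust (White, or HC0) standard error for the slope. Everything here is
an assumption made for the example: the simulated dgp $p_i = 0.2 + 0.5x_i$, the sample size,
the seed, and the function name `lpm_simple`.

```python
import random


def lpm_simple(x, y):
    """OLS of y on an intercept and one regressor, with an HC0 slope standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # HC0 variance of the slope: sum of (x_i - x_bar)^2 * e_i^2, divided by sxx^2.
    robust_var = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid)) / sxx ** 2
    return intercept, slope, robust_var ** 0.5


# Simulated data under a hypothetical linear probability dgp with p between 0.2 and 0.7.
rng = random.Random(0)
n_obs = 5000
x = [rng.uniform(0.0, 1.0) for _ in range(n_obs)]
y = [1 if rng.random() < 0.2 + 0.5 * xi else 0 for xi in x]

intercept_hat, slope_hat, slope_se = lpm_simple(x, y)
fitted = [intercept_hat + slope_hat * xi for xi in x]
share_outside_unit_interval = sum(1 for f in fitted if f < 0.0 or f > 1.0) / n_obs

print(round(slope_hat, 3), round(slope_se, 3), share_outside_unit_interval)
```

Checking the share of fitted values outside the unit interval is a quick way to judge
whether the linear model is a tolerable approximation in a given sample.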


14.3.7.  Choosing  a  Binary  Model 

Which  model  should  be  used  -  logit  or  probit?  This  question  is  explored  in  this  section. 


Theoretical  Considerations 

Theoretically the answer depends on the dgp, which is unknown. Unlike other applications
of ML there is no problem in specifying the distribution - the only possible
distribution for a (0, 1) variable is the Bernoulli. The problem lies in specifying a
functional form for the parameter of this distribution. If the dgp has $p = \Lambda(x'\beta)$ then
a logit model should be used, and estimators based on other models such as probit
are potentially inconsistent. Similar qualitative conclusions hold if instead the dgp has
$p = \Phi(x'\beta)$, in which case the probit model should be used. It is very unlikely that
$p = x'\beta$ since then $p$ is not restricted to be between 0 and 1.

The theoretical consequences of model misspecification, however, are not as great
as this suggests. If the regressors are distributed such that the mean of each regressor,
conditional on the linear combination $x'\beta$, is linear in $x'\beta$, then choosing the wrong function
$F$ can be shown to affect all slope parameters equally so that the ratio of slope
parameters is constant across models; see Ruud (1983). This condition is satisfied by the
family of spherical distributions, including the multivariate normal.

The  logit  model  has  a  relatively  simple  form  for  the  first-order  conditions  and 
asymptotic  distribution.  Berkson  (1951),  who  popularized  the  logit  model,  gave  this 
as  one  of  several  reasons  for  preferring  the  logit  model  to  the  original  probit  model. 
Within the framework of generalized linear models, which are widely used in biostatistics,
the logit model is the natural model as it corresponds to use of the canonical link
for  the  binomial  distribution.  The  interpretation  of  coefficients  in  terms  of  the  log-odds 
ratio  is  also  an  attraction  of  the  logit  model. 

Yet  another  motivation  for  the  logit  model  is  discriminant  analysis.  In  discriminant 
analysis  both  y  and  x  are  random  variables;  x  is  observed  but  y  is  not  observed.  Given 
x  we  need  to  determine  whether  y  equals  zero  or  one.  A  classic  example  is  classifying 
what type of humanoid (y = 0 or 1) a skull belongs to given various dimensions (x) of
the skull. If the conditional distributions of the characteristics x given y are multivariate
normally distributed, the posterior probability of y given x is similar to the probability in
the  logit  model.  For  more  details,  see  Amemiya  (1981,  pp.  1507-1510)  and  Maddala 
(1983,  pp.  17-21). 

The probit model, in contrast, has the attraction of being motivated by a latent normal
random variable (see Section 14.4) and extends naturally to Tobit models (see
Chapter 16). For these reasons many economists use the probit model.


Empirical  Considerations 

Empirically, either logit or probit can be used. There is often little difference between
the predicted probabilities from probit and logit models. The difference is greatest
in the tails where probabilities are close to 0 or 1. The difference is much less if
interest lies only in marginal effects averaged over the sample rather than for each
individual.

The natural metric to use to compare models is the fitted log-likelihood, since there
is agreement that the log-likelihood is correct, given the model for $p_i$, and the logit and
probit models have the same number of parameters. Thus for each model compute

$$ \mathcal{L}_N(\hat{\beta}) = \sum_i \{ y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i) \}, $$

where $\hat{p}_i = \Lambda(x_i'\hat{\beta}_{Logit})$ or $\hat{p}_i = \Phi(x_i'\hat{\beta}_{Probit})$. Often the fitted log-likelihoods are very
similar for the two models, again suggesting little additional gain to using one rather
than the other model. For more formal nonnested model tests see Pesaran and Pesaran
(1995) and Section 8.5.

The different models do yield quite different estimates $\hat{\beta}$ of regression parameters.
However, this is just an artifact of using different formulas for the probabilities. It is
more meaningful to compare the marginal effect across models, as this measure is
scaled similarly across the three models. From Section 14.2.3, $\partial p/\partial x_j \le 0.25\beta_j$ for
logit, $\partial p/\partial x_j \le 0.4\beta_j$ for probit, and $\partial p/\partial x_j = \beta_j$ for OLS. This suggests the rule
of thumb

$$ \hat{\beta}_{Logit} \simeq 4\hat{\beta}_{OLS}, \qquad (14.13) $$
$$ \hat{\beta}_{Probit} \simeq 2.5\hat{\beta}_{OLS}, $$
$$ \hat{\beta}_{Logit} \simeq 1.6\hat{\beta}_{Probit}. $$

Amemiya (1981, p. 1488) demonstrates that these comparisons work quite well for
slope parameters if $0.1 < p < 0.9$. Greater departures across the models occur in
the tails. For logit an alternative method, based on (14.18) given later, uses
$\hat{\beta}_{Logit} \simeq (\pi/\sqrt{3})\hat{\beta}_{Probit}$.
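
The two rescalings of probit coefficients can be compared by asking how closely the
rescaled logistic cdf tracks the standard normal cdf. The sketch below is illustrative only:
the grid of index values from -3 to 3 is an arbitrary choice and the helper names are
hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logit_cdf(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))


# Ratio of maximum marginal-effect multipliers: phi(0) for probit against 0.25 for logit.
slope_ratio = norm_pdf(0.0) / 0.25

# Largest gap between Phi(z) and Lambda(scale * z) over an arbitrary grid.
grid = [i / 100 for i in range(-300, 301)]
scale_variance = math.pi / math.sqrt(3.0)   # matches the error standard deviations
max_gap_rule = max(abs(logit_cdf(1.6 * z) - norm_cdf(z)) for z in grid)
max_gap_variance = max(abs(logit_cdf(scale_variance * z) - norm_cdf(z)) for z in grid)

print(round(slope_ratio, 3), round(max_gap_rule, 4), round(max_gap_variance, 4))
```

Either scaling keeps the two probability curves within a few percentage points of each
other, which is consistent with the fitted probabilities from the two models usually being
similar.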


Endogenous  Regressors 

Logit and probit models can be extended to handle many of the complications that
commonly arise in microeconometric analysis. In particular, endogenous regressors
are accommodated using methods similar to those for censored data given in
Section 16.8.2, and panel data methods are presented in Chapter 23.

For  such  complications  it  is  easier  to  work  with  the  linear  probability  model,  since 
then  standard  linear  model  methods  can  be  applied  provided  standard  errors  adjust  for 
heteroskedasticity.  Even  if  logit  and  probit  models  are  ultimately  used,  a  linear  model 
can  be  useful  for  exploratory  analysis. 


14.3.8.  Determining  Model  Adequacy 

Model  diagnostics  and  selection  for  nonlinear  models  were  presented  in  Section  8.7. 
Here  we  consider  specialization  to  binary  outcome  models.  There  is  no  single  best 
measure,  and  statistical  packages  accordingly  report  several  measures  detailed  in 
Amemiya  (1981)  and  Maddala  (1983). 


Pseudo-$R^2$

A standard measure of goodness of fit in the linear regression model is $R^2$. Generalizations
to nonlinear models are called pseudo-$R^2$, with several generalizations possible.

A preferred measure is the relative gain measure denoted $R^2_{RG}$ in Section 8.7.1. This
measure is not always computable, but it is for the binary outcome model since $Q_{\max}$,
the maximum possible value of the log-likelihood, is zero. To obtain this result note
that the best possible fit is clearly a $y^*$ that predicts $y = 1$ with probability $p = 1$
whenever $y = 1$ and predicts $y = 0$ with probability $1 - p = 1$ whenever $y = 0$, in which
case $f(y^*) = 1$ and $\ln f(y^*) = 0$. Then
$R^2_{RG} = 1 - (0 - Q_{fit})/(0 - Q_0) = 1 - Q_{fit}/Q_0$. This yields the $R^2$ measure for binary
outcome models proposed by McFadden (1974):

$$ R^2_{Binary} = 1 - \frac{\mathcal{L}_N(\hat{\beta})}{N[\bar{y}\ln\bar{y} + (1 - \bar{y})\ln(1 - \bar{y})]}
   = 1 - \frac{\sum_i \{ y_i \ln\hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i) \}}{N[\bar{y}\ln\bar{y} + (1 - \bar{y})\ln(1 - \bar{y})]}, \qquad (14.14) $$

where $\hat{p}_i = F(x_i'\hat{\beta})$ and $\bar{y} = N^{-1}\sum_i y_i$.

Additional $R^2$ measures, many specific to binary data, are given in Amemiya (1981)
and Maddala (1983). An obvious one is the squared sample correlation between $y_i$
and $\hat{p}_i$. One of these additional measures is also attributed to McFadden, and many
references give this measure rather than the $R^2$ in (14.14).
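
The measure in (14.14) needs only the outcomes and the fitted probabilities. The sketch below
is a minimal illustration; the eight outcomes and fitted probabilities are made-up numbers,
not results from any model in this chapter, and the function names are hypothetical.

```python
import math


def bernoulli_loglik(y, p):
    """Sum over i of y_i ln p_i + (1 - y_i) ln(1 - p_i); requires 0 < p_i < 1."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y, p_hat):
    """One minus the ratio of the fitted to the intercept-only log-likelihood."""
    n = len(y)
    y_bar = sum(y) / n
    loglik_fit = bernoulli_loglik(y, p_hat)
    loglik_null = n * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))
    return 1.0 - loglik_fit / loglik_null


# Made-up outcomes and fitted probabilities, for illustration only.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_informative = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4]
p_intercept_only = [sum(y) / len(y)] * len(y)

r2_informative = mcfadden_r2(y, p_informative)
r2_intercept_only = mcfadden_r2(y, p_intercept_only)

print(round(r2_informative, 3), round(r2_intercept_only, 3))
```

A model that predicts the sample frequency for every observation has a measure of zero,
and the measure approaches one as fitted probabilities approach the observed outcomes.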


Predicted  Outcomes 

In the linear regression model goodness of fit is often evaluated by comparison of
fitted and actual values. For binary data the fitted value $\hat{y}$ should be binary since $y$
is binary. The criterion $\sum_i (y_i - \hat{y}_i)^2$ gives the number of wrong predictions, which
arise if $(y, \hat{y})$ equals (1, 0) or (0, 1). An obvious prediction rule is to set $\hat{y} = 1$ when
$\hat{p} = F(x'\hat{\beta}) > 0.5$. However, this has the weakness that if most of the sample has
$y = 1$ then often $\sum_i (y_i - \hat{y}_i)^2 = N(1 - \bar{y})$ since it is likely that $\hat{p} > 0.5$ and hence
$\hat{y} = 1$ for all the observations. Similar problems arise if most of the sample has $y = 0$.

More generally, a range of cutoff values may be considered. Letting $\hat{y} = 1$ when
$\hat{p} > c$, we obtain the receiver operating characteristics (ROC) curve, which plots
the fraction of $y = 1$ values correctly classified against the fraction of $y = 0$ values
incorrectly classified as the cutoff $c$ varies. For $c = 1$ all values are predicted to be 0,
so no $y = 1$ values are correctly classified and no $y = 0$ values are incorrectly classified,
and the ROC curve takes value (0, 0). Similarly, for $c = 0$ all values are predicted to be 1
and the ROC curve takes value (1, 1).

If  the  model  has  no  predictive  ability  the  ROC  curve  is  a  straight  line  between  these 
points.  The  more  bowed  the  curve,  and  the  more  area  under  it,  the  better  the  predictive 
power  of  the  model. 
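
The construction of the ROC curve can be sketched in a few lines. The example below is
illustrative only: the outcomes and fitted probabilities are made-up numbers, the cutoff
grid in steps of 0.05 is arbitrary, and the function names are hypothetical. The area under
the curve is computed by the trapezoid rule.

```python
def roc_points(y, p_hat, cutoffs):
    """For each cutoff c, classify y_hat = 1 when p_hat > c and return
    (share of y = 0 wrongly classified as 1, share of y = 1 correctly classified)."""
    n_ones = sum(y)
    n_zeros = len(y) - n_ones
    points = []
    for c in cutoffs:
        y_hat = [1 if p > c else 0 for p in p_hat]
        true_pos = sum(1 for yi, yh in zip(y, y_hat) if yi == 1 and yh == 1)
        false_pos = sum(1 for yi, yh in zip(y, y_hat) if yi == 0 and yh == 1)
        points.append((false_pos / n_zeros, true_pos / n_ones))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the ROC curve traced by the given points."""
    ordered = sorted(set(points))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up data: one y = 1 observation (p_hat = 0.35) ranks below one y = 0 observation (0.4).
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_hat = [0.9, 0.2, 0.8, 0.35, 0.3, 0.1, 0.6, 0.4]
cutoffs = [i / 20 for i in range(21)]   # 0, 0.05, ..., 1

curve = roc_points(y, p_hat, cutoffs)
auc = area_under_curve(curve)

print(curve[0], curve[-1], auc)
```

With these numbers 15 of the 16 possible pairs of one y = 1 and one y = 0 observation are
ranked correctly by the fitted probabilities, and the area under the curve equals that share.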


Predicted  Probabilities 

Since binary data have a simple discrete distribution, an obvious approach is to
compare the sample average predicted probability that $y = 1$, $N^{-1}\sum_i \hat{p}_i$, where
$\hat{p}_i = F(x_i'\hat{\beta})$, with the sample frequency $\bar{y}$. However, this is not useful for the logit
model with an intercept, since $N^{-1}\sum_i \hat{p}_i = \bar{y}$ always holds as the ML first-order
conditions imply $\sum_i [y_i - \Lambda(x_i'\hat{\beta})] = 0$. A similar result holds for estimation by OLS; for
the probit model the result is not exact but in practice is quite close.

This  approach  can  be  used  for  predictions  over  subsamples,  however,  and  can  then 
form the basis for the chi-square goodness-of-fit test given in Section 8.2.6.


14.4. Latent Variable Models

A  latent  variable  is  a  variable  that  is  incompletely  observed.  Latent  variables  can  be 
introduced  into  binary  outcome  models  in  two  different  ways.  In  the  first  the  latent 
variable  is  an  index  of  an  unobserved  propensity  for  the  event  of  interest  to  occur. 
In  the  second  the  latent  variable  is  the  difference  in  utility  that  occurs  if  the  event 
of  interest  occurs,  which  presumes  that  the  binary  outcome  is  a  result  of  individual 
choice.  The  latter  method  makes  clear  the  need  to  distinguish  between  regressors  that 
vary  across  alternatives  for  a  given  individual  and  regressors  such  as  socioeconomic 
characteristics  that  for  a  given  individual  are  invariant  across  alternatives. 

It should be stressed that the binary outcome is Bernoulli distributed, as in
Section 14.3. Latent variable models merely provide a rationale for a particular functional
form for the Bernoulli parameter.

Latent variable models do provide extensions to multinomial outcomes and censored
outcomes (detailed in Chapters 15 and 16). They also provide a framework that
permits Bayesian analysis using data augmentation (see Section 13.7). Brief discussion
of Bayesian analysis of binary and multinomial data is given in Sections 15.7.2
and 15.8.2.


14.4.1.  Index  Function  Models 


In  the  index  function  formulation  interest  lies  in  explaining  an  underlying  unobserved 
continuous  random  variable  y*,  but  all  we  observe  is  the  binary  variable  y,  which  takes 
value  1  or  0  according  to  whether  or  not  y*  crosses  a  threshold.  Different  distributions 
for  y*  lead  to  different  binary  outcome  models. 

Let  y*  be  a  latent  (or  unobserved)  variable,  such  as  the  desire  to  work  if  labor  supply 
is  being  modeled.  The  natural  regression  model  for  y*  is  the  index  function  model 

$$ y^* = x'\beta + u. \qquad (14.15) $$


However, this model cannot be estimated as $y^*$ is not observed. Instead, we observe

$$ y = \begin{cases} 1 & \text{if } y^* > 0, \\ 0 & \text{if } y^* \le 0, \end{cases} \qquad (14.16) $$

where the threshold of zero is a normalization explained in the following.
Given (14.16),

$$ \begin{aligned}
\Pr[y = 1|x] &= \Pr[y^* > 0] \qquad (14.17) \\
             &= \Pr[x'\beta + u > 0] \\
             &= \Pr[-u < x'\beta] \\
             &= F(x'\beta),
\end{aligned} $$

where $F$ is the cdf of $-u$, which equals the cdf of $u$ in the usual case of density
symmetric about 0.

The index function model therefore provides motivation for the functional form of
$F(\cdot)$ in (14.1).


Probit and Logit Models

The probit model arises if the error $u$ is standard normal distributed, since then (14.17)
yields $\Pr[-u < x'\beta] = \Phi(x'\beta)$, where $\Phi(\cdot)$ is the cdf of the standard normal.

Now introduce the logistic distribution. In its standard form the logistic has cdf

$$ \Lambda(u) = e^u/(1 + e^u), \qquad -\infty < u < \infty. \qquad (14.18) $$

The density function $\Lambda'(u) = e^u/(1 + e^u)^2$ is symmetric about 0, and a logistic random
variable has mean 0 and variance $\pi^2/3 \simeq 1.814^2$.

The logit model arises if the error $u$ is logistic distributed, since then (14.17) yields
$\Pr[-u < x'\beta] = \Lambda(x'\beta)$. Note that $\beta$ is scaled differently in the two models due to
different $V[u]$.
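
The link between the error distribution and the implied probability can be checked by
simulation. The sketch below is illustrative only: the index value 0.8, the number of draws,
and the seed are arbitrary assumptions, and the helper names are hypothetical. It draws the
latent error, records how often $x'\beta + u > 0$, and compares that share with the
corresponding cdf evaluated at the index.

```python
import math
import random


def logistic_cdf(z):
    """Logistic cdf Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def norm_cdf(z):
    """Standard normal cdf Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def draw_logistic(rng):
    """Inverse-cdf draw: if U is uniform on (0, 1) then ln(U / (1 - U)) is standard logistic."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def draw_normal(rng):
    """Standard normal draw."""
    return rng.gauss(0.0, 1.0)


def simulated_share(index, n_draws, rng, draw_error):
    """Share of draws with latent y* = index + u > 0, that is, with observed y = 1."""
    hits = sum(1 for _ in range(n_draws) if index + draw_error(rng) > 0.0)
    return hits / n_draws


rng = random.Random(12345)
index = 0.8                  # hypothetical value of x'beta
share_logistic = simulated_share(index, 100_000, rng, draw_logistic)
share_normal = simulated_share(index, 100_000, rng, draw_normal)

print(round(share_logistic, 3), round(logistic_cdf(index), 3))
print(round(share_normal, 3), round(norm_cdf(index), 3))
```

The same index value yields different probabilities under the two error distributions,
which reflects the different scaling of $\beta$ in the two models.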


Identification  Considerations 

Identification of the single-index model requires a restriction on the variance of $u$, as
the single-index model can only identify $\beta$ up to scale. All that is observed is whether
or not $y^* > 0$, or equivalently whether or not $x'\beta + u > 0$. However, this is equivalent
to whether or not $x'\beta^+ + u^+ > 0$, where $\beta^+ = a\beta$ and $u^+ = au$ for any $a > 0$. Placing
a restriction on the variance of the error ($u$ or $u^+$) secures uniqueness of $\beta$. The
error variance is set to one in the probit model and $\pi^2/3$ in the logit model.

The threshold for the index model need not be zero. If more generally $y = 1$ when
$y^* > z'\delta$ then (14.17) becomes $\Pr[y = 1] = F(x'\beta - z'\delta)$. Then $\delta$ can be separately
identified if and only if all components of $z$ and $x$ differ. In particular, if both $x$ and
$z$ include intercepts these cannot be separately identified, so we normalize the threshold
intercept to be zero. Note also that the mean of the error distribution needs to be
normalized. For the logit and probit models it is set to zero.

Discussion 

The index function model implies a direct interpretation of $\beta$ as the change in the
latent variable $y^*$ when $x$ changes by one unit. Even though $y^*$ is unobserved, this
interpretation is meaningful if one uses knowledge of the specified variance of $u$. For
example, a slope parameter of 0.5 in the probit model means a one-unit change in
the regressor leads to a 0.5 standard deviation change in $y^*$, since in this model the
variance of $y^*$ conditional on $x$ equals 1.

Commonly  used  extensions  of  the  index  function  approach  are  to  ordered  discrete 
choice  models  (see  Section  15.9)  and  to  models  for  censored  and  selected  samples  (see 
Chapter  16). 


14.4.2.  Random  Utility  Models 

In  the  random  utility  formulation  a  consumer  chooses  between  alternatives  0  and  1 
according  to  which  has  the  higher  satisfaction  or  utility.  The  discrete  variable  y  then 
takes  value  1  if  alternative  1  has  higher  utility,  and  it  takes  value  0  if  alternative  0  has 
higher utility.

The additive random utility model (ARUM) specifies the utilities of alternatives
0 and 1 to be

$$ U_0 = V_0 + \varepsilon_0, \qquad (14.19) $$
$$ U_1 = V_1 + \varepsilon_1, $$

where $V_0$ and $V_1$ are deterministic components of utility and $\varepsilon_0$ and $\varepsilon_1$ are random
components of utility. A simple example is $V_0 = x'\beta_0$ and $V_1 = x'\beta_1$, though from
Section 14.4.3 only $(\beta_1 - \beta_0)$ is then identified.

The alternative with higher utility is chosen. We observe $y = 1$, say, if $U_1 > U_0$.
Owing to the presence of the random components of utility this is a random event
with

$$ \begin{aligned}
\Pr[y = 1] &= \Pr[U_1 > U_0] \qquad (14.20) \\
           &= \Pr[V_1 + \varepsilon_1 > V_0 + \varepsilon_0] \\
           &= \Pr[\varepsilon_0 - \varepsilon_1 < V_1 - V_0] \\
           &= F(V_1 - V_0),
\end{aligned} $$

where $F$ is the cdf of $(\varepsilon_0 - \varepsilon_1)$. This yields $\Pr[y = 1] = F(x'\beta)$ if $V_1 - V_0 = x'\beta$.

The ARUM requires a scale normalization since if $U_1 > U_0$ then $aU_1 > aU_0$ for any $a > 0$.
This is usually done by specifying the variance of $\varepsilon_0 - \varepsilon_1$ or the variances of
$\varepsilon_0$ and $\varepsilon_1$.

Different specifications for the distributions of $\varepsilon_0$ and $\varepsilon_1$ give different $F(\cdot)$ and
hence different discrete choice models. The random utility formulation is especially
useful for specifying unordered multinomial choice models (see Section 15.5).


Probit  and  Logit  Models 

An obvious choice for error distribution in (14.19) is that $\varepsilon_0$ and $\varepsilon_1$ are normal. Then
$(\varepsilon_0 - \varepsilon_1)$ is normally distributed. Normalization of the variance of $(\varepsilon_0 - \varepsilon_1)$ to unity
gives the probit model since then $F(\cdot)$ in (14.20) is the standard normal cdf.

Now introduce the type 1 extreme value distribution or log Weibull distribution.
Then the random variable $\varepsilon$ has density

$$ f(\varepsilon) = e^{-\varepsilon}\exp(-e^{-\varepsilon}), \qquad -\infty < \varepsilon < \infty, \qquad (14.21) $$

and cdf $F(\varepsilon) = \exp(-e^{-\varepsilon})$. The extreme value distributions, rarely used in econometrics,
are obtained as limiting distributions as $N \to \infty$ of the maximum of $N$ random
variables drawn from the same distribution. The type 1 extreme value distribution is
a special case that is right-skewed over $(-\infty, \infty)$ with most of the mass between $-2$
and 5. It has median $-\ln(-\ln(0.5)) \simeq 0.36651$, mean $-\Gamma'(1) = 0.57722$, where $\Gamma'(x)$
denotes the derivative of the gamma function, and variance $\pi^2/6 \simeq 1.28255^2$. The
distribution is well approximated by a log-normal.

The logit model arises if $\varepsilon_0$ and $\varepsilon_1$ are assumed to be independent type 1 extreme
value distributed. Then the difference $(\varepsilon_0 - \varepsilon_1)$ can be shown to be logistic distributed
(see Johnson and Kotz, 1970), so $F(\cdot)$ in (14.20) is the logistic cdf.

An alternative derivation of this result, working directly with the extreme value
distribution, is given later in Section 14.8. The derivation indicates the difficulty in
obtaining closed-form solutions for probabilities when the ARUM is extended to
choice among three or more alternatives in Section 15.5. Recent computational advances
permit estimation even in the absence of a closed-form solution.
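
The claim that independent type 1 extreme value errors lead to the logit model can be checked
by simulation. The sketch below is an illustration under stated assumptions: the utilities
`v0` and `v1`, the number of draws, and the seed are arbitrary, and the function names are
hypothetical. Errors are drawn by inverting the cdf $F(\varepsilon) = \exp(-e^{-\varepsilon})$.

```python
import math
import random


def draw_type1_extreme_value(rng):
    """Inverse-cdf draw from F(e) = exp(-exp(-e)): e = -ln(-ln(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulated_choice_share(v0, v1, n_draws, rng):
    """Share of draws in which alternative 1 has the higher utility in the ARUM."""
    chosen = 0
    for _ in range(n_draws):
        utility_0 = v0 + draw_type1_extreme_value(rng)
        utility_1 = v1 + draw_type1_extreme_value(rng)
        if utility_1 > utility_0:
            chosen += 1
    return chosen / n_draws


rng = random.Random(2024)
v0, v1 = 0.2, 1.0            # hypothetical deterministic utilities
share = simulated_choice_share(v0, v1, 100_000, rng)
logit_probability = 1.0 / (1.0 + math.exp(-(v1 - v0)))

# Sample mean of the errors, to compare with the stated mean of about 0.57722.
error_mean = sum(draw_type1_extreme_value(rng) for _ in range(100_000)) / 100_000

print(round(share, 3), round(logit_probability, 3), round(error_mean, 3))
```

Only the difference `v1 - v0` matters for the simulated share, in line with (14.20).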


14.4.3.  Alternative-Varying  Regressors 

In  most  applications  of  binary  choice  models,  some  regressors  vary  across  individuals, 
but  regressors  do  not  necessarily  vary  across  alternatives. 

At the one extreme regressors do not vary across alternatives. For example, in labor
supply models of the decision to work, socioeconomic characteristics such as income
and gender do not vary across alternatives. A potential regressor, the wage rate, does
vary across the alternatives of work or not work, but this regressor is usually not
included as it is only observed for those who choose to work.

At the other extreme all regressors may vary across alternatives. For example, in
transportation mode choice models the regressors may be the time cost and money
cost of the two modes of transportation.

A general hybrid ARUM defines the deterministic components of utility in (14.19)
to be

$$ V_{ij} = z_{ij}'\alpha_j + w_i'\gamma_j, \qquad j = 0, 1, \qquad (14.22) $$

where $z_{ij}$ are regressors that take different values across the two alternatives, whereas
$w_i$ are individual characteristics that do not vary with the choice. Then (14.20) yields

$$ \Pr[y_i = 1] = F(z_{i1}'\alpha_1 - z_{i0}'\alpha_0 + w_i'(\gamma_1 - \gamma_0)). $$

For alternative-invariant regressors only the parameter difference $(\gamma_1 - \gamma_0)$ can be
identified. For regressors that vary across both alternatives and individuals the
coefficients can vary over alternatives, but it is customary to set
$\alpha_1 = \alpha_0 = \alpha$. For example, the loss of utility resulting from a one-dollar increase in
travel costs is expected to be the same across different transportation modes. Thus the
ARUM leads to

$$ \Pr[y_i = 1] = F((z_{i1} - z_{i0})'\alpha + w_i'(\gamma_1 - \gamma_0)), \qquad (14.23) $$

which is the original binary choice model (14.1) where the regressors are alternative-invariant
regressors $w$ and the difference across alternatives of alternative-varying
regressors $z$.


14.5.  Choice-Based  Samples 

Choice-based sampling arises whenever selection of the sample is determined in part
by values taken by the dependent variable y, rather than being completely random or
being based in part on values taken by x.

Discrete data models are a leading example since surveys often deliberately oversample
choices that are made infrequently. For example, if few people choose to
commute by bus, an oversampling of bus riders may be undertaken. In the medical
literature the same problem arises with case-control analysis where, for example, a
binary data analysis may be based on a full sample of those who had a heart attack and
a subsample of people with similar characteristics who did not have a heart attack. The
standard term choice-based sampling is a little misleading since it does not arise from
individual choice.

To see the inconsistency of standard binary choice methods, consider estimation
of the logit model when the only regressor is the intercept. Then $\Lambda(x_i'\beta) = \Lambda(\beta)$
and the logit MLE first-order conditions become $N^{-1}\sum_i (y_i - \Lambda(\beta)) = 0$, so
$\hat{\beta} = \ln(\bar{y}/(1 - \bar{y}))$. Consistency of $\hat{\beta}$ clearly requires a random sample because, for
example, oversampling $y = 1$ causes $\bar{y}$ to overestimate the population probability that
$y = 1$, and hence $\hat{\beta}$ to overestimate $\beta$.

Methods to obtain consistent estimates given endogenous sampling such as choice-based
sampling are covered in detail in Section 24.4. Analysis is straightforward if
the degree of oversampling is known. Let $Q_1$ denote the fraction of the population
with $y = 1$ and $H_1 = \bar{y}$ denote the fraction of the sample with $y = 1$. Similarly define
$Q_0 = 1 - Q_1$ and $H_0 = 1 - H_1$. Then consistent estimation is possible using the
weighted MLE proposed by Manski and Lerman (1977). For binary outcome models
this maximizes the weighted log-likelihood

$$ \mathcal{L}_N(\beta) = \sum_{i=1}^{N} \left\{ \frac{Q_1}{H_1}\, y_i \ln F(x_i'\beta)
   + \frac{Q_0}{H_0}\,(1 - y_i)\ln(1 - F(x_i'\beta)) \right\}. $$

For example, if outcomes $y = 1$ are oversampled, then $Q_1/H_1 < 1$ and the oversampled
observations with $y = 1$ are downweighted. This estimator is easily implemented
using any program for binary outcome models that permits weighting of observations.
Then observations with $y = 1$ are given weight $Q_1/H_1$ and observations with $y = 0$
are given weight $Q_0/H_0$.
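
For the intercept-only logit model the weighted estimator has a closed form, which makes
the effect of the weights easy to see. The sketch below is illustrative only: the population
share `q1` and the balanced sample of 1,000 observations are hypothetical numbers, and the
function name is not from any package. With weights $w_i$ the first-order condition is
$\sum_i w_i (y_i - \Lambda(\beta)) = 0$, so $\Lambda(\hat{\beta})$ equals the weighted mean of $y$.

```python
import math


def logit_intercept_mle(y, weights=None):
    """(Weighted) MLE of the intercept in an intercept-only logit model.

    Solves sum_i w_i * (y_i - Lambda(b)) = 0, so Lambda(b) is the weighted mean of y.
    """
    if weights is None:
        weights = [1.0] * len(y)
    weighted_mean = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    return math.log(weighted_mean / (1.0 - weighted_mean))


# Hypothetical design: y = 1 has population share 0.10 but makes up half the sample.
q1 = 0.10
y = [1] * 500 + [0] * 500
h1 = sum(y) / len(y)
q0, h0 = 1.0 - q1, 1.0 - h1

weights = [q1 / h1 if yi == 1 else q0 / h0 for yi in y]

beta_unweighted = logit_intercept_mle(y)
beta_weighted = logit_intercept_mle(y, weights)
beta_population = math.log(q1 / q0)

print(round(beta_unweighted, 4), round(beta_weighted, 4), round(beta_population, 4))
```

The unweighted estimate reflects the sample share of 0.5, whereas the weighted estimate
recovers the log-odds implied by the assumed population share.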

A detailed summary of ML methods for choice-based sampling of binary and
multinomial data, including methods when $Q_1$ and $Q_0$ are unknown, is given in
Amemiya (1985, Section 9.5). The weighted MLE is inefficient but simple to implement
and the efficiency loss may not be great. Manski and McFadden (1981a) proposed
a variation that is more efficient (see Amemiya and Vuong, 1987). Cosslett
(1981a,b) proposed further refinements that are fully efficient but impractical to
implement. Imbens (1992) and Lancaster and Imbens (1996) proposed GMM estimation
as an alternative method that is feasible to implement and is fully efficient. King and
Zeng (2001) give a summary for the binary logit model; additionally, they consider
small-sample corrections that, even with oversampling, make a difference when the
outcome of interest occurs with low probability in the population. For further details see
Section 24.4.

The  epidemiological  literature  has  focused  on  the  logit  model  for  case-control 
studies.  The  method  is  attributed  to  Prentice  and  Pyke  (1979).  See  Breslow  (1996), 
especially  his  Section  4.3,  which  discusses  links  between  the  econometrics  and 
epidemiological literature.


14.6. Grouped and Aggregate Data

In some applications only grouped or aggregate data may be available, yet individual
behavior is felt to be best modeled by a binary choice model. Grouping poses no problem
when the grouping is based on unique values of the regressors and there are many
observations per unique value of the regressors. We begin with this simple example
before moving to more realistic ones.


14.6.1.  Berkson’s  Minimum  Chi-Square  Estimator 

Suppose the regressor vector $x_i$, $i = 1, \ldots, N$, takes only $T$ distinct values, where
$T$ is much smaller than $N$. Then for each value of the regressors we have multiple
observations on $y$. This type of grouped data is called many observations per cell. It
can arise particularly in experimental data where $x$ is of low dimension and is set by
experimental design to just a few values. Let $x_t$, $t = 1, \ldots, T$, be the $T$ distinct values,
$N_t$ be the number of observations on $y_i$ for the $t$th distinct value of $x$, so $\sum_t N_t = N$,
and $\bar{p}_t$ be the proportion of times $y_i = 1$ when $x_i = x_t$. Note that the subscript $t$ is
being used to denote grouping and does not necessarily denote time.

For individual $i$ with $x_i = x_t$, the Bernoulli probability is

$$ p_t = \Pr[y_i = 1 | x_i = x_t] = F(x_t'\beta), \qquad (14.24) $$

as before. Inverting (14.24) implies that

$$ F^{-1}(p_t) = x_t'\beta. $$


Now $p_t$ is unknown but can be estimated by $\bar{p}_t$, so Berkson proposed regressing
$F^{-1}(\bar{p}_t)$ on $x_t$. Thus we estimate by LS the transformation model

$$ F^{-1}(\bar{p}_t) = x_t'\beta + v_t, \qquad t = 1, \ldots, T. \qquad (14.25) $$

The error term $v_t = F^{-1}(\bar{p}_t) - F^{-1}(p_t)$ is heteroskedastic, with variance that will
decrease as $N_t$ increases, since then $\bar{p}_t$ is a better estimate of $p_t$, and that will also depend
on the shape of $F(\cdot)$. By Taylor series expansion (see Amemiya (1981, p. 1498) or
Maddala (1983, p. 31)), $v_t$ has variance that can be consistently estimated by

$$ \hat{\sigma}_t^2 = \frac{\bar{p}_t(1 - \bar{p}_t)}{N_t[F'(F^{-1}(\bar{p}_t))]^2}. \qquad (14.26) $$


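Berkson’s minimum chi-square estimator $\hat{\beta}_{MC}$ minimizes the weighted sum of
squared residuals $\sum_{t=1}^{T}(F^{-1}(\bar{p}_t) - x_t'\beta)^2/\hat{\sigma}_t^2$ with respect to $\beta$. This is easily
computed by OLS regression of $F^{-1}(\bar{p}_t)/\hat{\sigma}_t$ on $x_t/\hat{\sigma}_t$.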

This estimator is simple to implement, as it only requires an OLS package. Yet it
is fully efficient, as it can be shown to have the same asymptotic distribution as the
MLE that treats each observation separately, rather than grouping them into cells with
common regressor value $x_t$. For the logit model this estimator is especially simple, as
$F^{-1}(\bar{p}_t) = \ln(\bar{p}_t/(1 - \bar{p}_t))$ and $\hat{\sigma}_t^2 = 1/[N_t\bar{p}_t(1 - \bar{p}_t)]$.
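
For the logit case with an intercept and a single regressor, the estimator reduces to a
weighted least squares regression of the log-odds on $x_t$ with weights $N_t\bar{p}_t(1 - \bar{p}_t)$.
The sketch below is an illustration under stated assumptions: the four cells, their sizes,
and the parameter values are hypothetical, the cell proportions are set equal to the true
probabilities so that the fit is exact, and the function name is not from any package.

```python
import math


def berkson_logit(x, n_obs, p_bar):
    """Minimum chi-square logit estimates (intercept, slope) from grouped data.

    Weighted LS of ln(p_bar / (1 - p_bar)) on x with weights N_t * p_bar_t * (1 - p_bar_t),
    which are the reciprocals of the estimated error variances.
    """
    log_odds = [math.log(p / (1.0 - p)) for p in p_bar]
    weights = [n * p * (1.0 - p) for n, p in zip(n_obs, p_bar)]
    total_weight = sum(weights)
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / total_weight
    z_mean = sum(w * zi for w, zi in zip(weights, log_odds)) / total_weight
    sxx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    sxz = sum(w * (xi - x_mean) * (zi - z_mean) for w, xi, zi in zip(weights, x, log_odds))
    slope = sxz / sxx
    intercept = z_mean - slope * x_mean
    return intercept, slope


# Hypothetical grouped data: four cells, with proportions generated from a logit
# model with intercept -1 and slope 0.5 (no sampling error, so recovery is exact).
x_cells = [0.0, 1.0, 2.0, 3.0]
n_cells = [200, 150, 300, 250]
p_cells = [1.0 / (1.0 + math.exp(-(-1.0 + 0.5 * xt))) for xt in x_cells]

intercept_mc, slope_mc = berkson_logit(x_cells, n_cells, p_cells)

print(round(intercept_mc, 6), round(slope_mc, 6))
```

With sampled rather than exact proportions the estimates would differ from the true values,
and cells with a proportion of exactly 0 or 1 would need to be handled separately because
the log-odds is then undefined.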

The advantage of the minimum chi-square estimator is its computational simplicity,
although advances in computer power now make this point moot. Grouped economics
data are rarely such that there are many observations within group per unique value
of the regressors, unless the regressors are just a few indicator variables. The method
does provide insights into aggregation, however, a topic we now consider.


14.6.2.  Estimation  with  Aggregate  Data 


Econometrics  examples  of  data  aggregation  include  data  on  the  proportion  of  people 
working  and  data  on  the  proportion  of  those  commuting  by  bus  in  different  regions, 
explained  by  data  on  the  average  characteristics  of  people  in  the  region. 

As a concrete example, suppose $p_t$ equals the unemployment rate in region $t$ and
$x_t$ equals the average level of schooling in region $t$. One possible model is LS regression
of $p_t$ on $x_t$. Because $0 < p_t < 1$, many studies instead transform to a dependent
variable that is unbounded, estimating the model

$$ \ln\left(\frac{p_t}{1 - p_t}\right) = x_t'\beta + u_t, \qquad (14.27) $$

where $u_t$ is an error.

This model looks similar to the minimum chi-square estimator for the logit model,
when $F^{-1}(\bar{p}_t) = \ln(\bar{p}_t/(1 - \bar{p}_t))$. However, it is not, because Berkson’s estimator is
only appropriate if all regressors in the $t$th cell take the same value. Here instead the
regressors can take different values, as different people in region $t$ will have different
levels of schooling.

To see the consequences of aggregation when there is within-cell heterogeneity
in the regressors, suppose the individual-level model is an index model (see Section
14.4.1) with

$$y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \qquad u_i \sim \mathcal{N}[0, 1].$$

We choose to work with normal errors, corresponding to a probit rather than logit
model, because it is then possible to obtain analytical results. Model the heterogeneity
as

$$\mathbf{x}_i \sim \mathcal{N}[\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t],$$

for individuals in cell $t$. This realistically permits variation across cells, and the complication
is that $\boldsymbol{\Sigma}_t \neq \mathbf{0}$, so there is within-cell heterogeneity. Then in region $t$, conditional
on $\boldsymbol{\beta}$, $\boldsymbol{\mu}_t$, and $\boldsymbol{\Sigma}_t$,


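$$\begin{aligned}
\Pr[y_i = 1] &= \Pr[\mathbf{x}_i'\boldsymbol{\beta} + u_i > 0] \\
&= \Pr\left[ \frac{\mathbf{x}_i'\boldsymbol{\beta} + u_i - \boldsymbol{\mu}_t'\boldsymbol{\beta}}{\sqrt{1 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_t\boldsymbol{\beta}}} > \frac{-\boldsymbol{\mu}_t'\boldsymbol{\beta}}{\sqrt{1 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_t\boldsymbol{\beta}}} \right] \\
&= \Phi\left( \frac{\boldsymbol{\mu}_t'\boldsymbol{\beta}}{\sqrt{1 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_t\boldsymbol{\beta}}} \right),
\end{aligned}$$

where we use $\mathbf{x}_i'\boldsymbol{\beta} + u_i \sim \mathcal{N}[\boldsymbol{\mu}_t'\boldsymbol{\beta}, (1 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_t\boldsymbol{\beta})]$ given the preceding assumptions and
then subtract the mean and divide by the standard deviation to transform to a standard
normal variate.

By similar argument to that leading to (14.25) given (14.24), the underlying
individual-level binary choice parameters $\boldsymbol{\beta}$ can be consistently estimated by nonlinear
LS estimation of $\boldsymbol{\beta}$ in the regression

$$\Phi^{-1}(\bar{y}_t) = \frac{\bar{\mathbf{x}}_t'\boldsymbol{\beta}}{\sqrt{1 + \boldsymbol{\beta}'\mathbf{S}_t\boldsymbol{\beta}}} + w_t, \tag{14.28}$$

where $\bar{y}_t$ and $\bar{\mathbf{x}}_t$ are cell averages and $\mathbf{S}_t$ is the sample variance of $\mathbf{x}_i$ in cell $t$. The Berkson
minimum chi-square estimate instead regresses $\Phi^{-1}(\bar{y}_t)$ on $\bar{\mathbf{x}}_t$ and is inconsistent
for $\boldsymbol{\beta}$ unless $\boldsymbol{\Sigma}_t = \mathbf{0}$.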
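The attenuation implied by within-cell heterogeneity can be checked numerically. The following Python sketch simulates one cell under the normal assumptions used above and compares the simulated proportion with the closed-form probability and with the value obtained by ignoring the heterogeneity. The parameter values, the seed, and the helper names `std_normal_cdf` and `aggregate_probit_prob` are hypothetical choices for illustration.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def aggregate_probit_prob(beta, mu_t, sigma_t):
    """Pr[y = 1] in a cell where x ~ N[mu_t, sigma_t] and u ~ N[0, 1]."""
    scale = math.sqrt(1.0 + float(beta @ sigma_t @ beta))
    return std_normal_cdf(float(mu_t @ beta) / scale)

rng = np.random.default_rng(0)
beta = np.array([0.5, -1.0])                 # hypothetical individual-level parameters
mu_t = np.array([1.0, 0.2])                  # hypothetical cell mean of the regressors
sigma_t = np.array([[1.0, 0.3],
                    [0.3, 2.0]])             # hypothetical within-cell variance

n = 200_000
x = rng.multivariate_normal(mu_t, sigma_t, size=n)
u = rng.standard_normal(n)
y = (x @ beta + u > 0).astype(float)         # individual binary outcomes

p_sim = y.mean()                                         # simulated cell proportion
p_formula = aggregate_probit_prob(beta, mu_t, sigma_t)   # closed form with heterogeneity
p_naive = std_normal_cdf(float(mu_t @ beta))             # ignores within-cell variation
print(p_sim, p_formula, p_naive)
```

The simulated proportion should be close to the closed-form value and noticeably different from the naive value, which is the source of the inconsistency of a regression that ignores the within-cell variance.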


14.6.3.  Discussion 

Aggregation  issues  are  much  more  complicated  in  nonlinear  models.  If  the  original 
individual-level model was the linear model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$ with $\mathbf{x}_i \sim \mathcal{N}[\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t]$ in
the $t$th cell, then the corresponding linear regression of $\bar{y}_t$ on $\bar{\mathbf{x}}_t$ would yield a consistent
estimate of $\boldsymbol{\beta}$. With nonlinear models similar aggregation leads to inconsistent
estimation  of  individual-level  parameters,  unless  adjustment  such  as  that  in  (14.28)  is 
undertaken.  Furthermore,  the  example  in  Section  14.6.2,  due  to  McFadden  and  Reid 
(1975),  is  unusual  in  that  aggregation  of  a  nonlinear  model  leads  to  tractable  results. 
This  example  is  discussed  in  considerable  detail  by  Cameron  (1990),  who  considers  it 
in  the  wider  context  of  aggregation  in  nonlinear  models. 

An  active  area  of  aggregation  in  discrete  choice,  usually  multinomial  choice,  is  the 
marketing  literature  on  market  shares  of  branded  goods.  Allenby  and  Ross  (1991) 
present  examples  where  the  bias  of  fitting  aggregate  logit  models  may  not  be  great. 
More  importantly,  recent  computational  advances  permit  estimation  of  individual-level 
parameters  with  aggregate  data  even  if  aggregation  yields  no  closed-form  solution. 
See,  for  example,  Berry  (1994)  and  Nevo  (2001),  who  estimate  models  qualitatively 
similar  to  the  random  parameters  logit  model  in  Section  15.7. 

Finally, note that in many applications with aggregate proportions data, such as
unemployment rate by region, there is no desire to estimate individual-level parameters.
The only goal is a reasonable model for dependent variable $\bar{p}_t$ that lies between zero
and one. Then the linear regression (14.27) may be fine. The error $u_t$ in (14.27) will no
longer have the variance given in (14.26). It will still be heteroskedastic, however, so
statistical inference should be based on White heteroskedastic-robust standard errors.
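A minimal sketch of this last recommendation follows: LS regression of the log-odds of a regional proportion on regional averages, with standard errors from the White sandwich formula (the HC0 variant, without a degrees-of-freedom correction). The regional data are simulated and entirely hypothetical, and the function name `logit_ls_white` is an assumption of this illustration.

```python
import numpy as np

def logit_ls_white(p_bar, X):
    """LS regression of ln(p/(1-p)) on X with White (HC0) standard errors."""
    p_bar = np.asarray(p_bar, dtype=float)
    X = np.asarray(X, dtype=float)
    z = np.log(p_bar / (1.0 - p_bar))            # dependent variable in (14.27)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ z)                   # OLS coefficients
    resid = z - X @ beta
    meat = (X * (resid ** 2)[:, None]).T @ X     # sum over t of resid_t^2 x_t x_t'
    cov = xtx_inv @ meat @ xtx_inv               # sandwich variance estimate
    return beta, np.sqrt(np.diag(cov))

# Hypothetical regional data: unemployment rate and average years of schooling.
rng = np.random.default_rng(1)
n_regions = 60
schooling = rng.uniform(9.0, 15.0, size=n_regions)
X = np.column_stack([np.ones(n_regions), schooling])
noise = rng.normal(scale=0.1 + 0.03 * (schooling - 9.0), size=n_regions)  # heteroskedastic
p_bar = 1.0 / (1.0 + np.exp(-(0.5 - 0.25 * schooling + noise)))

beta_hat, se_hat = logit_ls_white(p_bar, X)
print(beta_hat, se_hat)
```

In practice a packaged routine with a robust-variance option would normally be used; the explicit formula is written out here only to show what is being computed.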


14.7.  Semiparametric  Estimation 

The binary outcome model is perhaps the leading example of semiparametric regression.
Most econometrics studies presume a single-index form $F(\mathbf{x}_i'\boldsymbol{\beta})$, where the
functional form for $F$ is not specified. The goal is to obtain an estimate of $\boldsymbol{\beta}$ that
is consistent for $\boldsymbol{\beta}$, ideally $\sqrt{N}$-consistent and asymptotically normal, while $F(\cdot)$ is
viewed as a nuisance function. The single-index model semiparametric estimators of
Section 9.7.4 can be applied. Additional estimators exploit the index function model
interpretation for binary outcomes. In addition, semiparametric ML estimation that
attains the semiparametric efficiency bounds is possible with little need for additional
assumptions, since it is clear that the distribution is Bernoulli and only $F(\mathbf{x}_i'\boldsymbol{\beta})$ is not
known.


14.7.1.  Semiparametric  Conditional  Mean  Estimation 

The estimation problem in general is one where the dependent variable $y_i$ takes value
0 or 1 with conditional mean

$$E[y_i | \mathbf{x}_i] = m(\mathbf{x}_i),$$

where $m(\cdot)$ is unknown. Note that $m(\mathbf{x}_i)$ also equals the conditional probability that
$y_i = 1$.

The nonparametric regression methods of Sections 9.4-9.6 can be applied, despite
the binary nature of the dependent variable. This is easily seen from Figure 14.1, a
scatterplot of binary variable $y$ on scalar regressor $x$, a natural candidate for kernel
regression of $y$ on $x$. The fitted values will lie between 0 and 1, aside from unusual
cases  such  as  when  higher  order  kernels  are  used,  in  which  case  the  fitted  variable  can 
take  negative  values. 

In  many  microeconometrics  applications  x  is  of  too  high  a  dimension  for  nonpara¬ 
metric  methods  to  work  well  (the  curse  of  dimensionality).  Semiparametric  regression 
models  that  partially  specify  m(-)  are  given  in  Section  9.7.  Additive  models  are  fairly 
popular  in  statistical  applications.  In  econometrics  single-index  models  are  instead 
used,  since  a  popular  starting  point  is  the  index  function  model  of  Section  14.4.1.  This 
yields a single-index model if the latent variable $y^* = \mathbf{x}'\boldsymbol{\beta} + u$. Thus we suppose

$$E[y_i | \mathbf{x}_i] = F(\mathbf{x}_i'\boldsymbol{\beta}),$$

where  we  follow  the  notation  of  this  chapter  and  use  F(-)  rather  than  g(-)  to  denote  the 
unknown  function. 

From Section 9.7.4, $\boldsymbol{\beta}$ is only identified up to location and scale. This is also clear
from Section 14.4.1, where the error $u$ in the index model was normalized to have
mean 0 (location) and the variance needed to be specified (scale). Here restrictions are
not placed on $u$, so $\boldsymbol{\beta}$ is not completely identified but the ratios of slope coefficients
are identified. See Manski (1988b) for a detailed analysis of identification in binary
choice models.

$\sqrt{N}$-consistent asymptotically normal estimates of $\boldsymbol{\beta}$ can be obtained by average
derivative estimation or by semiparametric least squares (see Section 9.7.4). However,
alternative estimators, specific to binary outcomes, are more often used.

14.7.2.  Maximum  Score  Estimation 

Semiparametric estimators for binary outcomes are often based on the index function
model $y^* = \mathbf{x}'\boldsymbol{\beta} + u$. In such cases it is convenient to write the
model as

$$y_i = \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} + u_i > 0),$$

where $\mathbf{1}(A) = 1$ if event $A$ occurs and $\mathbf{1}(A) = 0$ otherwise.




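Manski (1975) noted that the predicted value of $y_i$ is $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$, setting $u_i = 0$
since $u_i$ is unknown, in which case a score of the number of correct predictions is

$$S_N(\boldsymbol{\beta}) = \sum_{i=1}^{N} \{ y_i \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + (1 - y_i) \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < 0) \}, \tag{14.29}$$

since correct predictions occur if $y_i = 1$ and $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$, or if $y_i = 0$ and $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < 0)$.
Manski’s maximum score estimator $\hat{\boldsymbol{\beta}}_{MS}$ maximizes $S_N(\boldsymbol{\beta})$. This is a nonstandard
problem because $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$ is not differentiable in $\boldsymbol{\beta}$. Manski (1975, 1985) established
consistency assuming $F(0) = 0.5$, or equivalently that $\mathrm{Median}[u_i | \mathbf{x}_i] = 0$. It
has subsequently been shown that $N^{1/3}(\hat{\boldsymbol{\beta}}_{MS} - \boldsymbol{\beta})$ has a nonnormal limit distribution,
though inference can be performed using the bootstrap (Manski and Thompson
(1986)).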

Manski’s estimator can be viewed as a least absolute deviations estimator. From
Section 4.6.2, the LAD estimator minimizes the sum of absolute differences between
$y_i$ and $\mathrm{Median}[y_i | \mathbf{x}_i]$. This less familiar estimator is qualitatively similar to the LS
estimator, which minimizes the sum of squared differences between $y_i$ and $E[y_i | \mathbf{x}_i]$.
To implement LAD here requires obtaining $\mathrm{Median}[y_i | \mathbf{x}_i]$. If $\mathrm{Median}[u_i | \mathbf{x}_i] = 0$
then $\mathrm{Median}[y_i^* | \mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}$, so $\mathrm{Median}[y_i | \mathbf{x}_i] = \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$. Thus the binary outcome
model LAD estimator minimizes

$$Q_N(\boldsymbol{\beta}) = \sum_{i=1}^{N} | y_i - \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) |. \tag{14.30}$$

From Exercise 14.4 $Q_N(\boldsymbol{\beta}) = N - S_N(\boldsymbol{\beta})$, so the maximum score estimator equals the
LAD estimator. See Manski (1985, p. 320) for other interpretations of the maximum
score estimator as a LAD estimator.
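The score and LAD objectives above are easy to evaluate directly, which gives a quick numerical check of the identity $Q_N(\boldsymbol{\beta}) = N - S_N(\boldsymbol{\beta})$. The Python sketch below also maximizes the score by brute-force grid search for a model with an intercept and one slope. Several choices here are assumptions of the illustration and not part of the text: the simulated data, the normalization of the coefficient vector to unit length, the grid search itself, and the treatment of ties ($\mathbf{x}_i'\boldsymbol{\beta} = 0$ is counted as a prediction of zero).

```python
import numpy as np

def predict(beta, X):
    """Predicted outcome 1(x'beta > 0); ties are treated as a prediction of 0."""
    return (X @ beta > 0).astype(int)

def score(beta, y, X):
    """S_N(beta): number of correct predictions, as in (14.29)."""
    pred = predict(beta, X)
    return int(np.sum(y * pred + (1 - y) * (1 - pred)))

def lad_objective(beta, y, X):
    """Q_N(beta): sum of absolute prediction errors, as in (14.30)."""
    return int(np.sum(np.abs(y - predict(beta, X))))

def max_score_grid(y, X, n_grid=720):
    """Grid search over unit-length (intercept, slope) pairs."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    candidates = np.column_stack([np.cos(angles), np.sin(angles)])
    scores = np.array([score(b, y, X) for b in candidates])
    return candidates[np.argmax(scores)], int(scores.max())

# Hypothetical data: errors have median zero but are heteroskedastic and fat-tailed.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.standard_t(df=3, size=n) * (1.0 + 0.5 * np.abs(x))
y = (0.5 - 1.0 * x + u > 0).astype(int)

beta_ms, best_score = max_score_grid(y, X)
print(beta_ms, best_score, lad_objective(beta_ms, y, X))
```

Because the objective is a step function, the maximizer is generally a set rather than a single point, and the grid search simply returns the first grid point attaining the maximum. Only the ratio of the two coefficients is meaningful given the scale normalization.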

The objective function $S_N(\boldsymbol{\beta})$ for the maximum score estimator given in (14.29) is
not differentiable. It can be rewritten as

$$S_N(\boldsymbol{\beta}) = \sum_{i=1}^{N} (2y_i - 1) \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + N - \sum_{i=1}^{N} y_i,$$

see Exercise 14.4. The second sum can be ignored as it does not involve $\boldsymbol{\beta}$.

An estimator with differentiable objective function is the smooth maximum score
estimator of Horowitz (1992) that maximizes

$$Q_N^{SMS}(\boldsymbol{\beta}) = \sum_{i=1}^{N} (2y_i - 1) K(\mathbf{x}_i'\boldsymbol{\beta}/h_N),$$

where $K(\mathbf{x}_i'\boldsymbol{\beta}/h_N)$ is a smoothed version of $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$. Since $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$ equals zero
for negative values of $\mathbf{x}_i'\boldsymbol{\beta}$ and equals one for positive values of $\mathbf{x}_i'\boldsymbol{\beta}$ it is natural to
choose $K(\cdot)$ to be a cdf with $K(0) = 0.5$ and choose $h_N$ to be small. Smoothing
simplifies computation of the estimator, but analysis is complicated by the need to
have $h_N \to 0$ at appropriate rate as $N \to \infty$. The estimator converges at rate close to
$\sqrt{N}$ and is asymptotically normal. For details see Horowitz (2002), who presents a
bootstrap that permits tests with better size properties in finite samples.

LAD estimation can be extended to the censored regression model (see Section
16.9.2).
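Returning to the smooth maximum score objective, the sketch below evaluates it in Python with the standard normal cdf used as the smoothing function. The normal cdf is chosen here only because it is a cdf with value 0.5 at zero; it is an assumption of this illustration and not necessarily the kernel used in the cited papers. The data, seed, and function names are likewise hypothetical. As the bandwidth shrinks, the smoothed objective approaches the indicator-based sum in the rewritten form of (14.29).

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal cdf applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(z) / math.sqrt(2.0)))

def smoothed_score(beta, y, X, h):
    """Sum of (2y - 1) K(x'beta / h) with K the standard normal cdf."""
    return float(np.sum((2 * y - 1) * normal_cdf(X @ beta / h)))

def indicator_score(beta, y, X):
    """Sum of (2y - 1) 1(x'beta > 0), the beta-dependent part of S_N(beta)."""
    return float(np.sum((2 * y - 1) * (X @ beta > 0)))

# Hypothetical data with an intercept and one regressor.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (0.3 - 1.0 * x + rng.logistic(size=n) > 0).astype(int)

beta = np.array([0.3, -1.0])
for h in (1.0, 0.1, 0.01):
    print(h, smoothed_score(beta, y, X, h), indicator_score(beta, y, X))

tiny_h_value = smoothed_score(beta, y, X, 1e-9)   # essentially unsmoothed
```

A differentiable objective of this kind can be passed to a gradient-based optimizer, although multiple local optima remain possible, so several starting values would normally be tried.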


14.7.3.  Maximum  Rank  Correlation  Estimator 

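Begin with a single-index model with $E[y_i | \mathbf{x}_i] = F(\mathbf{x}_i'\boldsymbol{\beta})$. If $F(\mathbf{x}'\boldsymbol{\beta})$ is monotonically
increasing in $\mathbf{x}'\boldsymbol{\beta}$, then $E[y_i | \mathbf{x}_i] > E[y_j | \mathbf{x}_j]$ if $\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$. Thus it is likely, though not
guaranteed, that the observed values $y_i > y_j$ when $\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$. This suggests choosing
$\boldsymbol{\beta}$ to ensure that with high frequency $y_i > y_j$ when $\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$.

The maximum rank correlation (MRC) estimator of Han (1987) chooses $\boldsymbol{\beta}$ to
maximize

$$Q_{N,RC}(\boldsymbol{\beta}) = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j<i}}^{N} \left[ \mathbf{1}(y_i > y_j)\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}) + \mathbf{1}(y_i < y_j)\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta}) \right].$$

The $ij$th term in this sum equals one if $y_i > y_j$ when $\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$ or if $y_i < y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta}$, and equals zero if instead there is a sign reversal so that $y_i < y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$ or $y_i > y_j$ when $\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta}$. The estimator is called the maximum rank
correlation estimator because $Q_{N,RC}(\boldsymbol{\beta})$ is a multiple of Kendall’s rank correlation
coefficient between $y_i$ and $\mathbf{x}_i'\boldsymbol{\beta}$.

This estimator is $\sqrt{N}$-consistent and asymptotically normal (see Sherman, 1993).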
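The MRC objective is a count of concordant pairs, which the following Python sketch computes with pairwise differences. The function name `mrc_objective` and the four-observation data set are hypothetical. The example also illustrates that the objective depends on the coefficients only through the ordering of the index, so rescaling the coefficient vector by a positive constant leaves it unchanged.

```python
import numpy as np

def mrc_objective(beta, y, X):
    """Number of pairs (i, j), j < i, for which y and x'beta are ordered the same way."""
    index = X @ beta
    dy = y[:, None] - y[None, :]            # y_i - y_j
    dv = index[:, None] - index[None, :]    # x_i'beta - x_j'beta
    concordant = ((dy > 0) & (dv > 0)) | ((dy < 0) & (dv < 0))
    # Keep the strictly lower triangle so that each pair with j < i is counted once.
    return int(np.sum(np.tril(concordant, k=-1)))

# Hypothetical data: one regressor, four observations.
y = np.array([0, 1, 1, 0])
X = np.array([[1.0], [3.0], [2.0], [0.0]])

q_pos = mrc_objective(np.array([1.0]), y, X)     # index ordered like y
q_scaled = mrc_objective(np.array([5.0]), y, X)  # same ordering, different scale
q_neg = mrc_objective(np.array([-1.0]), y, X)    # ordering reversed
print(q_pos, q_scaled, q_neg)
```

The pairwise matrices require memory of order $N^2$, so for large samples a loop over observations or a sorting-based count would be preferable.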


14.7.4. Semiparametric ML Estimation


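For binary choice data the likelihood function given independent observations is
clearly that given in (14.4). The only complication is that $F(\cdot)$ is unknown. Klein and
Spady (1993) proposed the semiparametric MLE that maximizes

$$\mathcal{L}_N(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left\{ y_i \ln \hat{F}(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \ln(1 - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta})) \right\},$$

where $\hat{F}(\mathbf{x}_i'\boldsymbol{\beta})$ is a nonparametric estimate of $F(\mathbf{x}_i'\boldsymbol{\beta})$.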

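This estimator is similar in spirit to the WSLS estimator of Ichimura (1993) detailed
in Section 9.7.4, and similar issues in computation arise with iteration between
computation of $\boldsymbol{\beta}$ given $\hat{F}$ and computation of $\hat{F}$ given $\boldsymbol{\beta}$. Given the ML first-order
conditions (14.5), the semiparametric MLE can also be computed as the solution to
the equations

$$\sum_{i=1}^{N} \frac{\hat{F}'(\mathbf{x}_i'\boldsymbol{\beta})}{\hat{F}(\mathbf{x}_i'\boldsymbol{\beta})(1 - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta}))} (y_i - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta})) \mathbf{x}_i = \mathbf{0},$$

which are the same as those for the WSLS estimator with weights $w_i = \hat{F}_i'/[\hat{F}_i(1 - \hat{F}_i)]$.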

The attraction of Klein and Spady’s estimator is that it is fully efficient in the sense
that it attains the semiparametric efficiency bound. Computation is difficult, however.
For details see Section 9.7.4, where similar computational issues are discussed for
Ichimura’s WSLS estimator, and see Klein and Spady (1993) and Pagan and Ullah
(1999, pp. 283-285).
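To make the structure of the semiparametric likelihood concrete, the sketch below evaluates a simplified version of it in Python. It is not Klein and Spady’s procedure in full: the nonparametric estimate is a plain leave-one-out kernel regression of $y$ on the index with a Gaussian kernel and a fixed bandwidth, and probabilities are clipped away from 0 and 1 in place of the trimming the authors use. The normalization (no intercept, first coefficient set to one), the data, and the function names are all assumptions made for illustration.

```python
import numpy as np

def loo_kernel_prob(index, y, h):
    """Leave-one-out kernel regression estimate of Pr[y = 1 | index]."""
    diff = (index[:, None] - index[None, :]) / h
    w = np.exp(-0.5 * diff ** 2)          # Gaussian kernel weights
    np.fill_diagonal(w, 0.0)              # exclude own observation
    denom = np.maximum(w.sum(axis=1), 1e-300)   # guard against division by zero
    return (w @ y) / denom

def semiparametric_loglik(beta, y, X, h, eps=1e-6):
    """Bernoulli log-likelihood with F replaced by the kernel estimate."""
    f_hat = np.clip(loo_kernel_prob(X @ beta, y, h), eps, 1.0 - eps)
    return float(np.sum(y * np.log(f_hat) + (1 - y) * np.log(1.0 - f_hat)))

# Hypothetical data: two regressors entering with equal coefficients.
rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
y = (2.0 * (X[:, 0] + X[:, 1]) + rng.logistic(size=n) > 0).astype(float)

ll_right = semiparametric_loglik(np.array([1.0, 1.0]), y, X, h=0.3)   # correct direction
ll_wrong = semiparametric_loglik(np.array([1.0, -1.0]), y, X, h=0.3)  # orthogonal direction
print(ll_right, ll_wrong)
```

An estimator would maximize this function over the free coefficient, recomputing the kernel estimate at every trial value, which is where the computational difficulty noted above arises.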


14.7.5.  Comparison  of  Semiparametric  Estimators 

Econometricians  focus  on  single-index  models,  and  even  then  there  are  a  multitude  of 
semiparametric  estimators  available  for  the  binary  outcome  model.  None  of  these  esti¬ 
mators  are  particularly  simple  to  implement.  The  objective  functions  can  have  multiple 
optima and may not be smooth. For example, Horowitz (1992) uses simulated annealing
for the smooth maximum score estimator and Dorsey and Mayer (1995) use genetic
algorithms to obtain the maximum score estimator.

Interpretation  of  coefficients  is  also  difficult.  For  example  the  maximum  score  esti¬ 
mator  applied  to  the  fishing  mode  data  yielded  intercept  estimate  of  0.776  and  slope  of 
—0.631  (with  bootstrap-estimated  standard  error  of  0.103),  but  these  coefficients  are 
not  directly  comparable  to  those  given  in  Table  14.2.  Indeed,  since  parameter  slope 
estimates  are  only  identified  up  to  scale,  the  semiparametric  estimates  are  most  use¬ 
ful  if  several  coefficients  are  included  in  the  regression  and  coefficient  estimates  are 
compared  to  those  of  a  reference  variable. 

The maximum score and maximum rank correlation estimators are unusual among
semiparametric estimators in not requiring use of smoothing parameters, such as
choice of a bandwidth, an attractive property. The latter of these estimators is
$\sqrt{N}$-consistent.

In  recent  work  Blundell  and  Powell  (2004)  propose  semiparametric  estimation  with 
endogenous  regressors. 

14.8.  Derivation  of  Logit  from  Type  I  Extreme  Value 

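The derivation in Section 14.4.2 of the logit model from the ARUM used knowledge
of the statistical result that the difference $(\varepsilon_0 - \varepsilon_1)$ of independent type 1 extreme
value random variables is logistic distributed. For completeness we provide a direct
derivation based on the distributions of $\varepsilon_0$ and $\varepsilon_1$.

Rewriting the second line of (14.20) yields

$$\begin{aligned}
\Pr[y = 1] &= \Pr[\varepsilon_0 < \varepsilon_1 + V_1 - V_0] \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0, \varepsilon_1) \, d\varepsilon_0 \, d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1) \left\{ \int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0) \, d\varepsilon_0 \right\} d\varepsilon_1,
\end{aligned} \tag{14.31}$$

where in the last line $\varepsilon_0$ and $\varepsilon_1$ are assumed to be independent. By specializing $f(\varepsilon_0)$
to the type 1 extreme value density, (14.31) becomes

$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} f(\varepsilon_1) \left\{ \int_{-\infty}^{\varepsilon_1 + V_1 - V_0} e^{-\varepsilon_0} \exp(-e^{-\varepsilon_0}) \, d\varepsilon_0 \right\} d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1) \left[ \exp(-e^{-\varepsilon_0}) \right]_{-\infty}^{\varepsilon_1 + V_1 - V_0} d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1) \exp(-e^{-(\varepsilon_1 + V_1 - V_0)}) \, d\varepsilon_1.
\end{aligned} \tag{14.32}$$

Using the extreme value density for $\varepsilon_1$ in (14.32) yields

$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} e^{-\varepsilon_1} \exp(-e^{-\varepsilon_1}) \exp(-e^{-(\varepsilon_1 + V_1 - V_0)}) \, d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1} \exp(-e^{-\varepsilon_1} - e^{-(\varepsilon_1 + V_1 - V_0)}) \, d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1} \exp(-e^{-\varepsilon_1} - e^{-\varepsilon_1} e^{-(V_1 - V_0)}) \, d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1} \exp\{ -e^{-\varepsilon_1} (1 + e^{-(V_1 - V_0)}) \} \, d\varepsilon_1.
\end{aligned} \tag{14.33}$$

Since $\int_{-\infty}^{\infty} a e^{-\varepsilon} \exp(-a e^{-\varepsilon}) \, d\varepsilon = 1$ it follows that $\int_{-\infty}^{\infty} e^{-\varepsilon} \exp(-a e^{-\varepsilon}) \, d\varepsilon = 1/a$.
Using this result with $a = 1 + e^{-(V_1 - V_0)}$ in (14.33) yields

$$\begin{aligned}
\Pr[y = 1] &= (1 + e^{-(V_1 - V_0)})^{-1} \\
&= e^{V_1}/(e^{V_0} + e^{V_1}) \\
&= e^{V_1 - V_0}/(1 + e^{V_1 - V_0}).
\end{aligned} \tag{14.34}$$

Letting $V_1 - V_0 = \mathbf{x}'\boldsymbol{\beta}$ yields the logit model.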
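The result of this derivation can be checked by simulation. The Python sketch below draws independent type 1 extreme value errors for the two alternatives, records how often alternative 1 has the higher utility, and compares that frequency with the logit probability. The values of $V_0$ and $V_1$, the seed, and the sample size are hypothetical; NumPy’s `Generator.gumbel` with default location 0 and scale 1 has density $e^{-\varepsilon}\exp(-e^{-\varepsilon})$, which is the density used above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400_000
v0, v1 = 0.3, 1.1                      # hypothetical deterministic utilities

e0 = rng.gumbel(size=n)                # type 1 extreme value errors, alternative 0
e1 = rng.gumbel(size=n)                # type 1 extreme value errors, alternative 1

# Alternative 1 is chosen when its utility is higher: e0 < e1 + v1 - v0.
p_sim = float(np.mean(v1 + e1 > v0 + e0))
p_logit = 1.0 / (1.0 + np.exp(-(v1 - v0)))
print(p_sim, p_logit)
```

The two numbers should agree up to simulation error, which is of order $1/\sqrt{n}$.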


14.9.  Practical  Considerations 

Most  packages  provide  probit  and  logit  model  estimators.  The  main  choice  for  the 
practitioner  is  which  model  to  use.  In  practice  there  is  little  difference  in  the  predicted 
marginal  effects  obtained  from  the  two  models,  unless  most  of  the  outcomes  are  zero 
or  most  of  the  outcomes  are  one. 

Semiparametric estimation generally requires special coding in languages such as
GAUSS, though LIMDEP implements the estimators of Manski and Klein and Spady.

14.10.  Bibliographic  Notes 

Logit  and  probit  models  are  commonly  used  and  relatively  simple  nonlinear  regression  models 
that  appear  in  many  standard  texts  such  as  Greene’s  (2003).  The  surveys  by  Amemiya  (1981) 
and  McFadden  (1984)  include  all  the  basic  results.  Maddala  (1983)  and  Amemiya  (1985)  pro¬ 
vide  further  details.  The  books  by  Train  (1986)  and  Ben-Akiva  and  Lerman  (1985)  are  particu¬ 
larly  good  for  applications.  These  references  cover  both  binary  and  multinomial  outcomes. 

14.3  Bliss  (1934)  proposed  the  probit  transformation  to  plot  dosage-mortality  curves.  Berkson 
(1951)  popularized  use  of  the  simpler  logit  model. 

14.4  Latent  variable  models  are  especially  popular  in  the  psychometrics  literature. 

14.5  Amemiya  (1985,  Section  9.5)  provides  an  excellent  survey  of  choice-based  sampling  for 
binary  outcome  models.  See  also  Section  24.4. 

14.6 Cameron (1990) considers aggregation in binary outcome models and summarizes general
results of Kelejian (1980) and Stoker (1984) on estimability of individual-level parameters
in nonlinear models using aggregate data.

14.7  The  maximum  score  estimator  of  Manski  (1975)  is  a  leading  early  example  of  semipara¬ 
metric  regression.  Semiparametric  methods  for  binary  outcome  models  are  covered  in  the 
books by M-J. Lee (1996), Horowitz (1997), and Pagan and Ullah (1999). The last reference
covers many methods.

Exercises

14-1 Consider a latent variable modeled by $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}[0, 1]$. Suppose
we observe only $y_i = 1$ if $y_i^* < U_i$ and $y_i = 0$ if $y_i^* > U_i$, where the upper limit $U_i$
is a known constant for each individual (i.e., data) and may differ over individuals.

(a) Find $\Pr[y_i = 1 | \mathbf{x}_i]$. [Hint: Note that this differs from the standard case both
due to presence of $U_i$ and because the inequalities are reversed with $y_i = 1$
if $y_i^* < U_i$.]

(b) Provide details on an estimation method to consistently estimate $\boldsymbol{\beta}$.

(c) Suppose you estimate this model and find that the third regressor $x_{3i}$ has
estimated coefficient $\hat{\beta}_3 = 0.2$. Provide a meaningful interpretation of $\hat{\beta}_3$.

14-2 Consider the logit model with $\Pr[y_i = 1 | x_{1i}, x_{2i}] = \Lambda(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})$, where
$\Lambda(z) = e^z/(1 + e^z)$.

(a) Write down the likelihood scores and information matrix in an expanded
form.

(b) Use these to derive Wald and LM score tests of $H_0 : \beta_3 = 0$.

(c) Explain how you would computationally implement the tests.

(d) In what sense is the logit model intrinsically heteroskedastic?

14-3 Suppose we use an index formulation for a discrete choice model but it is felt
that the latent variable is strictly positive. This is accommodated by supposing
that the latent variable $y^*$ has exponential density with parameter $\gamma$, so the
density $f(y^*)$ is $f(y^*) = \gamma^{-1} \exp(-y^*/\gamma)$, with $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$. We observe $y = 1$
if $y^* > \mathbf{z}'\boldsymbol{\alpha}$ and $y = 0$ if $y^* < \mathbf{z}'\boldsymbol{\alpha}$.

(a) Give the log-likelihood function for the observed data.

(b) What is the effect of a one-unit change in $x_{ji}$ on $\Pr[y = 1]$?

(c) Suppose that $y = 1$ if $y^* > \exp(\mathbf{z}'\boldsymbol{\alpha})$ and $\mathbf{x} = \mathbf{z}$. Do you see any problems in
identifying $\boldsymbol{\alpha}$ and/or $\boldsymbol{\beta}$? Explain your answer.

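14-4 Consider the maximum score estimator with objective functions $S_N(\boldsymbol{\beta})$ given in
(14.29) and $Q_N(\boldsymbol{\beta})$ given in (14.30).

(a) Show that $S_N(\boldsymbol{\beta}) = \sum_i [\mathbf{1}(y_i = 1) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + \mathbf{1}(y_i = 0) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < 0)]$.

(b) Show that $Q_N(\boldsymbol{\beta}) = \sum_i [\mathbf{1}(y_i = 1) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < 0) + \mathbf{1}(y_i = 0) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)]$.

(c) Using $\mathbf{1}(y_i = 1) = 1 - \mathbf{1}(y_i = 0)$, show that $Q_N(\boldsymbol{\beta}) = N - S_N(\boldsymbol{\beta})$.

(d) Using $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < 0) = 1 - \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$ show that (14.29) can be rewritten as
$S_N(\boldsymbol{\beta}) = \sum_i (2y_i - 1)\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + N - \sum_i y_i$.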

14-5  Use  the  health  expenditure  data  of  Section  16.6.  The  model  is  a  probit  regres¬ 
sion  of  DMED,  an  indicator  variable  for  positive  health  expenditures,  against  just 
one  regressor  for  simplicity,  NDISEASE,  the  number  of  chronic  diseases. 

(a)  Obtain  the  OLS  estimate  of  the  slope  parameter. 

(b)  Obtain  the  probit  estimate  of  the  slope  parameter. 

(c) Given part (b), obtain the marginal effect of chronic diseases in two ways:
averaged over the sample and evaluated at the sample average of NDISEASE.

(d) Obtain the logit estimate of the slope parameter.

(e) Given part (d), obtain the marginal effect of chronic diseases in three ways:
averaged over the sample, evaluated at the sample average of NDISEASE,
and evaluated at $\Lambda(\mathbf{x}'\hat{\boldsymbol{\beta}}) = \bar{y}$.

(f) For the logit model calculate the proportionate change in the odds ratio
when NDISEASE changes.

14-6  Continue  the  analysis  of  Exercise  14.5. 

(a)  Compare  the  three  binary  models  on  the  basis  of  statistical  significance  of 
NDISEASE. 

(b)  Compare  the  three  binary  models  on  the  basis  of  the  estimated  marginal 
effect. 

(c)  Compare  the  three  binary  models  on  the  basis  of  the  predicted  probabilities. 

(d) Compare the logit and probit binary models on the basis of log-likelihood.


Multinomial Models


15.1.  Introduction 

The  preceding  chapter  considered  models  for  discrete  outcome  variables  that  can  take 
one  of  two  possible  values.  Here  we  consider  several  possible  outcomes,  usually  mu¬ 
tually  exclusive.  Examples  include  different  ways  to  commute  to  work  (by  bus,  car,  or 
walking),  various  types  of  health  insurance  (fee-for-service,  managed  care,  or  none), 
different  employment  status  (full-time,  part-time,  or  none),  choice  of  recreational  site, 
occupational  choice,  and  product  choice. 

Statistical inference is relatively straightforward in principle, as the data must be
multinomial  distributed,  just  as  binary  data  must  be  Bernoulli  or  binomial  distributed. 
Estimation  is  most  often  by  maximum  likelihood  because  the  data  are  clearly  multino¬ 
mial  distributed.  For  some  complications,  however,  moment-based  estimation  is  used 
instead. 

Different  multinomial  models  arise  owing  to  different  functional  forms  for  the  prob¬ 
abilities  of  the  multinomial  distribution,  similar  to  the  differences  between  probit  and 
logit  in  the  binary  case.  A  distinction  is  also  made  between  models  where  regressors 
vary  across  alternatives  for  a  given  individual  and  models  where  regressors  are  con¬ 
stant  across  alternatives.  For  example,  in  transportation  mode  choice  some  regressors, 
such  as  travel  time  or  cost,  will  vary  with  choices  whereas  others,  such  as  age,  are 
choice  invariant. 

The  simplest  multinomial  model,  the  conditional  or  multinomial  logit  model,  is 
quite  straightforward  to  use  but  is  viewed  as  too  restrictive  in  practice,  especially  if 
the  multinomial  outcome  data  arise  from  individual  choice.  For  unordered  outcomes 
less  restrictive  models  can  be  obtained  using  the  random  utility  model.  In  this  model 
the  alternative  with  the  highest  utility  is  chosen,  where  utility  from  each  alternative  is 
the  sum  of  deterministic  and  random  components.  Different  specifications  of  the  ran¬ 
dom  components  lead  to  different  functional  forms  for  choice  probabilities  and  hence 
to  different  multinomial  models.  Additional  models  arise  in  applications  where  some 
structure can be placed on the decision-making process, such as a natural ordering of
alternatives or sequencing of decisions. In practice many different multinomial models
are used.

Section  15.2  presents  an  application  to  illustrate  the  issues  discussed  in  this  chap¬ 
ter.  General  results  for  multinomial  models  are  given  in  Section  15.3.  The  conditional 
and  multinomial  logit  models  are  presented  in  Section  15.4.  The  additive  random  util¬ 
ity  model  is  presented  in  Section  15.5.  The  nested  logit,  random  parameters  logit, 
and  multinomial  probit  models  are  the  subject  of  Sections  15.6-15.8.  Ordered  and  se¬ 
quential  models  are  detailed  in  Section  15.9.  Multivariate  models  with  more  than  one 
discrete  outcome  variable  are  presented  in  Section  15.10.  Semiparametric  estimators 
are  briefly  reviewed  in  Section  15.11. 

15.2.  Example:  Choice  of  Fishing  Mode 

This  section  illustrates  multinomial  logit,  the  simplest  unordered  multinomial  model, 
and  variations  detailed  in  Section  15.4  that  permit  regressors  to  vary  across  alterna¬ 
tives.  The  emphasis  is  on  interpretation  of  estimated  models.  The  marginal  effect  of 
a  change  in  a  regressor  is  more  complicated  than  the  usual  impact  on  a  single  condi¬ 
tional  mean.  For  multinomial  data  there  is  instead  a  separate  marginal  effect  on  the 
probability  of  each  outcome,  and  these  marginal  effects  sum  to  zero  since  probabilities 
sum  to  one. 

The  application  is  to  choice  of  fishing  mode.  The  dependent  variable  y  takes  value 
1,  2,  3,  or  4  depending  on  which  of  the  four  mutually  exclusive  alternative  modes 
of  fishing  -  respectively,  beach,  pier,  private  boat,  and  charter  boat  -  is  chosen.  An 
unordered  multinomial  model  such  as  multinomial  logit  is  appropriate,  since  there  is 
no  clear  ordering  of  the  outcome  variable.  Regressors  are  individual  income,  which 
does  not  vary  with  fishing  mode,  and  price  and  catch  rate,  which  do  vary  by  fishing 
mode  and  across  individuals. 

The  sample  of  1,182  people  comes  from  a  survey  conducted  by  Thomson  and 
Crooke  (1991)  and  analyzed  by  Herriges  and  Kling  (1999).  The  data  are  summarized 
in  Table  15.1,  which  gives  averages  for  the  subsamples  of  people  who  chose  each  of 
the  modes  as  well  as  the  overall  sample  average  of  regressors. 

15.2.1.  Conditional  Logit:  Alternative-Varying  Regressors 

First  consider  the  role  of  price  and  catch  rate,  regressors  that  vary  across  alternatives 
except  that  for  these  data  the  price  of  beach  and  pier  fishing  are  the  same. 

Looking  down  the  columns  of  Table  15.1,  we  see  that  people  tend  to  fish  where  it  is 
cheapest  for  them  to  do  so.  For  example,  for  people  choosing  to  fish  from  the  beach  the 
average  price  was  $36  compared  to  average  prices  of  $36,  $98,  and  $125  for  the  other 
modes.  More  generally,  for  people  choosing  the  beach  and  pier  these  modes  were  on 
average  much  cheaper  than  the  boat  modes,  and  for  people  fishing  from  a  boat  this  was 
on  average  much  cheaper  than  beach  or  pier  fishing.  The  relationship  between  mode 
choice  and  catch  rate  is  less  clear-cut,  though  it  is  clear  that  the  charter  boat  has  the 
highest catch rate.

Table 15.1. Fishing Mode Multinomial Choice: Data Summary

                                        Subsample Averages
                              y = 1     y = 2     y = 3     y = 4     All y
Explanatory Variable          Beach     Pier      Private   Charter   Overall

Income ($1,000s per month)    4.052     3.387     4.654     3.881     4.099
Price beach ($)               36        31        138       121       103
Price pier ($)                36        31        138       121       103
Price private ($)             98        82        42        45        55
Price charter ($)             125       110       71        75        84
Catch rate beach              0.28      0.26      0.21      0.25      0.24
Catch rate pier               0.22      0.20      0.13      0.16      0.16
Catch rate private            0.16      0.15      0.18      0.18      0.17
Catch rate charter            0.52      0.50      0.65      0.69      0.63
Sample probability            0.113     0.151     0.354     0.382     1.000
Observations                  134       178       418       452       1182

For alternative-specific regressors that vary across alternatives, such as price and
catch rate, the multinomial logit model is called a conditional logit model (see Section
15.4.1). The probability of the $i$th individual choosing fishing mode $j$ is given by

$$p_{ij} = \Pr[y_i = j] = \frac{\exp(\beta_P P_{ij} + \beta_C C_{ij})}{\sum_{k=1}^{4} \exp(\beta_P P_{ik} + \beta_C C_{ik})}, \qquad j = 1, \ldots, 4,$$

where $P$ denotes price, $C$ denotes catch rate, the subscript $i$ denotes the $i$th individual,
and subscript $j$ or $k$ denotes the alternative. This model is an obvious extension of
binary logit and gives probabilities that lie between 0 and 1 and sum to one. Other
multinomial models use a different functional form for $p_{ij}$.

The coefficient estimates are given in the CL column of Table 15.2. For the CL
model, though not for all multinomial models, the sign of the coefficient is directly
interpretable. Anticipating results from Section 15.4.3, since $\beta_P < 0$ we have that an
increase in the price of one alternative decreases the probability of choosing that
alternative and increases the probability of choosing other alternatives. Similarly, since
$\beta_C > 0$, an increase in the catch rate for one alternative increases choice probability
for that alternative and decreases the choice probability for other alternatives.

A standard measure of the impact of changes in regressors is $N^{-1} \sum_{i=1}^{N} \partial p_{ij}/\partial x_{ikr}$,
the average marginal response of the probability of choosing alternative $j$ when the
$r$th regressor increases by one unit for alternative $k$ and is unchanged for the other
alternatives. For the CL model this is estimated by $N^{-1} \sum_{i=1}^{N} \hat{p}_{ij}(\delta_{ijk} - \hat{p}_{ik})\hat{\beta}_r$ (see
(15.18)), where $\delta_{ijk}$ equals one if $j = k$ and zero otherwise, $\hat{\boldsymbol{\beta}}$ is the estimate of $\boldsymbol{\beta}$,
and $\hat{p}_{ij}$, $j = 1, \ldots, m$, are the predicted probabilities.

The average responses across the four modes for the two regressors price and catch
rate are given in Table 15.3. The table gives the effect on choice probability of a
100-unit (or $100) change in price and the effect of a one-unit change in the catch rate. For
example, an increase of $100 in the price of beach fishing leads to a decrease of 0.272
in the probability of beach fishing and increases of 0.119, 0.080, and 0.068, respectively, in
the probabilities of fishing from a pier, a private boat, and a charter boat. Note
that the changes in probabilities sum to zero, as expected.

Table 15.2. Fishing Mode Multinomial Choice: Logit Estimates^a

                                                           Model type
Regressor        Type        Coefficient                   CL        MNL       Mixed
Price (P)        Specific    \(\beta_P\)                   -0.021    -         -0.025
Catch rate (C)   Specific    \(\beta_C\)                    0.953    -          0.358
Intercept        Invariant   \(\alpha_1\): Beach            -         0.0       0.0
                             \(\alpha_2\): Pier             -         0.814     0.778
                             \(\alpha_3\): Private          -         0.739     0.527
                             \(\alpha_4\): Charter          -         1.341     1.694
Income (I)       Invariant   \(\beta_{I1}\): Beach          -         0.0       0.0
                             \(\beta_{I2}\): Pier           -        -0.143    -0.128
                             \(\beta_{I3}\): Private        -         0.092     0.089
                             \(\beta_{I4}\): Charter        -        -0.032    -0.033
ln L                                                       -1311     -1477     -1215
Pseudo-\(R^2\)                                              0.162     0.099     0.258

a Type of regressor is alternative-specific (price and catch rate) or alternative-invariant (income). Outcomes are
(1) beach, (2) pier, (3) private, and (4) charter. MLE estimates are for conditional logit (CL), multinomial logit
(MNL), and mixed logit (Mixed) models. MNL and Mixed models are normalized to base category beach. All
estimates except that for \(\beta_{I4}\) are statistically significant at 5%.

Calculation of these marginal effects and probabilities requires postestimation
computation. A back-of-the-envelope calculation uses \(\bar{p}_j (\delta_{jk} - \bar{p}_k) \hat{\beta}_r\) for the CL model,
where \(\bar{p}_j\) is the sample average probability. For the effect of a $100 change in the
price of beach fishing on the probability of beach fishing this yields
100 x 0.113(1 - 0.113) x (-0.021) = -0.21, compared to the sample average value of -0.272 in
the table. This approximation becomes less reasonable as probabilities get closer
to 0 or 1.
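The back-of-the-envelope rule can be applied to every pair of alternatives at once. The following is a minimal sketch, assuming NumPy is available; the function name `approx_cl_marginal_effects` is illustrative and not from the text. It uses only the sample probabilities from Table 15.1 and the CL price coefficient from Table 15.2, so it reproduces the rough approximation, not the sample-average values reported in Table 15.3.

```python
import numpy as np

# Sample probabilities (beach, pier, private, charter) from Table 15.1.
p_bar = np.array([0.113, 0.151, 0.354, 0.382])
# CL price coefficient from Table 15.2.
beta_price = -0.021

def approx_cl_marginal_effects(p, beta, change=1.0):
    """Approximate effects p_j * (delta_jk - p_k) * beta * change.

    Entry [j, k] is the change in the probability of alternative j when the
    regressor of alternative k changes by `change` units.
    """
    p = np.asarray(p, dtype=float)
    # diag(p) - p p' has [j, k] entry p_j * (delta_jk - p_k).
    return change * beta * (np.diag(p) - np.outer(p, p))

effects = approx_cl_marginal_effects(p_bar, beta_price, change=100.0)
print(np.round(effects, 3))
```

The first diagonal entry is the -0.21 computed in the text. Each column sums to zero because the sample probabilities sum to one.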

The results in Table 15.3 are consistent with the view that the greatest substitution
is between pier and beach fishing and between private boat and charter boat
fishing. Specifically, price increases, or catch rate decreases, for pier lead to
substitution to beach, and vice versa. A similar result holds for charter versus private
boat.

Table 15.3. Fishing Mode Choice: Marginal Effects for Conditional Logit Model^a

                         $100 Change in Price of               One-Unit Change in Catch Rate for
                         Beach    Pier     Private  Charter    Beach    Pier     Private  Charter
Change in Pr[beach]      -.272     .119     .085     .068       .126    -.055    -.040    -.032
Change in Pr[pier]        .119    -.263     .080     .064      -.055     .122    -.037    -.030
Change in Pr[private]     .080     .080    -.391     .225      -.040    -.037     .182    -.105
Change in Pr[charter]     .068     .064     .225    -.357      -.032    -.030    -.105     .166

a Average marginal response of the probability of choosing each alternative when a regressor changes for one of
the alternatives and is unchanged for the other alternatives.

These choice probability changes are for large changes in the regressors, given that
average price is $86 and average catch rate is 0.30. One can instead calculate
elasticities. Elasticities for choice probabilities need to be used with care, however, because
probabilities are bounded between 0 and 1. A change in predicted probability from
0.01 to 0.02 will lead to an elasticity roughly 50 times larger than that for a change in
predicted probability from 0.50 to 0.51.


15.2.2.  Multinomial  Logit:  Alternative-Invariant  Regressors 

Now  consider  the  role  of  income,  measured  as  monthly  income  in  thousands  of  dollars. 
From  Table  15.1  it  appears  that  as  income  rises  the  fishing  mode  moves  progressively 
from  pier,  where  average  monthly  income  of  people  fishing  at  a  pier  is  $3,387,  to 
charter  boat  to  beach  and  finally  to  private  boat,  where  the  average  income  is  $4,654. 

Because income is invariant across alternatives the appropriate model is the
multinomial logit model (presented in Section 15.4.1). This lets regressor coefficients vary
across alternatives, with

\[ p_{ij} = \Pr[y_i = j] = \frac{\exp(\alpha_j + \beta_{Ij} I_i)}{\sum_{k=1}^{4} \exp(\alpha_k + \beta_{Ik} I_i)}, \quad j = 1, \ldots, 4, \]

where \(I\) denotes income. A normalization of parameters is needed as a consequence
of the restriction that probabilities sum to one. The empirical results set \(\alpha_1 = 0\) and
\(\beta_{I1} = 0\).

The parameter estimates are given in the MNL column of Table 15.2. Coefficient
interpretation is more difficult than for the CL model. In particular, for MNL
models a positive regression parameter does not mean that an increase in the regressor
leads to an increase in the probability of that alternative. Instead, interpretation for
the MNL model is relative to the reference or base category group, here beach as
the beach coefficients were normalized to zero. Compared to beach fishing a higher
income leads to reduced likelihood of fishing from a pier (since \(\hat{\beta}_{I2} = -0.143 < 0\))
or a charter boat (since \(\hat{\beta}_{I4} = -0.032\)) and greater likelihood of use of a private boat
(since \(\hat{\beta}_{I3} = 0.092\)).

The magnitude of the response to income changes can be measured using
\(N^{-1} \sum_{i=1}^{N} \partial p_{ij} / \partial I_i\), the marginal effect averaged over individuals. For the MNL
models this is estimated by \(N^{-1} \sum_{i=1}^{N} \hat{p}_{ij} (\hat{\beta}_j - \bar{\hat{\beta}}_i)\) (see (15.19)), where \(\hat{\beta}_j\) is the
estimate of \(\beta_j\), \(\bar{\hat{\beta}}_i = \sum_{l=1}^{m} \hat{p}_{il} \hat{\beta}_l\) is a probability-weighted average of the \(\hat{\beta}_l\), and \(\hat{p}_{ij}\),
\(j = 1, \ldots, m\), are the predicted probabilities. For the four choices a $1,000 increase
in monthly income is associated with changes of 0.000, -0.021, 0.033, and -0.012
in, respectively, the probabilities of fishing from beach, pier, private boat, and charter
boat. This indicates little change in beach fishing, movement out of pier and charter
boat fishing, and movement to private boat fishing. Since average monthly income is
$4,100 the changes in probability are of reasonable size.
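A rough check on these income effects evaluates the same formula once at the sample probabilities instead of averaging over individuals. The sketch below assumes NumPy; the helper name `approx_mnl_marginal_effects` is illustrative, and the inputs are the sample probabilities from Table 15.1 and the MNL income coefficients from Table 15.2. It is an approximation to the averaged effects, not the authors' postestimation computation.

```python
import numpy as np

# Sample probabilities (beach, pier, private, charter) from Table 15.1.
p_bar = np.array([0.113, 0.151, 0.354, 0.382])
# MNL income coefficients from Table 15.2 (beach is the base category).
beta_income = np.array([0.0, -0.143, 0.092, -0.032])

def approx_mnl_marginal_effects(p, beta):
    """Evaluate p_j * (beta_j - beta_bar), with beta_bar = sum_l p_l * beta_l."""
    p = np.asarray(p, dtype=float)
    beta = np.asarray(beta, dtype=float)
    beta_bar = p @ beta  # probability-weighted average coefficient
    return p * (beta - beta_bar)

income_effects = approx_mnl_marginal_effects(p_bar, beta_income)
print(np.round(income_effects, 3))
```

With these inputs the rounded values agree with the four changes quoted above, and they sum to zero because the probabilities sum to one.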
However, income alone is not a great discriminator for fishing mode choice. From
the bottom of Table 15.2, we see that the MNL model has much lower log-likelihood
and pseudo-\(R^2\) than does the CL model. From output not given, across all individuals
in the sample the predicted probabilities from the MNL model range from 0.095 to
0.115 for beach, 0.036 to 0.234 for pier, 0.240 to 0.626 for private boat, and 0.244 to
0.416 for charter boat. Since an intercept is included in the MNL model the averages
of these predicted probabilities for each choice equal the sample average probabilities.
This result for the MNL model is a consequence of (15.16) given later.


15.2.3.  Mixed  Logit 


A  richer  model  combines  the  two  preceding  models.  This  is  done  using  a  so-called 
mixed  logit  model  (see  Section  15.4.1)  with 


\[ \Pr[y_i = j] = \frac{\exp(\beta_P P_{ij} + \beta_C C_{ij} + \alpha_j + \beta_{Ij} I_i)}{\sum_{k=1}^{4} \exp(\beta_P P_{ik} + \beta_C C_{ik} + \alpha_k + \beta_{Ik} I_i)}. \]

This model, not to be confused with the model of Section 15.7 which is also referred
to as a mixed model, can be implemented as a conditional logit model

\[ \Pr[y_i = j] = \frac{\exp\left(\beta_P P_{ij} + \beta_C C_{ij} + \sum_{l=1}^{4} (\alpha_l d_{ijl} + \beta_{Il}\, dI_{ijl})\right)}{\sum_{k=1}^{4} \exp\left(\beta_P P_{ik} + \beta_C C_{ik} + \sum_{l=1}^{4} (\alpha_l d_{ikl} + \beta_{Il}\, dI_{ikl})\right)}, \]

where \(d_{ijl}\) is a dummy variable equal to one if \(j = l\) and equal to zero otherwise,
and \(dI_{ijl} = d_{ijl} I_i\) is equal to income if \(j = l\) and equals zero otherwise. In this case
we regress \(y_i\) on eight regressors: \(P_{ij}\), \(C_{ij}\), \(d_{ij2}\), \(d_{ij3}\), \(d_{ij4}\), \(dI_{ij2}\), \(dI_{ij3}\), and \(dI_{ij4}\).
Since \(\alpha_1 = 0\) and \(\beta_{I1} = 0\) the regressors \(d_{ij1}\) and \(dI_{ij1}\) are omitted. Note that if we
estimate this CL model with just the \(d_{ijl}\) and \(dI_{ijl}\) as regressors then the CL estimates
equal the MNL estimates given earlier. An MNL model can always be estimated as a
CL model (see Section 15.3.4).
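The eight-regressor layout can be built mechanically from price, catch rate, and income. The sketch below is one way to do it, assuming NumPy; the function name `mixed_as_cl_design`, the array shapes, and the small input arrays are illustrative assumptions, not part of the original application.

```python
import numpy as np

def mixed_as_cl_design(price, catch, income):
    """Stack regressors for the mixed model written as a conditional logit.

    price, catch: (N, m) alternative-varying regressors.
    income: (N,) alternative-invariant regressor.
    Returns an (N, m, 2 + 2*(m-1)) array with columns ordered as
    P, C, d_2..d_m, dI_2..dI_m (alternative 1 is the base category).
    """
    price = np.asarray(price, dtype=float)
    catch = np.asarray(catch, dtype=float)
    income = np.asarray(income, dtype=float)
    n, m = price.shape
    # Row j of `dummies` holds d_jl for l = 2..m; the base row is all zeros.
    dummies = np.eye(m)[:, 1:]
    d = np.broadcast_to(dummies, (n, m, m - 1))
    d_income = d * income[:, None, None]  # dI_ijl = d_ijl * I_i
    return np.concatenate(
        [price[:, :, None], catch[:, :, None], d, d_income], axis=2
    )

# Hypothetical values for two individuals and four alternatives.
price = np.array([[150.0, 150.0, 160.0, 180.0], [15.0, 15.0, 10.0, 35.0]])
catch = np.array([[0.07, 0.05, 0.26, 0.54], [0.11, 0.05, 0.16, 0.47]])
income = np.array([7.1, 1.2])
X = mixed_as_cl_design(price, catch, income)
print(X.shape)
```

Each individual contributes one row per alternative, and the income interactions are nonzero only in the row of the matching alternative.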

While the mixed logit model is richer than the CL model, the CL model has the
advantage that if an additional alternative were added to the choice set then one can predict
its probability of selection, since the parameters of the CL model do not vary across
alternatives.

The results are reported in the last column of Table 15.2. Compared to the first
two models the coefficients are little changed, except for considerable change in the
catch rate coefficient. This change is due to inclusion of the alternative-specific
dummies, rather than inclusion of income. The mixed model is strongly preferred to the
other models on the basis of much higher log-likelihood value or formal statistical
tests.


15.3.  General  Results 

The  results  in  this  section  pertain  to  all  multinomial  models.  The  remainder  of  the 
chapter  specializes  to  the  many  different  specifications  of  the  multinomial  model  used 
in  practice. 
15.3.1. Multinomial Models

There are \(m\) alternatives and the dependent variable \(y\) is defined to take value \(j\) if the
\(j\)th alternative is taken, \(j = 1, \ldots, m\). (Some authors instead consider \(m + 1\)
alternatives with \(j = 0, 1, \ldots, m\).) Define the probability that alternative \(j\) is chosen as

\[ p_j = \Pr[y = j], \quad j = 1, \ldots, m. \tag{15.1} \]

Introduce \(m\) binary variables for each observation \(y\),

\[ y_j = \begin{cases} 1 & \text{if } y = j, \\ 0 & \text{if } y \neq j. \end{cases} \tag{15.2} \]

Thus \(y_j\) equals one if alternative \(j\) is the observed outcome and the remaining \(y_k\) equal
zero, so for each observation on \(y\) exactly one of \(y_1, y_2, \ldots, y_m\) will be nonzero. The
multinomial density for one observation can then be conveniently written as

\[ f(y) = p_1^{y_1} \times \cdots \times p_m^{y_m} = \prod_{j=1}^{m} p_j^{y_j}. \tag{15.3} \]

For regression models introduce a subscript \(i\) for the \(i\)th individual and regressors
\(\mathbf{x}_i\). Specify a model for the probability that individual \(i\) chooses the \(j\)th alternative,

\[ p_{ij} = \Pr[y_i = j] = F_j(\mathbf{x}_i, \boldsymbol{\beta}), \quad j = 1, \ldots, m, \quad i = 1, \ldots, N. \tag{15.4} \]

The functional form for \(F_j\) should be such that probabilities lie between 0 and 1 and
sum over \(j\) to one. Different functional specifications for \(F_j\) correspond to specific
models, notably multinomial logit, nested logit, multinomial probit, ordered,
sequential, and multivariate models. These models are presented in subsequent sections.


15.3.2.  ML  Estimation 


The multinomial density for one observation is given in (15.3). The likelihood function
for a sample of \(N\) independent observations is then
\(L_N = \prod_{i=1}^{N} \prod_{j=1}^{m} p_{ij}^{y_{ij}}\), where the
subscript \(i\) denotes the \(i\)th of \(N\) individuals and the subscript \(j\) denotes the \(j\)th of \(m\)
alternatives. The log-likelihood function is

\[ \mathcal{L} = \ln L_N = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij}, \tag{15.5} \]

where \(p_{ij} = F_j(\mathbf{x}_i, \boldsymbol{\beta})\) is a function of parameters \(\boldsymbol{\beta}\) and regressors, defined in (15.4).
More generally, the number of alternatives may vary across different individuals, so
that \(m\) choices become \(m_i\) choices.
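As an illustration of (15.5), the double sum only ever picks up the log probability of the chosen alternative for each individual. The sketch below assumes NumPy and codes the alternatives as 0, ..., m-1 (a coding choice made here, not in the text); the function name `multinomial_loglik` and the toy numbers are hypothetical.

```python
import numpy as np

def multinomial_loglik(y, probs):
    """Log-likelihood sum_i sum_j y_ij * ln(p_ij).

    y: (N,) integer codes 0..m-1 of the chosen alternative.
    probs: (N, m) predicted probabilities, each row summing to one.
    """
    y = np.asarray(y)
    probs = np.asarray(probs, dtype=float)
    # Only the chosen alternative has y_ij = 1, so select that column per row.
    chosen = probs[np.arange(len(y)), y]
    return float(np.sum(np.log(chosen)))

# Hypothetical choices and probabilities for two individuals.
y = np.array([0, 1])
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
print(multinomial_loglik(y, probs))
```

Any model in this chapter can be plugged in by supplying its own `probs` matrix, which is all that differs between specifications of \(F_j\).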

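The first-order conditions for the MLE \(\hat{\boldsymbol{\beta}}\) are that it solves

\[ \frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{N} \sum_{j=1}^{m} \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \boldsymbol{\beta}} = \mathbf{0}, \tag{15.6} \]

which is usually nonlinear in \(\boldsymbol{\beta}\). The distribution of \(y_i\) is necessarily multinomial, so
correct specification of the dgp means correct specification of the functional forms
\(F_j(\mathbf{x}_i, \boldsymbol{\beta})\) for the probabilities \(p_{ij}\). This ensures consistency as then \(\mathrm{E}[y_{ij}] = p_{ij}\),
so taking the expectation of (15.6) yields
\(\mathrm{E}[\partial \mathcal{L} / \partial \boldsymbol{\beta}] = \sum_{i=1}^{N} \sum_{j=1}^{m} \partial p_{ij} / \partial \boldsymbol{\beta}\), which
equals zero since \(\sum_{j=1}^{m} p_{ij} = 1\).

The usual asymptotic theory applies and the variance matrix is minus the inverse of
the information matrix. Differentiating the double sum in (15.6) with respect to \(\boldsymbol{\beta}'\) and
using \(\mathrm{E}[y_{ij}] = p_{ij}\) yields upon simplification

\[ \hat{\boldsymbol{\beta}}_{\mathrm{ML}} \overset{a}{\sim} \mathcal{N}\left[ \boldsymbol{\beta}, \left( \sum_{i=1}^{N} \sum_{j=1}^{m} \left( \frac{1}{p_{ij}} \frac{\partial p_{ij}}{\partial \boldsymbol{\beta}} \frac{\partial p_{ij}}{\partial \boldsymbol{\beta}'} - \frac{\partial^2 p_{ij}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}'} \right) \right)^{-1} \right]. \tag{15.7} \]

Provided observations are independent over \(i\) there is no need to use more general
sandwich forms of the variance matrix since the data are definitely multinomial
distributed and the information matrix equality will hold.

As already mentioned, different models correspond to different choices of \(F_j(\mathbf{x}_i, \boldsymbol{\beta})\)
for \(p_{ij}\) and hence different expressions in (15.6) and (15.7).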

Maximum  likelihood  estimation  for  choice-based  samples,  such  as  those  that 
oversample  infrequently  observed  outcomes,  is  presented  in  Sections  14.5  and  24.4. 


15.3.3.  Moment-Based  Estimation 

For  simple  cross-section  applications  the  standard  estimation  procedure  is  the  MLE. 

However, when complications such as endogeneity or correlation across
observational unit \(i\) arise, it can be more convenient to instead use moment-based estimators.
Assuming the probabilities are correctly specified, we can consider any estimator with
estimating equations

\[ \sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - p_{ij}) \mathbf{z}_{ij} = \mathbf{0}, \tag{15.8} \]

where \(\mathbf{z}_{ij}\), a vector of the same dimension as \(\boldsymbol{\beta}\), does not depend on \(y_{ij}\); for example,
\(\mathbf{z}_{ij} = \partial p_{ij} / \partial \boldsymbol{\beta}\). This estimator will be consistent if the functional form for \(p_{ij}\) is
correctly specified, as then \(\mathrm{E}[y_{ij}] = p_{ij}\) and the double sum on the left-hand side of (15.8)
has expected value zero. The efficiency of the estimator will vary with the choice of \(\mathbf{z}_{ij}\)
and in the most general case GMM estimation procedures can be used. The estimating
equations (15.8) are the basis for the method of simulated moments estimator for the
multinomial probit model (see Section 15.8.2).


15.3.4.  Alternative-Varying  Regressors 

Multinomial  regression  models  differ  not  only  in  the  choice  of  function  Fj(-)  in  (15.4) 
but  also  in  how  regressors  and  parameters  vary  across  the  alternatives. 

At one extreme all regressors may be alternative-varying, meaning that they take
different values for different alternatives. Let \(\mathbf{x}_{ij}\) denote the value of the regressors for
individual \(i\) and alternative \(j\), and let \(\mathbf{x}_i = [\mathbf{x}_{i1}' \; \mathbf{x}_{i2}' \cdots \mathbf{x}_{im}']'\). Then (15.4) is usually of
the form

\[ F_j(\mathbf{x}_i, \boldsymbol{\beta}) = F_j(\mathbf{x}_{i1}'\boldsymbol{\beta}, \ldots, \mathbf{x}_{im}'\boldsymbol{\beta}), \]

where the parameters \(\boldsymbol{\beta}\) are constant across alternatives. An example is the conditional
logit model defined later in (15.10).

At the other extreme all regressors may be alternative-invariant, meaning that
\(\mathbf{x}_i\) does not vary across alternatives. An example is individual socioeconomic
characteristics in a model of transportation mode choice. Then (15.4) is usually of the
form

\[ F_j(\mathbf{x}_i, \boldsymbol{\beta}) = F_j(\mathbf{x}_i'\boldsymbol{\beta}_1, \ldots, \mathbf{x}_i'\boldsymbol{\beta}_m), \]

where the parameters \(\boldsymbol{\beta}_j\) differ across alternatives and \(\boldsymbol{\beta} = [\boldsymbol{\beta}_1' \; \boldsymbol{\beta}_2' \cdots \boldsymbol{\beta}_m']'\). Parameter
identification requires a normalization such as \(\boldsymbol{\beta}_1 = \mathbf{0}\). An example is the multinomial
logit model defined later in (15.11).

The distinction between alternative-varying and alternative-invariant regressors is
of practical importance, as standard notation and computer programs for multinomial
models work exclusively with one or the other. In practice, of course, some regressors
may be alternative-varying and others alternative-invariant. In such cases it is best to
use a program written for alternative-varying regressors, as it is possible to go from
alternative-invariant regressors to the alternative-varying format. Let \(\mathbf{x}_i\) be a \(K \times 1\)
vector. Then define \(\mathbf{x}_{ij}\) to be a \(Km \times 1\) vector with zeros everywhere except that the
\(j\)th block is \(\mathbf{x}_i\), that is,

\[ \mathbf{x}_{ij} = [\mathbf{0}' \cdots \mathbf{0}' \; \mathbf{x}_i' \; \mathbf{0}' \cdots \mathbf{0}']', \]

and define \(\boldsymbol{\beta} = [\mathbf{0}' \; \boldsymbol{\beta}_2' \cdots \boldsymbol{\beta}_m']'\), where \(\boldsymbol{\beta}_1 = \mathbf{0}\) is a normalization. Then \(\mathbf{x}_i'\boldsymbol{\beta}_j = \mathbf{x}_{ij}'\boldsymbol{\beta}\).
The regressors are essentially included as interactions with alternative-specific
dummies. An example was given in Section 15.2.3. It is also possible to go from the
alternative-specific to the alternative-invariant format, but then \((m - 1)\) parameter
equality constraints need to be imposed for each of the alternative-specific regressors.
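The conversion from the alternative-invariant to the alternative-varying format amounts to a Kronecker product with an identity matrix. The sketch below assumes NumPy; the helper name `expand_invariant` and the numerical values are illustrative. It checks the identity \(\mathbf{x}_i'\boldsymbol{\beta}_j = \mathbf{x}_{ij}'\boldsymbol{\beta}\) for a single individual.

```python
import numpy as np

def expand_invariant(x_i, m):
    """Return an (m, K*m) array whose jth row is x_ij = [0' ... x_i' ... 0']'."""
    x_i = np.asarray(x_i, dtype=float)
    # kron(I_m, x_i') places x_i in the jth block of row j and zeros elsewhere.
    return np.kron(np.eye(m), x_i[None, :])

# Hypothetical regressors (K = 2) and coefficients for m = 3 alternatives.
x_i = np.array([1.0, 4.1])
betas = np.array([[0.0, 0.0],      # beta_1 = 0 is the normalization
                  [0.8, -0.14],
                  [0.7, 0.09]])
beta_stacked = betas.ravel()       # [0' beta_2' beta_3']'

X_i = expand_invariant(x_i, m=3)
index_varying = X_i @ beta_stacked  # x_ij' beta for each j
index_invariant = betas @ x_i       # x_i' beta_j for each j
print(index_varying, index_invariant)
```

Both vectors of indices coincide, which is why a program written for alternative-varying regressors can also fit the alternative-invariant model.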


15.3.5. Revealed Preference and Stated Preference Data

The  multinomial  data  used  in  microeconometric  studies  often  arise  from  individual 
consumer  choice.  Consumer  choice  data  may  be  either  revealed  preference  data, 
which  are  data  on  actual  decisions  and  outcomes,  or  stated  preference  data,  which 
are  survey  responses  to  hypothetical  questions.  An  example  of  revealed  preference  data 
would  be  actual  occupational  choice.  An  example  of  stated  preference  data  would  be 
a  marketing  study  for  fuel-efficient  vehicles  that  asks  a  respondent  to  choose  among 
various  hypothetical  vehicles  that  differ  in  characteristics  such  as  fuel  consumption, 
range,  and  price. 

Revealed preference data often provide little or no data on alternatives other than
that chosen. For example, we may know the price to an individual consumer of the
chosen product but not the prices of alternative products. The attraction of stated
preference data for multinomial modeling is that data are available on key variables such
as price for all possible alternatives. This is particularly advantageous if one wishes to
predict the probability of choice or market share of a new alternative on the basis of
characteristics of the new alternative, as all parameters can be alternative-invariant if
all regressors vary across alternatives.

There is some controversy in using stated preference data, because responses can
vary with the wording of questions. Moreover, people may overstate or understate their
willingness to pay to support particular policies. For example, some might overstate
their willingness to support an environmentally friendly policy.

Shopping scanner data are especially attractive because they provide data on
revealed choice while at the same time data on prices across all alternatives are also
provided.


15.3.6.  Model  Evaluation  and  Selection 

Regression  parameters  in  multinomial  models  can  be  difficult  to  directly  interpret. 
Instead,  it  is  useful  to  consider  the  marginal  effect  (or  elasticities)  of  changes 
in  regressors  on  outcome  probabilities.  Formulas  for  conditional  and  multinomial 
logit  models  are  given  in  Section  15.4.3  and  have  been  used  in  the  Section  15.2 
application. 

Several model evaluation methods are presented in Amemiya (1981) and
Maddala (1983). Using \(R^2\) measures based on the analogue of squared residuals does not
work well. Comparisons of predicted probabilities with actual outcomes are of
limited value as MNL models estimated with intercept impose in estimation the
restriction that the average of the predicted probabilities equals the sample average
probabilities for each alternative. It can be useful to look at the range of the in-sample
fitted probabilities for each alternative. The wider the range the more
discriminating is the model. For more detail see the discussion in Section 14.3.7 for binary
outcomes.

Multinomial models are usually estimated by maximum likelihood. Thus to the
extent that models are nested one can use standard likelihood ratio tests. When models
are nonnested one can use variants of Akaike's information criterion based on the fitted
log-likelihood with a degrees-of-freedom adjustment for the number of parameters in
the model (see Section 8.5.1).

A useful pseudo-\(R^2\) measure, due to McFadden (1973), is

\[ R^2 = 1 - \ln L_{\mathrm{fit}} / \ln L_0, \tag{15.9} \]

where \(L_{\mathrm{fit}}\) denotes the likelihood of the fitted model and \(L_0\) denotes that of an
intercept-only model that estimates the probability of each alternative to be the sample average.
For any multinomial model the theoretical maximum value of the log-likelihood is zero. This arises
if \(\hat{p}_{ij} = 1\) when \(y_{ij} = 1\) and \(\hat{p}_{ij} = 0\) otherwise, for all \(i\) and \(j\). Thus the \(R^2\) measure can
be rewritten as

\[ R^2 = \frac{\ln L_{\mathrm{fit}} - \ln L_0}{\ln L_{\max} - \ln L_0}. \]

This can be interpreted as the fraction of the maximum potential gain in log-likelihood
that is achieved by the fitted model (see Section 8.7.1).
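The measure in (15.9) needs only two numbers: the fitted log-likelihood and the intercept-only log-likelihood, and the latter follows directly from the outcome counts. The sketch below assumes NumPy; the function names, the counts, and the fitted log-likelihood value are hypothetical and chosen only to illustrate the arithmetic.

```python
import numpy as np

def intercept_only_loglik(counts):
    """ln L_0 = sum_j n_j * ln(n_j / N), assuming every count is positive."""
    counts = np.asarray(counts, dtype=float)
    shares = counts / counts.sum()
    return float(np.sum(counts * np.log(shares)))

def mcfadden_r2(loglik_fit, loglik_null):
    """Pseudo-R^2 of (15.9): 1 - ln L_fit / ln L_0."""
    return 1.0 - loglik_fit / loglik_null

# Hypothetical outcome counts and fitted log-likelihood.
counts = [30, 50, 20]
loglik_null = intercept_only_loglik(counts)
loglik_fit = -90.0
r2 = mcfadden_r2(loglik_fit, loglik_null)
print(round(r2, 3))
```

Because the maximum attainable log-likelihood is zero, the same number is obtained from the rewritten form with \(\ln L_{\max} = 0\).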
15.4. Multinomial Logit

The  simplest  multinomial  model  is  the  multinomial  logit  model,  proposed  by  Luce 
(1959).  The  commonly  used  variants  of  this  model  differ  according  to  whether  or  not 
regressors  vary  across  alternatives.  Many  of  the  issues  presented  in  this  section  pertain 
to  other  models  presented  more  briefly  in  subsequent  sections. 


15.4.1.  Conditional,  Multinomial,  and  Mixed  Logit  Models 


For alternative-varying regressors (see Section 15.3.4) the conditional logit model is
used. The CL model specifies

\[ p_{ij} = \frac{e^{\mathbf{x}_{ij}'\boldsymbol{\beta}}}{\sum_{l=1}^{m} e^{\mathbf{x}_{il}'\boldsymbol{\beta}}}, \quad j = 1, \ldots, m. \tag{15.10} \]

Since \(\exp(\mathbf{x}_{ij}'\boldsymbol{\beta}) > 0\) these probabilities lie between 0 and 1 and sum over \(j\) to one.
Indeed, once one has seen the formula (15.10) it appears to be the simplest
specification that ensures well-behaved probabilities. Because \(\sum_{l=1}^{m} p_{il} = 1\) an equivalent
model is obtained by defining \(\mathbf{x}_{ij}\) to be deviations of regressors from values of
alternative 1, say, and setting \(\mathbf{x}_{i1} = \mathbf{0}\).

When instead the regressors do not vary over alternatives, the multinomial logit
model is used. The MNL model specifies

\[ p_{ij} = \frac{e^{\mathbf{x}_i'\boldsymbol{\beta}_j}}{\sum_{l=1}^{m} e^{\mathbf{x}_i'\boldsymbol{\beta}_l}}, \quad j = 1, \ldots, m. \tag{15.11} \]

Because \(\sum_{j=1}^{m} p_{ij} = 1\), a restriction is needed to ensure model identification and the
usual restriction is that \(\boldsymbol{\beta}_1 = \mathbf{0}\).

The two models can be combined into what some authors call a mixed logit model,
with

\[ p_{ij} = \frac{e^{\mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{w}_i'\boldsymbol{\gamma}_j}}{\sum_{l=1}^{m} e^{\mathbf{x}_{il}'\boldsymbol{\beta} + \mathbf{w}_i'\boldsymbol{\gamma}_l}}, \quad j = 1, \ldots, m, \tag{15.12} \]

where \(\mathbf{x}_{ij}\) vary over alternatives and \(\mathbf{w}_i\) do not vary over alternatives. As discussed
in Sections 15.2.3 and 15.3.4, the mixed and MNL models can be reexpressed as a
CL model. Note that the term mixed logit model is also sometimes used for a quite
different model detailed in Section 15.7.
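The three specifications (15.10) to (15.12) differ only in how the index inside the exponential is formed. The sketch below assumes NumPy; the function names, array shapes, and randomly generated inputs are illustrative. It also checks the remark that the CL probabilities are unchanged when regressors are expressed as deviations from alternative 1.

```python
import numpy as np

def softmax_rows(v):
    """Row-wise exp(v_j) / sum_l exp(v_l); the shift by the row maximum
    guards against overflow and leaves the ratios unchanged."""
    v = v - v.max(axis=1, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

def cl_probs(X, beta):
    """Conditional logit (15.10). X: (N, m, K), beta: (K,)."""
    return softmax_rows(X @ beta)

def mnl_probs(W, gammas):
    """Multinomial logit (15.11). W: (N, K), gammas: (m, K), first row zero."""
    return softmax_rows(W @ gammas.T)

def mixed_probs(X, beta, W, gammas):
    """Mixed logit (15.12): both kinds of regressor enter the index."""
    return softmax_rows(X @ beta + W @ gammas.T)

# Hypothetical inputs: 5 individuals, 3 alternatives, 2 regressors of each kind.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3, 2))
W = rng.normal(size=(5, 2))
beta = np.array([0.5, -1.0])
gammas = np.vstack([np.zeros(2), rng.normal(size=(2, 2))])

P_cl = cl_probs(X, beta)
P_dev = cl_probs(X - X[:, :1, :], beta)  # deviations from alternative 1
print(np.allclose(P_cl, P_dev))
```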

All  these  models  can  be  given  the  general  label  of  multinomial  logit,  but  we  follow 
the  standard  convention  in  distinguishing  between  the  MNL  and  CL  models. 

An obvious generalization of the multinomial logit model is

\[ p_{ij} = \frac{V_{ij}}{\sum_{l=1}^{m} V_{il}}, \quad j = 1, \ldots, m, \tag{15.13} \]

where \(V_{ij} > 0\) can be quite general functions of regressors \(\mathbf{x}_i\) and parameters \(\boldsymbol{\beta}\). This
is called the universal logit model. Although this can generate a potentially rich class
of models it is seldom used in econometrics as it does not arise naturally from choice
theory.
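15.4.2. ML Estimation of CL and MNL Models

We present key formulas for the conditional logit and multinomial logit models.
Complete derivations are given in Section 15.12.

For the CL model, where \(p_{ij}\) is defined in (15.10),
\(\partial p_{ij} / \partial \boldsymbol{\beta} = p_{ij} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\), where
\(\bar{\mathbf{x}}_i = \sum_{l=1}^{m} p_{il} \mathbf{x}_{il}\) is a probability-weighted average of the regressors (see Section
15.12.1). The CL first-order conditions, given in (15.6) for general \(p_{ij}\), simplify
immediately to

\[ \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i) = \mathbf{0}. \tag{15.14} \]

Differentiating with respect to \(\boldsymbol{\beta}'\), using \(\mathrm{E}[y_{ij}] = p_{ij}\), and performing some further
algebra yields

\[ \hat{\boldsymbol{\beta}}_{\mathrm{CL}} \overset{a}{\sim} \mathcal{N}\left[ \boldsymbol{\beta}, \left( \sum_{i=1}^{N} \sum_{j=1}^{m} p_{ij} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)' \right)^{-1} \right]. \tag{15.15} \]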

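For the MNL model, \(p_{ij}\) is defined in (15.11) and it is shown in Section 15.12.2 that
\(\partial p_{ij} / \partial \boldsymbol{\beta}_k = p_{ij} (\delta_{ijk} - p_{ik}) \mathbf{x}_i\), where \(\delta_{ijk}\) is an indicator variable equal to 1 if \(j = k\)
and equal to 0 if \(j \neq k\), and that the resultant MNL first-order conditions simplify
after some algebra to

\[ \frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}_k} = \sum_{i=1}^{N} (y_{ik} - p_{ik}) \mathbf{x}_i = \mathbf{0}, \quad k = 1, \ldots, m. \tag{15.16} \]

As usual \(\hat{\boldsymbol{\beta}}_{\mathrm{MNL}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\beta}, -(\mathrm{E}[\partial^2 \mathcal{L} / \partial \boldsymbol{\beta} \partial \boldsymbol{\beta}'])^{-1}]\), where further algebra shows that the
information matrix has \(jk\)th block

\[ \mathrm{E}\left[ \frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\beta}_j \partial \boldsymbol{\beta}_k'} \right] = -\sum_{i=1}^{N} p_{ij} (\delta_{ijk} - p_{ik}) \mathbf{x}_i \mathbf{x}_i', \quad j = 1, \ldots, m, \quad k = 1, \ldots, m. \tag{15.17} \]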
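The score (15.16) and the blocks in (15.17) are all that a Newton-Raphson iteration needs. The following is a minimal sketch on simulated data, assuming NumPy; the function names, the coefficient values, the sample size, and the fixed iteration count are illustrative assumptions, and a production routine would add a convergence test. It also illustrates the earlier remark that, with an intercept, the average fitted probabilities equal the sample shares.

```python
import numpy as np

def mnl_probabilities(X, B):
    """Return the (N, m) matrix of MNL probabilities for coefficients B (m, K)."""
    V = X @ B.T
    V = V - V.max(axis=1, keepdims=True)  # numerical safeguard only
    E = np.exp(V)
    return E / E.sum(axis=1, keepdims=True)

def fit_mnl(X, y, m, iterations=25):
    """Newton-Raphson for the MNL model with alternative 0 as base category.

    X is (N, K) and should include a constant; y holds integers in 0..m-1.
    Returns an (m, K) array whose first row is the zero normalization.
    """
    n, k = X.shape
    Y = np.eye(m)[y]                   # indicators y_ij
    theta = np.zeros((m - 1) * k)      # stacked free coefficients
    for _ in range(iterations):
        B = np.vstack([np.zeros(k), theta.reshape(m - 1, k)])
        P = mnl_probabilities(X, B)
        # Score for each free alternative: sum_i (y_ik - p_ik) x_i
        score = ((Y - P)[:, 1:].T @ X).ravel()
        # Blocks sum_i p_ij (delta_jk - p_ik) x_i x_i' (minus the Hessian)
        info = np.zeros((theta.size, theta.size))
        for a in range(1, m):
            for b in range(1, m):
                w = P[:, a] * (float(a == b) - P[:, b])
                info[(a - 1) * k:a * k, (b - 1) * k:b * k] = (X * w[:, None]).T @ X
        theta = theta + np.linalg.solve(info, score)
    return np.vstack([np.zeros(k), theta.reshape(m - 1, k)])

# Simulated data with hypothetical coefficients, purely for illustration.
rng = np.random.default_rng(0)
n, m = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([[0.0, 0.0], [0.5, -1.0], [-0.3, 0.8]])
P_true = mnl_probabilities(X, B_true)
y = np.array([rng.choice(m, p=row) for row in P_true])

B_hat = fit_mnl(X, y, m)
P_hat = mnl_probabilities(X, B_hat)
sample_shares = np.bincount(y, minlength=m) / n
print(np.round(P_hat.mean(axis=0), 4), np.round(sample_shares, 4))
```

The equality of the two printed vectors follows from the first-order condition for the intercept, since the first element of \(\mathbf{x}_i\) is one.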


15.4.3.  Regression  Parameter  Interpretation 

Care is needed in the interpretation of parameters in any nonlinear model. This is
particularly so for multinomial models where, for example, there is not necessarily a
one-to-one correspondence between coefficient sign and choice probability. Here
we present results used in the Section 15.2 application.


Marginal  Effects  and  Elasticities 

We  focus  on  marginal  effects  on  the  choice  probabilities  of  a  change  in  the  regressor 
for  a  given  individual.  Elasticities  can  then  be  computed  by  multiplying  the  marginal 
effect  by  the  current  regressor  value  and  dividing  by  the  probability.  Typically  these  are 
then  averaged  over  individuals  to  give  an  average  marginal  effect  or  average  elasticity. 

For the CL model consider the effect on the \(j\)th probability of changing by one
unit the value of the regressor for the \(k\)th alternative. For example, what is the effect
on the probabilities of choosing various modes of transportation if travel time by bus
increases by a minute whereas the travel time by other modes is unchanged? From
Section 15.12.1

\[ \frac{\partial p_{ij}}{\partial \mathbf{x}_{ik}} = p_{ij} (\delta_{ijk} - p_{ik}) \boldsymbol{\beta}, \tag{15.18} \]

where \(\delta_{ijk}\) was defined after (15.15). It follows that if the regression coefficient is
positive then an increase in the corresponding component of the regressor value for
the \(k\)th alternative increases the probability of the \(k\)th alternative and decreases the
probability of the other alternatives.

For the MNL model consider instead the effect on the \(j\)th probability of changing
by one unit a regressor that takes the same value across all alternatives. For example,
what is the effect on the probabilities of choosing to work if age increases by one year?
From Section 15.12.2

\[ \frac{\partial p_{ij}}{\partial \mathbf{x}_i} = p_{ij} (\boldsymbol{\beta}_j - \bar{\boldsymbol{\beta}}_i), \tag{15.19} \]

where \(\bar{\boldsymbol{\beta}}_i = \sum_{l} p_{il} \boldsymbol{\beta}_l\) is a probability-weighted average of the \(\boldsymbol{\beta}_l\). It follows that the
sign of the response is not necessarily given by the sign of \(\boldsymbol{\beta}_j\), unless \(\boldsymbol{\beta}_j > \boldsymbol{\beta}_k\) for
all \(k \neq j\), and it does not necessarily make any sense to test whether a particular
coefficient is zero. As in other nonlinear models we may compute the average response
\(N^{-1} \sum_{i} \partial p_{ij} / \partial \mathbf{x}_i = N^{-1} \sum_{i} p_{ij} (\boldsymbol{\beta}_j - \bar{\boldsymbol{\beta}}_i)\), or we can use noncalculus methods and
compare the change in the average predicted probability as regressors change.
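Both marginal-effect formulas are easy to evaluate for a single individual and to compare against a numerical derivative. The sketch below assumes NumPy; the function names and all numerical values are hypothetical and serve only to illustrate (15.18) and (15.19).

```python
import numpy as np

def cl_probs_one(x, beta):
    """CL probabilities for one individual. x: (m, K), beta: (K,)."""
    v = x @ beta
    e = np.exp(v - v.max())
    return e / e.sum()

def cl_marginal_effects(x, beta, r):
    """(m, m) matrix; entry [j, k] is dp_j/dx_kr = p_j (delta_jk - p_k) beta_r."""
    p = cl_probs_one(x, beta)
    return beta[r] * (np.diag(p) - np.outer(p, p))

def mnl_probs_one(x, betas):
    """MNL probabilities for one individual. x: (K,), betas: (m, K)."""
    v = betas @ x
    e = np.exp(v - v.max())
    return e / e.sum()

def mnl_marginal_effects(x, betas):
    """(m, K) matrix; row j is dp_j/dx = p_j (beta_j - beta_bar)."""
    p = mnl_probs_one(x, betas)
    beta_bar = p @ betas  # probability-weighted average of the beta_l
    return p[:, None] * (betas - beta_bar)

# Hypothetical CL example: 3 alternatives, 2 regressors.
x_cl = np.array([[1.0, 0.3], [0.5, 0.2], [0.8, 0.6]])
beta = np.array([-0.5, 1.0])
me_cl = cl_marginal_effects(x_cl, beta, r=0)

# Hypothetical MNL example: first alternative is the base category.
x_mnl = np.array([1.0, 2.0])
betas = np.array([[0.0, 0.0], [0.4, -0.2], [-0.1, 0.3]])
me_mnl = mnl_marginal_effects(x_mnl, betas)

print(np.round(me_cl, 4))
print(np.round(me_mnl, 4))
```

In the CL matrix the diagonal entries carry the sign of the coefficient and the off-diagonal entries the opposite sign, whereas the signs in the MNL matrix depend on how each \(\boldsymbol{\beta}_j\) compares with the weighted average.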


Comparison  to  Base  Category 


The  coefficients  in  the  CL  and  MNL  models  can  also  be  given  a  more  direct  logit-like 
interpretation  in  terms  of  relative  risk  (detailed  in  Section  14.3.4).  This  is  because  the 
models  can  be  reexpressed  as  binary  logit  models. 

For the MNL model, comparison is to a base category, which is the alternative
normalized to have coefficients equal to zero. To see this note that the multinomial logit
probabilities (15.11) imply that the conditional probability of observing alternative \(j\)
given that either alternative \(j\) or alternative \(k\) is observed is

\[ \Pr[y = j \mid y = j \text{ or } k] = \frac{p_j}{p_j + p_k} = \frac{e^{\mathbf{x}'\boldsymbol{\beta}_j}}{e^{\mathbf{x}'\boldsymbol{\beta}_j} + e^{\mathbf{x}'\boldsymbol{\beta}_k}} = \frac{e^{\mathbf{x}'(\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)}}{1 + e^{\mathbf{x}'(\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)}}, \tag{15.20} \]

which is a logit model with coefficient \((\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)\). The second equality comes
after some simplification. Suppose normalization is on alternative 1, so that \(\boldsymbol{\beta}_1 = \mathbf{0}\).
Then

\[ \Pr[y_i = j \mid y_i = j \text{ or } 1] = \frac{e^{\mathbf{x}_i'\boldsymbol{\beta}_j}}{1 + e^{\mathbf{x}_i'\boldsymbol{\beta}_j}} \]

and \(\boldsymbol{\beta}_j\) can be interpreted in the same way as the logit model coefficient for binary
choice between alternatives \(j\) and 1. Similarly to the binary logit model the relative
risk of choosing alternative \(j\) rather than alternative 1 is

\[ \frac{\Pr[y_i = j]}{\Pr[y_i = 1]} = e^{\mathbf{x}_i'\boldsymbol{\beta}_j}, \]

and hence \(e^{\beta_{jr}}\) gives the proportionate change in this relative risk when \(x_{ir}\) changes by
one unit. Such interpretations will vary according to which alternative is normalized to
have zero coefficient, and for this interpretation to be really useful one needs to have
a natural base category. For example, if interest lies in various alternative commute
modes to traveling by car then normalize the coefficients for the car alternative to equal
zero.

A similar approach can also be applied to the CL model, with

\[ \Pr[y_i = j \mid y_i = j \text{ or } k] = \frac{e^{(\mathbf{x}_{ij} - \mathbf{x}_{ik})'\boldsymbol{\beta}}}{1 + e^{(\mathbf{x}_{ij} - \mathbf{x}_{ik})'\boldsymbol{\beta}}}, \tag{15.21} \]

and normalization now is with respect to regressor values for a base category.


15.4.4.  Independence  of  Irrelevant  Alternatives 

A limitation of the CL and MNL models is that discrimination among the $m$ alternatives
reduces to a series of pairwise comparisons that are unaffected by the characteristics
of alternatives other than the pair under consideration. This is clear from (15.20)
and (15.21), which show that the MNL and CL models reduce to a binary choice logit model
between any pair of choices. The conditional probability does not depend on other
alternatives.

As an extreme example, the conditional probability of commute by car given commute
by car or red bus is assumed in an MNL or CL model to be independent of
whether commuting by blue bus is an option. However, in practice we would expect
introduction of a blue bus, which is the same as a red bus in every aspect except color,
to have little impact on car use and to halve use of the red bus, leading to an increase
in the conditional probability of car use given commute by car or red bus.
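
The red bus-blue bus argument can be checked numerically. The following is a minimal sketch, assuming hypothetical deterministic utilities that are equal for all modes; the function name `cl_probs` is illustrative and not from the text. It shows that under the logit form the conditional probability of car given car or red bus is unchanged when the blue bus is added, whereas the intuitive outcome (car 1/2, red bus 1/4, blue bus 1/4) would raise it to 2/3.

```python
import numpy as np

def cl_probs(v):
    """Logit choice probabilities for a vector of deterministic utilities."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical utilities (an assumption): all modes equally attractive.
p2 = cl_probs([0.0, 0.0])        # [car, red bus]
p3 = cl_probs([0.0, 0.0, 0.0])   # [car, red bus, blue bus]

# Conditional probability of car given car or red bus, before and after.
cond2 = p2[0] / (p2[0] + p2[1])
cond3 = p3[0] / (p3[0] + p3[1])

# What intuition suggests instead: car keeps 1/2, the two buses split 1/2.
intuitive = 0.5 / (0.5 + 0.25)

print(p2, cond2)
print(p3, cond3)
print(intuitive)
```

The logit model forces `cond2` and `cond3` to coincide, which is exactly the independence of irrelevant alternatives property.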

This weakness of MNL is known in the literature as the red bus-blue bus problem,
or more formally as the assumption of independence of irrelevant alternatives.
It can be tested by a Hausman test (see Hausman and McFadden, 1984). For example,
we could compare the coefficient estimates of red bus in a three-choice model of
car, red bus, and blue bus, with car the base category, with the coefficient estimates
of red bus in a binary choice model of car and red bus, again with car the base
category.

Much of the econometrics literature has focused on alternative unordered models
that do not have this weakness. These models are presented in Sections 15.6-15.8.


15.5. Additive Random Utility Models

Unordered multinomial models more general than multinomial and conditional logit
can be obtained using the general framework of additive random utility models,
presented in this section. Subsequent sections describe the leading examples.


15.5.1.  ARUM 

The additive random utility model was introduced in Section 14.4.2 for binary outcomes.
In the general $m$-choice multinomial model the utility of the $j$th choice is
specified to be given by

$$
U_j = V_j + \varepsilon_j, \quad j = 1, 2, \ldots, m, \tag{15.22}
$$

where $V_j$ denotes the deterministic component of utility and $\varepsilon_j$ denotes the random
component of utility. For the $i$th individual usually $V_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta}$ or
$V_{ij} = \mathbf{x}_i'\boldsymbol{\beta}_j$, though more structural analysis may specify direct or indirect
utility functions used in consumer demand theory. For notational simplicity we suppress the
individual subscript $i$ in the following.

The chosen alternative is that with the highest utility, so that

$$
\begin{aligned}
\Pr[y = j] &= \Pr[U_j \geq U_k, \text{ all } k \neq j] \\
&= \Pr[U_k - U_j \leq 0, \text{ all } k \neq j] \\
&= \Pr[\varepsilon_k - \varepsilon_j \leq V_j - V_k, \text{ all } k \neq j] \\
&= \Pr[\tilde{\varepsilon}_{kj} \leq -\tilde{V}_{kj}, \text{ all } k \neq j],
\end{aligned}
\tag{15.23}
$$

where the tilde and second subscript $j$ denote differencing with respect to reference
alternative $j$, so that $\tilde{\varepsilon}_{kj} = \varepsilon_k - \varepsilon_j$ and $\tilde{V}_{kj} = V_k - V_j$.

Different multinomial models can be generated by different assumptions about the
joint distribution of the error terms. These models are valid statistically, with
probabilities summing to one. Additionally, they are consistent with the standard economic
theory of decision making.

For example, consider the expression for $\Pr[y = 1]$ in a three-choice model. Using
the last equality in (15.23) and defining $\tilde{\varepsilon}_{31} = \varepsilon_3 - \varepsilon_1$ and
$\tilde{\varepsilon}_{21} = \varepsilon_2 - \varepsilon_1$ we have

$$
\begin{aligned}
\Pr[y = 1] &= \Pr[\tilde{\varepsilon}_{21} \leq -\tilde{V}_{21},\ \tilde{\varepsilon}_{31} \leq -\tilde{V}_{31}] \\
&= \int_{-\infty}^{-\tilde{V}_{31}} \int_{-\infty}^{-\tilde{V}_{21}} f(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31})\, d\tilde{\varepsilon}_{21}\, d\tilde{\varepsilon}_{31},
\end{aligned}
\tag{15.24}
$$

which is a bivariate integral that generally does not have an analytical solution. More
generally, an $m$-choice model involves an $(m - 1)$-variate integral that may or may not
yield a closed-form solution for $\Pr[y = j]$.

In general all the errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m$ may be correlated across choices. Some
covariance restrictions are necessary, however, as the model is identified only up to the
$(m - 1)$ error-difference pairs (see the last equality in (15.23)), and additionally one
variance needs to be specified since the $U_j$ are only determined up to scale.


15.5.2. Different Unordered Multinomial Models

Different unordered multinomial models arise from different assumptions on the joint
distribution of $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m$. Analysis is simplest if the error assumptions lead to a
closed-form solution for the choice probabilities. However, in many applications these
assumptions are felt to be too restrictive.

The computationally intensive methods summarized in Chapter 12 permit estimation
even if there is no closed-form solution for the choice probabilities. Sections 15.7.2 and
15.8.2 present multinomial examples of these methods.


Type 1 Extreme Value Errors

We first assume that the errors $\varepsilon_j$ are iid type 1 extreme value, with density

$$
f(\varepsilon_j) = e^{-\varepsilon_j} \exp(-e^{-\varepsilon_j}), \quad j = 1, 2, \ldots, m. \tag{15.25}
$$

The properties of this density were given in Section 14.4.2, where it was shown to lead
to a logit model in the binary outcome case.

For multinomial outcomes modelled using the ARUM with type 1 extreme value
errors it can be shown that (15.23) yields

$$
\Pr[y = j] = \frac{e^{V_j}}{e^{V_1} + e^{V_2} + \cdots + e^{V_m}}. \tag{15.26}
$$

This is the CL model when $V_j = \mathbf{x}_j'\boldsymbol{\beta}$ and the MNL model when $V_j = \mathbf{x}'\boldsymbol{\beta}_j$. The result
can be obtained either by integration and simplification similar to the binary case (see
Section 14.8), or as a special case of the nested logit result derived in Section 15.6.
Thus conditional and multinomial logit models can be obtained from an ARUM.
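
The link between the ARUM and the logit probabilities in (15.26) can be illustrated by simulation. The sketch below draws iid type 1 extreme value errors, picks the alternative with the highest utility, and compares the choice frequencies with the logit formula. The utilities in `V`, the number of draws, and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic utilities V_1, V_2, V_3 (an assumption).
V = np.array([0.5, 0.0, -1.0])
n_draws = 200_000

# iid type 1 extreme value (Gumbel, location 0, scale 1) errors, as in (15.25).
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, V.size))

# The chosen alternative is the one with the highest utility U_j = V_j + eps_j.
choice = np.argmax(V + eps, axis=1)
freq = np.bincount(choice, minlength=V.size) / n_draws

# Closed-form logit probabilities from (15.26).
logit = np.exp(V) / np.exp(V).sum()

print(freq)
print(logit)
```

With this many draws the simulated frequencies should agree with the closed-form probabilities to roughly two or three decimal places.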

The assumption that the errors $\varepsilon_j$ are independent across alternatives $j$ is too
restrictive, as it is likely to be violated if two alternatives are similar. For example,
suppose alternatives 1 and 2 are similar. A low value of $\varepsilon_1$ (i.e., large and negative)
leads to overprediction of the utility of alternative 1. We then also expect to overpredict
the utility of alternative 2, so that $\varepsilon_2$ also takes a low value. Since low values of
$\varepsilon_1$ and $\varepsilon_2$ tend to go together, and similarly for high values, the errors must be
correlated. This is another way of viewing the "red bus-blue bus" problem, and it is a
manifestation of a failure of the logit assumption of independence of irrelevant alternatives.

The generalized extreme value model and the nested logit model (see Section 15.6)
relax the assumption that the extreme value errors are independent across choices. The
errors are grouped with independence across groups but correlation permitted within
groups. Closed-form solutions are then available for the choice probabilities. These
models are richer than the MNL model, which is the special case of no correlation within
groups, but in many applications the grouping of errors can be somewhat arbitrary.

The random parameters logit model (see Section 15.7) introduces additional
randomness into the MNL model that induces correlation of utilities across alternatives.
This is an example of a generalized random utility model (see Section 15.7.3).


Normally Distributed Errors

The multinomial probit model (see Section 15.8) arises if the errors $\varepsilon_1, \ldots, \varepsilon_m$ are
assumed to be jointly normally distributed. This error assumption is a more natural starting
point than one of type 1 extreme value. It permits a very rich correlation structure, at
the expense of the need to use numerical or simulation methods that accommodate an
$(m - 1)$-variate normal integral.


15.5.3.  Consistency  with  Random  Utility  Models 

It is always possible to present an analytical expression for choice probabilities that lie
between zero and one and that sum over alternatives to one. A quite general example
is the universal logit model in (15.13). The econometrics literature has placed great
emphasis on restricting attention to multinomial models that are consistent with
maximization of a random utility function. This is similar to restricting analysis to demand
functions that are consistent with consumer choice theory.

Let $\mathbf{V} = (V_1, \ldots, V_m)$. From Borsch-Supan (1987, p. 19), a set of choice
probabilities $p_j(\mathbf{V})$, $j = 1, \ldots, m$, is compatible with maximization of an ARUM if

1. $p_j(\mathbf{V}) \geq 0$, $\sum_{j=1}^{m} p_j(\mathbf{V}) = 1$, and $p_j(\mathbf{V}) = p_j(\mathbf{V} + a)$ for all $a \in \mathbb{R}$;

2. $\partial p_j(\mathbf{V})/\partial V_k = \partial p_k(\mathbf{V})/\partial V_j$; and

3. $\partial^{(m-1)} p_j(\mathbf{V})/\partial V_1 \cdots [\partial V_j] \cdots \partial V_m \geq 0$, where the square bracket denotes a term to be
dropped out.

These conditions, due to Williams (1977), Daly and Zachary (1979), and McFadden
(1981), ensure in turn (1) well-behaved probabilities and translation invariance;
(2) integrability of $p_j$ similar to the Slutsky condition; and (3) that the distribution
function of the errors in the corresponding ARUM has a proper (nonnegative) density
function.


15.5.4.  Welfare  Analysis 

A  major  advantage  of  using  a  multinomial  model  that  is  a  random  utility  model  is  that 
it  permits  welfare  analysis.  Then  one  can  place  a  dollar  value  on  the  effect  of  changing 
one  or  more  of  the  determinants  of  choice,  such  as  price  or  time  cost  of  travel  in 
transportation  mode  choice. 

Standard  welfare  analysis  uses  compensating  variation  or  equivalent  variation. 
The  deterministic  component  of  utility  in  (15.22)  is  specified  as  the  indirect  utility 
function 


$$
V_j = V(I - p_j, \mathbf{x}_j), \tag{15.27}
$$

where $I$ denotes income, $p_j$ is the price of the $j$th alternative, and $\mathbf{x}_j$ are characteristics
associated with the $j$th alternative. For notational simplicity the unknown regression
parameters $\boldsymbol{\beta}$ are suppressed. Then utility of alternative $j$ is

$$
U_j = U(I - p_j, \mathbf{x}_j, \varepsilon_j) = V(I - p_j, \mathbf{x}_j) + \varepsilon_j. \tag{15.28}
$$

Suppose we change the characteristics from $\mathbf{x}_j'$ to $\mathbf{x}_j''$. Then compensating variation
$CV$ is the change in income needed to hold utility at its initial level, so that the highest
utility level attainable with income $I$ and characteristics $\mathbf{x}_j'$ must equal the highest level
attainable with income $(I - CV)$ and characteristics $\mathbf{x}_j''$. Thus compensating variation
$CV$ is implicitly defined as the solution to

$$
\max_{j=1,\ldots,m} U(I - p_j, \mathbf{x}_j', \varepsilon_j) = \max_{j=1,\ldots,m} U(I - CV - p_j, \mathbf{x}_j'', \varepsilon_j). \tag{15.29}
$$

As an example, consider a two-choice model where $U_j = I + x_j + \varepsilon_j$, $j = 1, 2$,
and the scalar $x_j$ changes from $x_j'$ to $x_j''$. Then there are four possibilities. If
alternative 1 is chosen before and after then $CV = (x_1'' - x_1')$, since then
$U_1'' = I - CV + x_1'' + \varepsilon_1 = I + x_1' + \varepsilon_1 = U_1'$. Similarly, if alternative 2 is chosen
before and after then $CV = (x_2'' - x_2')$. If switching occurs from alternative 1 to
alternative 2 then $U_2'' = U_1'$ implies $I - CV + x_2'' + \varepsilon_2 = I + x_1' + \varepsilon_1$, which implies
$CV = x_2'' - x_1' + \varepsilon_2 - \varepsilon_1$. Similarly, if switching occurs from alternative 2 to
alternative 1 then $CV = x_1'' - x_2' + \varepsilon_1 - \varepsilon_2$. More generally, for $m$ choices the
compensating variation in this simple example is $CV_{jk} = V_k'' - V_j' + \varepsilon_k - \varepsilon_j$ if the
change in $x$ leads to a change from alternative $j$ to alternative $k$.

The compensating variation depends on observables ($I$, $p_j$, and $\mathbf{x}_j$), parameters that
can be estimated, and on unobservable errors $\varepsilon_j$. The unobservables are eliminated by
computing the expected compensating variation $\mathrm{E}[CV]$, which involves integrating
over $\varepsilon_j$. From the preceding example it should be clear that this integration can be
quite difficult. Dagsvik and Karlstrom (2004) provide quite general results, discussed
further in Section 15.6.5.

For some models there is no analytical solution for $\mathrm{E}[CV]$. Then one instead needs
to numerically integrate over $\varepsilon_j$ the function for $CV$ defined in (15.29). From
Section 12.3.2 this integral can be simulated in the following way:

1. At iteration $s$ draw $\boldsymbol{\varepsilon}^s$ from the distribution of $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_m)$.

2. Calculate $CV^s$ from $\max_j U(I - p_j, \mathbf{x}_j', \varepsilon_j^s) = \max_j U(I - CV^s - p_j, \mathbf{x}_j'', \varepsilon_j^s)$.

3. Repeat steps 1 and 2 $S$ times.

4. Estimate $\mathrm{E}[CV]$ by $S^{-1} \sum_{s=1}^{S} CV^s$.

This method yields $\mathrm{E}[CV]$ for each individual in the sample. Averaging, possibly
with weighting, provides a population estimate. Application to the GEV model is
discussed in Section 15.6.5.
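
The four simulation steps can be sketched for the simple example above, where $U_j = I + x_j + \varepsilon_j$. This is only an illustration under stated assumptions: the errors are taken to be iid type 1 extreme value, the values in `x_before` and `x_after` are hypothetical, and the function name is ours. Because utility is linear in income with unit coefficient, step 2 has the explicit solution $CV^s = \max_j(x_j'' + \varepsilon_j^s) - \max_j(x_j' + \varepsilon_j^s)$ and income $I$ cancels. For this logit case the expected value has a known log-sum closed form, used here as a check.

```python
import numpy as np

def simulate_expected_cv(x_before, x_after, n_draws=200_000, seed=0):
    """Simulate E[CV] when U_j = I + x_j + eps_j with iid type 1 extreme value errors."""
    rng = np.random.default_rng(seed)
    x_before = np.asarray(x_before, dtype=float)
    x_after = np.asarray(x_after, dtype=float)
    # Step 1 (for all s at once): draw the error vector for each iteration.
    eps = rng.gumbel(size=(n_draws, x_before.size))
    # Step 2: solve max_j U' = max_j U'' for CV^s; the same draws are used
    # before and after the change, and income cancels.
    cv = (x_after + eps).max(axis=1) - (x_before + eps).max(axis=1)
    # Steps 3-4: average over the S draws.
    return cv.mean()

# Hypothetical characteristics: alternative 2 improves by one unit.
x_before = [1.0, 0.5, 0.0]
x_after = [1.0, 1.5, 0.0]

sim = simulate_expected_cv(x_before, x_after)
closed = np.log(np.sum(np.exp(x_after))) - np.log(np.sum(np.exp(x_before)))
print(sim, closed)
```

When income enters nonlinearly, step 2 no longer has this explicit form and each $CV^s$ would have to be found by a one-dimensional root search.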


15.6.  Nested  Logit 

The  nested  logit  is  the  most  analytically  tractable  generalization  of  the  multinomial 
models.  It  is  the  ideal  model  to  use  when  there  is  a  clear  nesting  structure,  but  not  all 
multinomial choice applications have an obvious nesting structure.


15.6.1. Generalized Extreme Value Model

McFadden (1978) proposed a quite general class of model based on the assumption that
the joint distribution of the errors is the generalized extreme value (GEV) distribution
with joint distribution function

$$
F(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m) = \exp[-G(e^{-\varepsilon_1}, e^{-\varepsilon_2}, \ldots, e^{-\varepsilon_m})], \tag{15.30}
$$

where the function $G(Y_1, Y_2, \ldots, Y_m)$ is specified to satisfy a number of assumptions
including nonnegativity, homogeneity of degree one, mixed partial derivatives that are
continuous and nonpositive for even order and nonnegative for odd order, and
$\lim_{Y_j \to \infty} G(Y_1, Y_2, \ldots, Y_m) = \infty$. These assumptions ensure that the joint distribution and
resulting marginal distributions are well defined and that probabilities sum to one.

If the errors are GEV distributed then an explicit solution for the probabilities in the
random utility model (15.22) can be obtained, with

$$
p_j = \Pr[y = j] = e^{V_j}\, \frac{G_j(e^{V_1}, e^{V_2}, \ldots, e^{V_m})}{G(e^{V_1}, e^{V_2}, \ldots, e^{V_m})}, \tag{15.31}
$$

where $G_j(Y_1, Y_2, \ldots, Y_m) = \partial G(Y_1, Y_2, \ldots, Y_m)/\partial Y_j$ (see McFadden, 1978, p. 81).

A wide range of models can be obtained by different choices of $G(Y_1, Y_2, \ldots, Y_m)$.
The MNL model is obtained if $G(Y_1, Y_2, \ldots, Y_m) = \sum_{k=1}^{m} Y_k$, since then $G_j = 1$ and
(15.31) reduces to (15.26); hence the MNL model is a GEV model. The other widely used GEV
model is the nested logit model.


15.6.2.  Nested  Logit  Model 

The nested logit model breaks decision making into groups. A simple example is to
consider choice of college, where people first decide whether to go to a two-year or
four-year college, and then within each of these paths whether to go to a public or
private college. The situation is depicted as follows:

                     College
                  /           \
            2 year             4 year
           /      \           /      \
      Private    Public   Private    Public

The  errors  in  a  random  utility  model  are  permitted  to  be  correlated  for  each  option 
within  the  two-year  and  four-year  groups,  but  they  are  uncorrelated  across  these  two 
groups. 

More generally, we suppose that at the top level there are $J$ limbs to choose from.
The $j$th limb has $K_j$ branches numbered $j1, \ldots, jk, \ldots, jK_j$. The utility for the
alternative in the $j$th of $J$ limbs and $k$th of $K_j$ branches is then

$$
U_{jk} = V_{jk} + \varepsilon_{jk}, \quad k = 1, 2, \ldots, K_j, \quad j = 1, 2, \ldots, J, \tag{15.32}
$$

where for an $m$-choice model $K_1 + \cdots + K_J = m$. This is illustrated as follows:

                                      Root
              /                        |                        \
           limb 1         ...        limb j        ...        limb J
          /      \                     |                     /      \
   branch 1 ... branch K_1   ...    branch k    ...   branch 1 ... branch K_J
   V_11+e_11 ... V_1K1+e_1K1 ...   V_jk+e_jk    ...   V_J1+e_J1 ... V_JKJ+e_JKJ

(In the diagram e_jk denotes the error $\varepsilon_{jk}$.)

There  can  be  additional  levels,  with  the  third  level  being  a  twig,  etc.  For  notational 
simplicity  we  present  results  for  a  two-level  model. 

For any model with this nesting, $p_{jk}$, the joint probability of being on limb $j$ and
branch $k$, can be factored as $p_j$, the probability of choosing limb $j$, times $p_{k|j}$, the
probability of choosing branch $k$ conditional on being on limb $j$. Thus

$$
p_{jk} = p_j \times p_{k|j}.
$$

The nested logit model of McFadden (1978) arises when the error terms $\varepsilon_{jk}$ have
the GEV joint cumulative distribution function

$$
F(\boldsymbol{\varepsilon}) = \exp[-G(e^{-\varepsilon_{11}}, \ldots, e^{-\varepsilon_{1K_1}}; \ldots; e^{-\varepsilon_{J1}}, \ldots, e^{-\varepsilon_{JK_J}})] \tag{15.33}
$$

for the following particular specification of the function $G(\cdot)$:

$$
G(\mathbf{Y}) = G(Y_{11}, \ldots, Y_{1K_1}, \ldots, Y_{J1}, \ldots, Y_{JK_J}) = \sum_{j=1}^{J} \left( \sum_{k=1}^{K_j} Y_{jk}^{1/\rho_j} \right)^{\rho_j}. \tag{15.34}
$$

The parameter $\rho_j$ is a function of the correlation between $\varepsilon_{jk}$ and $\varepsilon_{jl}$ but does
not exactly equal the correlation parameter. In fact $\rho_j$ can be shown to equal
$\sqrt{1 - \mathrm{Cor}[\varepsilon_{jk}, \varepsilon_{jl}]}$, so $\rho_j$ is inversely related to the correlation and we expect
$0 \leq \rho_j \leq 1$. The choice $\rho_j = 1$ corresponds to independence of $\varepsilon_{jk}$ and $\varepsilon_{jl}$ and leads to
the MNL model. We call the parameters $\rho_j$ the scale parameters, as they scale
regression parameters in the models considered in the following.

Notation varies considerably across authors. McFadden (1978) and Maddala (1983)
instead define this cdf in terms of $\sigma_j = 1 - \rho_j$, called the dissimilarity parameter.
Others use $\mu_j = 1/\rho_j$. Many authors model alternative $ij$ for the $n$th individual
whereas we model alternative $jk$ and reserve $i$ for the $i$th individual.

The outcome indicator variables $y_{jk}$ equal one if alternative $jk$ is chosen and
zero otherwise. Then from (15.32), $p_{jk} = \Pr[y_{jk} = 1] = \Pr[U_{jk} \geq U_{lm}, \text{ for all } l, m]$.
Closed-form solutions for the probabilities $p_{jk}$, as a function of the $V_{jk}$ and $\rho_j$, are
derived in Section 15.12.3. These are then evaluated for the particular deterministic
utility function

$$
V_{jk} = \mathbf{z}_j'\boldsymbol{\alpha} + \mathbf{x}_{jk}'\boldsymbol{\beta}_j, \quad k = 1, \ldots, K_j, \quad j = 1, \ldots, J, \tag{15.35}
$$

where $\mathbf{z}_j$ varies over limbs only and $\mathbf{x}_{jk}$ varies over both limbs and branches. The
parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_j$ are called regression parameters.

The GEV model (15.32)-(15.35) yields the nested logit model

$$
p_{jk} = p_j \times p_{k|j} = \frac{\exp(\mathbf{z}_j'\boldsymbol{\alpha} + \rho_j I_j)}{\sum_{m=1}^{J} \exp(\mathbf{z}_m'\boldsymbol{\alpha} + \rho_m I_m)} \times \frac{\exp(\mathbf{x}_{jk}'\boldsymbol{\beta}_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j/\rho_j)}; \tag{15.36}
$$

see Section 15.12.3, where

$$
I_j = \ln \left( \sum_{l=1}^{K_j} \exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j/\rho_j) \right) \tag{15.37}
$$

is called the inclusive value or the log-sum. One attraction of the nested logit model
is that the probabilities $p_j$ and $p_{k|j}$ are essentially of conditional logit form.

The preceding results are for regressors that vary across alternatives. The algebra
can be adapted to alternative-invariant regressors $V_{jk} = \mathbf{z}'\boldsymbol{\alpha}_j + \mathbf{x}'\boldsymbol{\beta}_{jk}$, with a
normalization of one of the $\boldsymbol{\beta}_{jk}$. Algebraically all that is needed is a partition
$V_{jk} = A_j + B_{jk}$, where $A_j$ pertains to the limb and $B_{jk}$ pertains to both limb and branch.
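
The nested logit probabilities in (15.36) and (15.37) are straightforward to compute once the index values are known. The following is a minimal sketch, not a full estimator: the function name, the argument layout (`za[j]` for $\mathbf{z}_j'\boldsymbol{\alpha}$, `xb[j][k]` for $\mathbf{x}_{jk}'\boldsymbol{\beta}_j$), and the numerical values are all assumptions made for illustration.

```python
import numpy as np

def nested_logit_probs(za, xb, rho):
    """Two-level nested logit probabilities p_jk = p_j * p_{k|j}.

    za[j]  : limb-level index z_j'alpha
    xb[j]  : sequence of branch-level indices x_jk'beta_j for limb j
    rho[j] : scale parameter of limb j
    Returns a list with one array of p_jk per limb.
    """
    za = np.asarray(za, dtype=float)
    rho = np.asarray(rho, dtype=float)
    scaled = [np.asarray(b, dtype=float) / r for b, r in zip(xb, rho)]
    # Inclusive value (log-sum) for each limb, as in (15.37).
    incl = np.array([np.log(np.sum(np.exp(s))) for s in scaled])
    # Limb probabilities: first factor of (15.36).
    top = np.exp(za + rho * incl)
    p_limb = top / top.sum()
    # Branch probabilities conditional on limb: second factor of (15.36).
    return [p_limb[j] * np.exp(s) / np.sum(np.exp(s)) for j, s in enumerate(scaled)]

# Hypothetical index values for a 2-limb, 4-alternative model.
za = [0.2, 0.0]
xb = [[0.5, -0.3], [0.1, 0.4]]
probs = nested_logit_probs(za, xb, rho=[0.6, 0.8])
print(probs, sum(p.sum() for p in probs))

# With rho_j = 1 for all j the model collapses to conditional logit on V_jk.
probs_cl = nested_logit_probs(za, xb, rho=[1.0, 1.0])
v = np.concatenate([za[j] + np.asarray(xb[j]) for j in range(2)])
softmax = np.exp(v) / np.exp(v).sum()
print(np.concatenate(probs_cl), softmax)
```

The second comparison illustrates the statement that $\rho_j = 1$ leads back to the logit model.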


15.6.3.  Estimation  of  Nested  Logit 


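For the $i$th observation we observe $K_1 + \cdots + K_J$ outcomes $y_{ijk}$, where $y_{ijk} = 1$ if
alternative $jk$ is chosen and is zero otherwise. Then $p_{ijk} = p_{ij} \times p_{ik|j}$ and the density
for one observation $\mathbf{y}_i = (y_{i11}, \ldots, y_{iJK_J})$ can be compactly expressed as

$$
f(\mathbf{y}_i) = \prod_{j=1}^{J} \prod_{k=1}^{K_j} [p_{ij} \times p_{ik|j}]^{y_{ijk}} = \prod_{j=1}^{J} \left( p_{ij}^{y_{ij}} \prod_{k=1}^{K_j} p_{ik|j}^{y_{ijk}} \right),
$$

where $y_{ij} = \sum_{k=1}^{K_j} y_{ijk}$ equals one if limb $j$ is chosen and equals zero otherwise.
The density for the sample is $\prod_{i=1}^{N} f(\mathbf{y}_i)$. The FIML estimator maximizes

$$
\ln L = \sum_{i=1}^{N} \sum_{j=1}^{J} y_{ij} \ln p_{ij} + \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{k=1}^{K_j} y_{ijk} \ln p_{ik|j}, \tag{15.38}
$$

with respect to parameters $\boldsymbol{\alpha}$, $\boldsymbol{\beta}_j$, and $\rho_j$.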

An alternative, less-efficient estimation is the sequential estimator or LIML
estimator that exploits the partitioning of $p_{jk}$ into the product of $p_{k|j}$ and $p_j$. The first
stage bases estimation on the second term of the right-hand side of (15.38), which
from (15.36) is a conditional logit model with estimated parameter $\boldsymbol{\beta}_j/\rho_j$. The second
stage bases estimation on the first term of the right-hand side, which from (15.36) is a
conditional logit model with added regressor $\hat{I}_{ij}$, an estimate of the inclusive value in
(15.37) that can be computed using the first-stage parameter estimates. The $\boldsymbol{\alpha}$ and $\rho_j$
are obtained directly from the second stage, whereas the estimate of $\boldsymbol{\beta}_j$ equals the
estimate of $\rho_j$ times the first-stage estimate of $\boldsymbol{\beta}_j/\rho_j$.

This sequential estimator is less efficient than the FIML estimator, and at the second
stage the usual CL standard errors understate the true standard errors of the sequential
estimator since they do not allow for the estimation error in computing the inclusive
value. McFadden (1981) gives the formula for correct standard errors, or the bootstrap
can be used. The sequential alternative estimator was originally proposed at a time
when even conditional logit model estimation was challenging. Now it is relatively
simple to code the likelihood function, so it is best to use FIML. Sequential estimation
is potentially useful to provide starting values as the FIML log-likelihood is not
globally concave.

As an example we applied the nested logit model to the data of Section 15.2. The
nesting structure was shore or boat fishing at the higher level, with lower levels beach
or pier (for shore fishing) and private or charter (for boat fishing). The regressors $\mathbf{x}_{jk}$
in (15.36) that vary at the lower level were price ($P$) and catch rate ($C$). The regressors
$\mathbf{z}_j$ at the higher level that vary across shore or boat were an indicator variable $d$ equal
to one if shore fishing and $d \times I$, income interacted with the shore fishing indicator.
Estimation by conditional logit (corresponding to $\rho_1 = \rho_2 = 1$) yielded a fitted model
with $\ln L = -1252$, as expected smaller than the log-likelihood for the similar but
less restricted model given in the last column of Table 15.2. FIML estimation of the
corresponding nested logit model, with $\rho_1$ and $\rho_2$ now free to vary, led to a much higher
log-likelihood and rejection of the more restricted conditional logit model using
the $\chi^2(2)$ likelihood ratio test statistic.


15.6.4.  Discussion 

The main limitation of the nested logit model is that not all choice problems lend
themselves to an obvious nesting structure. One can still select the optimal nesting scheme
using likelihood ratio tests, where appropriate, or Akaike's information criterion.
However, the resulting scheme does not always accord with a priori expectations.

Another practical issue is that consistency of the nested logit model with choice
from an ARUM requires that the three conditions in Section 15.5.3 are satisfied. The
third of these conditions is satisfied globally if $0 \leq \rho_j \leq 1$, and with more than two
levels of nesting it is additionally required that $\rho$ at higher levels of the nest structure
does not exceed $\rho$ at lower levels of nesting. In practice it is possible to obtain estimates
of $\rho_j$ outside the unit interval. One can still use the model, as the choice probabilities
are proper, but the model may no longer come from an ARUM. Borsch-Supan and
others have considered local identification conditions under which the nested logit
model may be consistent with ARUM even if $\rho_j$ lie outside the unit interval. It can
also be useful to do a grid search over $\rho_j$ to constrain $\rho_j$ to the unit interval and to
enumerate the reduction in log-likelihood, if any, caused by doing so.

The nested logit model defined in (15.36) and (15.37) was proposed by McFadden
(1978), who derived it as a GEV model. An earlier variant of the nested logit
model was similar to (15.36) and (15.37), except that $\exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j/\rho_j)$ was replaced by
$\exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j)$. This had an alternative derivation as a natural extension of the CL model,
since CL is the special case of (15.36) and (15.37) with $\rho_j = 1$. See McFadden (1978,
p. 79), Maddala (1983, p. 70), and Greene (2003, p. 726).

It is very important to note that the two variants differ if $\rho_j$ differs across
alternatives; see Koppelman and Wen (1998) and Train (2003, p. 88). Some early studies
obtained sequential estimates that differed substantially from FIML estimates, casting
doubt on the robustness of the nested logit model. However, in some of these studies
the different estimators were being applied to different variants of the nested logit
model. Furthermore, even today different packages estimate different variants of the
model.

The  nested  logit  model  can  be  extended  to  higher  levels  of  alternatives  (or  nesting). 
For  example,  Goldberg  (1995)  has  five  levels:  (1)  buy  any  car;  (2)  buy  a  new  car  given 
yes  to  1;  (3)  which  of  nine  classes  of  car  was  purchased  given  yes  to  2;  (4)  foreign 
or  domestic;  (5)  model.  An  added  attraction  if  some  nests  have  numerous  choices  is 
that  it  is  sufficient  to  base  estimation  on  a  fixed  or  randomly  selected  subset  of  the 
alternatives  (see  McFadden,  1978). 


15.6.5.  Welfare  Analysis 

Welfare analysis for the ARUM was presented in Section 15.5.4. In general there is no
explicit solution for $\mathrm{E}[CV]$, the expected compensating variation.

Remarkably, for GEV models that are linear in income, $V(I - p_j, \mathbf{x}_j) = \alpha(I - p_j) + f(\mathbf{x}_j)$,
McFadden (1995) and earlier workers show that there is an explicit
solution

$$
\mathrm{E}[CV] = \frac{1}{\alpha} \left[ \ln G(e^{V_1''}, \ldots, e^{V_m''}) - \ln G(e^{V_1'}, \ldots, e^{V_m'}) \right],
$$

where the function $G(\cdot)$ for the GEV distribution is defined in (15.34), and $V'$ and $V''$
are the before and after values of the deterministic component of utility.

For GEV models with income appearing nonlinearly, however, there is no explicit
solution. Then one approach is the simulation method given in Section 15.5.4. For a
multinomial logit model this is simple as it is easy to draw extreme value errors using
the transformation method of Section 12.8.2: draw $u$ from the uniform on $(0, 1)$ and
then set $\varepsilon = -\ln(-\ln(u))$. For a more general nested logit model, however, it is
difficult to randomly draw from a GEV distribution even as simple as the bivariate extreme
value. McFadden (1995) proposed using MCMC with the Metropolis-Hastings
algorithm (see Section 13.5). Herriges and Kling (1999) give an excellent summary of
this simulation method with application to nested logit models for the fishing data of
Section 15.2, using various indirect utility functions including the translog.

More recently, Dagsvik and Karlstrom (2004) show that although there is no explicit
solution for $\mathrm{E}[CV]$ in the GEV model if income enters nonlinearly, it is analytically
possible to reduce $\mathrm{E}[CV]$ to a one-dimensional integral. Computing this integral
using Gaussian quadrature will be much simpler than employing the aforementioned
simulation method.
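
The transformation method for drawing type 1 extreme value errors, $\varepsilon = -\ln(-\ln(u))$, can be sketched as follows. The sample size and seed are arbitrary assumptions. As a sanity check, the draws are compared with the known mean (Euler's constant, about 0.5772) and variance ($\pi^2/6$) of the standard type 1 extreme value distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 500_000

# Draw u from the uniform on (0, 1); the lower bound is set to the smallest
# positive normal float so that log(u) is always finite.
u = rng.uniform(low=np.finfo(float).tiny, high=1.0, size=n_draws)

# Inverse-cdf transformation: F(eps) = exp(-exp(-eps)) = u.
eps = -np.log(-np.log(u))

print(eps.mean(), np.euler_gamma)   # sample mean vs. Euler's constant
print(eps.var(), np.pi ** 2 / 6)    # sample variance vs. pi^2 / 6
```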


15.7.  Random  Parameters  Logit 

The  random  parameters  logit  model  provides  a  simple  way  to  generalize  the  MNL 
or  CL  model  to  permit  the  utilities  of  each  alternative  to  be  correlated.  The  model  is 
perhaps  the  leading  microeconometrics  example  of  a  random  parameters  model  for 
cross-section data.


15.7.1. Random Parameters Logit Model

The random parameters logit (RPL) model specifies the utility to the $i$th individual
for the $j$th alternative to be

$$
U_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta}_i + \varepsilon_{ij}, \quad j = 1, 2, \ldots, m, \tag{15.39}
$$

where $\varepsilon_{ij}$ are iid extreme value, as for the CL model, but additionally permits the
parameters $\boldsymbol{\beta}_i$ to be random. The most common assumption is that

$$
\boldsymbol{\beta}_i \sim \mathcal{N}[\boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta]. \tag{15.40}
$$

One variation is to use the log-normal rather than normal distribution for parameters
whose sign is known a priori. This model is also called a mixed logit model, using
terminology borrowed from the panel setting for models with random parameters. By
reexpressing the MNL model as a CL model, the results that follow also cover a random
parameters MNL model.

The model can be rewritten as

$$
\begin{aligned}
U_{ij} &= \mathbf{x}_{ij}'\boldsymbol{\beta} + v_{ij}, \\
v_{ij} &= \mathbf{x}_{ij}'\mathbf{u}_i + \varepsilon_{ij},
\end{aligned}
$$

where $\mathbf{u}_i \sim \mathcal{N}[\mathbf{0}, \boldsymbol{\Sigma}_\beta]$. Then $\mathrm{Cov}[v_{ij}, v_{ik}] = \mathbf{x}_{ij}'\boldsymbol{\Sigma}_\beta\mathbf{x}_{ik}$, $j \neq k$, so the introduction
of random parameters has the attractive property of inducing correlation across
alternatives.

In most applications the covariance matrix is specified to be diagonal, and additionally
some of the diagonal entries may be set to zero. Then the number of covariance
parameters to estimate equals the number of components of $\boldsymbol{\beta}_i$ that are specified to be
random.

As an example, consider a mixed CL model with scalar regressor and parameters
$\beta$ and $\sigma_\beta$. Suppose the parameter estimates are $\hat{\beta} = 2.0$ with standard error 0.5 and
$\hat{\sigma}_\beta = 1.0$ with standard error 0.2. Then the null hypothesis of constant parameter, that
is, $\sigma_\beta = 0$, is strongly rejected since $t = 1.0/0.2 = 5.0$. The effect on $\Pr[y_i = j]$ of
an increase in $x_{ij}$ differs across individuals and is positive for about 97.7% of the
sample, since it is estimated that $\beta_i \sim \mathcal{N}[2.0, 1.0]$ and so $\Pr[\beta_i > 0] = \Phi(2.0)$. For an
application that emphasizes interpretation of estimated coefficients, see Revelt and Train (1998).
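
The arithmetic in the scalar example can be reproduced in a few lines. This is a sketch using only the numbers quoted in the example; the helper `norm_cdf` is our own, built on the standard error function, and the share is computed under the assumption that $\beta_i$ is exactly normal with the estimated mean and standard deviation.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

beta_hat = 2.0    # estimated mean of beta_i
sigma_hat = 1.0   # estimated standard deviation of beta_i
se_sigma = 0.2    # standard error of sigma_hat

# t statistic for the null hypothesis sigma_beta = 0.
t_stat = sigma_hat / se_sigma

# Share of individuals with beta_i > 0 when beta_i ~ N[2.0, 1.0].
share_positive = 1.0 - norm_cdf((0.0 - beta_hat) / sigma_hat)

print(t_stat)
print(share_positive)
```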

The industrial organization literature considers aggregation over consumers of
models similar to the RPL model to estimate demand parameters using market-level
data. See, for example, Berry (1994) and Nevo (2001), and also Allenby and
Rossi (1991).


15.7.2.  Estimation  of  Random  Parameters  Logit 

In the linear regression model with random parameters, OLS estimation yields estimates
of the means $\beta$ that are consistent though inefficient. In a nonlinear model,
however, estimators that fail to control for the randomness of the parameters will be
inconsistent. Thus the usual conditional logit MLE will be inconsistent if the dgp is
given by (15.39) and (15.40). Instead, ML estimation must explicitly account for the
stochastic process for $\beta_i$.

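If $\beta_i$ were known, so that the only source of randomness is $\varepsilon_{ij}$, a CL model is
obtained with probability $p_{ij} = e^{x_{ij}'\beta_i} / \sum_{l=1}^{m} e^{x_{il}'\beta_i}$. Since $\beta_i$ is in fact random we need
to integrate out this randomness. This yields

$$p_{ij} = \Pr[y_i = j] = \int \frac{e^{x_{ij}'\beta_i}}{\sum_{l=1}^{m} e^{x_{il}'\beta_i}}\, \phi(\beta_i | \beta, \Sigma_\beta)\, d\beta_i, \qquad (15.41)$$

where the integral is multidimensional and $\phi(\beta_i | \beta, \Sigma_\beta)$ denotes the multivariate normal
density for $\beta_i$ with mean $\beta$ and variance $\Sigma_\beta$.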

The MLE maximizes $\ln L_N = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij}$ with respect to $\beta$ and $\Sigma_\beta$. The
challenge is that there is no closed-form solution for the integral, whose dimension is
given by the number of components of $\beta_i$ that are random, with nonzero variance.
Estimation is therefore by simulation methods.

One approach is to approximate using the direct simulator (see Section 12.4.1).
This replaces the integral (15.41) by the average of $S$ evaluations of the integrand
at random draws of $\beta_i$ from the $\mathcal{N}[\beta, \Sigma_\beta]$ distribution. The MSL estimator then
maximizes

$$\ln \hat{L}_N(\beta, \Sigma_\beta) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{e^{x_{ij}'\beta_i^{(s)}}}{\sum_{l=1}^{m} e^{x_{il}'\beta_i^{(s)}}} \right], \qquad (15.42)$$

where $\beta_i^{(s)}$, $s = 1, \ldots, S$, are random draws from the density $\phi(\beta_i | \beta, \Sigma_\beta)$. Since $\beta$
and $\Sigma_\beta$ are unknown, this summation is embedded in an iterative procedure with
evaluation at $\hat{\beta}^{(r)}$ and $\hat{\Sigma}_\beta^{(r)}$ at the $r$th round. Consistency requires that $S \to \infty$ as
well as $N \to \infty$, and the simulation error is asymptotically negligible only if $S$ grows
faster than $\sqrt{N}$, so that $\sqrt{N}/S \to 0$ (see Section 12.4.3). Methods for speeding
up computation include use of Halton sequences (see Section 12.7.4) and alternative
simulators.
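
The direct simulator can be sketched in a few lines. The following is a minimal illustration, not a production estimator: it assumes a scalar regressor and a scalar random coefficient $\beta_i \sim \mathcal{N}[\beta, \sigma^2]$, the function names and the tiny data set are hypothetical, and the same seed is reused at every parameter value so that the simulated log-likelihood is a smooth function of $(\beta, \sigma)$.

```python
import math
import random

def cl_probs(x, beta):
    """Conditional logit probabilities; x[j] is the scalar regressor of alternative j."""
    v = [xj * beta for xj in x]
    vmax = max(v)                               # subtract the max for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def simulated_rpl_probs(x, beta, sigma, S, rng):
    """Average the CL probabilities over S draws beta_i^(s) ~ N[beta, sigma^2]."""
    acc = [0.0] * len(x)
    for _ in range(S):
        beta_s = rng.gauss(beta, sigma)
        for j, p in enumerate(cl_probs(x, beta_s)):
            acc[j] += p
    return [a / S for a in acc]

def simulated_loglik(data, beta, sigma, S, seed=12345):
    """Simulated log-likelihood; data is a list of (x, chosen_index) pairs."""
    rng = random.Random(seed)                   # same underlying draws at every (beta, sigma)
    return sum(math.log(simulated_rpl_probs(x, beta, sigma, S, rng)[chosen])
               for x, chosen in data)

# Hypothetical data: three individuals, three alternatives each
data = [([0.5, 1.0, 1.5], 2), ([1.2, 0.3, 0.9], 0), ([0.1, 0.8, 0.4], 1)]
print(simulated_loglik(data, beta=2.0, sigma=1.0, S=500))
```

In practice this function would be passed to a numerical optimizer; with $\sigma = 0$ it collapses to the ordinary conditional logit log-likelihood.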

An alternative estimator uses Bayesian methods with relatively flat priors. Train
(2001, 2003) specifies hierarchical priors with $\beta \sim \mathcal{N}[\beta^*, \Omega^*]$, where $\Omega^*$ is assumed
to be large, and with $\Sigma_\beta$ assumed to be inverse-Wishart distributed with degrees of
freedom $K = \dim[\beta]$ and scale parameter $I_K$. Rather than working with the posterior
for just $\beta$ and $\Sigma_\beta$ it is computationally quicker to additionally include $\beta_i$,
$i = 1, \ldots, N$. Then (1) the conditional posterior for $\beta | \Sigma_\beta, \beta_i$ is normal, (2) the
conditional posterior for $\Sigma_\beta | \beta, \beta_i$ is inverse Wishart, and (3) the conditional posterior for
$\beta_i | \Sigma_\beta, \beta$ is proportional to the integrand in (15.41). Given these conditional
posteriors estimation can be done using a variation of the Gibbs sampler (see Section
13.5.2), with the complication that draws for the third posterior need to use one
iteration of the Metropolis-Hastings algorithm (see Section 13.5.4) because the full set
of conditionals is not available. In an application this took similar computation time to
the MSL estimator and, given the relatively flat prior, yielded parameter estimates and
standard errors that were generally within 10% of those from MSL estimation.


15.7.3. Generalized Random Utility Models

Models more flexible than multinomial logit are desirable. In this regard there is
currently great enthusiasm regarding the random parameters logit model. McFadden and
Train (2000) show that any random utility model can be approximated arbitrarily well
by a mixed model, though this result requires appropriate choice of regressors and
mixing distribution.

There  is  no  reason  to  restrict  the  random  parameters  approach  to  multinomial  logit 
models.  For  example,  it  may  be  extended  to  nested  logit  models.  Moreover,  additional 
sources  of  randomness  may  be  incorporated,  notably  latent  classes  and  latent  variables. 

To present these extensions we begin with the ARUM (15.22). This specifies the
utility to individual $i$ of the $j$th alternative to be $U_{ij} = V_{ij}(x_i, \beta) + \varepsilon_{ij}$, where $x_i$
denotes observed data, $\beta$ denotes unknown parameters, and $\varepsilon_{ij}$ denotes an error
independent over $i$ but possibly correlated over $j$. Assume that the distribution of $\varepsilon_i$ is such
that (15.23) yields a closed-form solution for the choice probabilities denoted

$$p_{ij} = F_j(V_i(x_i, \beta), \theta_\varepsilon),$$

where $V_i(x_i, \beta) = [V_{i1}(x_i, \beta), \ldots, V_{im}(x_i, \beta)]$ and $\theta_\varepsilon$ denotes any unknown parameters
of the distribution of $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im})$. Such a closed-form solution is possible
if $\varepsilon_i$ has a GEV distribution with special cases leading to multinomial logit and nested
logit models.

A more general model introduces additional randomness into this model. First, the
previously deterministic part of utility becomes $V_{ij} = V_{ij}(x_i, \xi_i, \beta)$. Then assuming
that $\varepsilon_i$ is such that a closed-form solution for the probabilities exists conditional on $\xi_i$,
unconditionally

$$p_{ij} = \int F_j(V_i(x_i, \xi_i, \beta), \theta_\varepsilon)\, f(\xi_i | \theta_\xi)\, d\xi_i, \qquad (15.43)$$

where $f(\xi | \theta_\xi)$ denotes the density of $\xi$. The RPL model is an example with $V_{ij} =
x_{ij}'\beta + x_{ij}'\xi_i$, where $\xi_i$ is $\mathcal{N}[0, \Sigma]$ and is motivated via a random parameters argument.
However, $\xi_i$ can also be introduced as an additional disturbance term or as a relevant
latent variable. Second, individuals may be assumed to come from one of $C$ latent
classes; see Section 18.5 for a duration model example and Swait (2003) for a GEV
example of latent class or finite mixtures models. If $\beta$ and $\theta_\varepsilon$ vary by class then (15.43)
becomes unconditionally

$$p_{ij} = \sum_{c=1}^{C} \pi_c \int F_j(V_i(x_i, \xi_i, \beta_c), \theta_\varepsilon^c)\, f(\xi_i | \theta_\xi)\, d\xi_i, \qquad (15.44)$$

where $\pi_c$ denotes probability of membership in the $c$th class and typically $C = 2$ or
$C = 3$. The MSL estimator then maximizes


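$$\ln \hat{L}_N(\beta, \theta_\varepsilon) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \left[ \frac{1}{S} \sum_{s=1}^{S} \sum_{c=1}^{C} \pi_c F_j(V_i(x_i, \xi_i^{s}, \beta_c), \theta_\varepsilon^c) \right],$$

where $\xi_i^{s}$ denotes the $s$th draw from $f(\xi_i | \theta_\xi)$. Kamakura and Wedel (2004) estimate a
finite mixtures MNL model using Bayesian methods.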


Figure  15.1:  Generalized  random  utility  model. 


Walker and Ben-Akiva (2002) call such a model a generalized random utility
model. They cite many articles with such extensions, consider the use of stated
preference data to supplement revealed preference data, and provide a substantial
empirical illustration. Figure 15.1, derived from Walker and Ben-Akiva (2002), summarizes
the various extensions.

The  multinomial  modeling  literature  has  been  at  the  forefront  of  developing  and 
estimating  highly  structured  parametric  models  that  incorporate  random  parameters, 
latent  variables,  and  latent  parameters  and  combine  data  from  more  than  one  source. 
These  methods  are  applicable  to  any  type  of  cross-section  data,  not  just  discrete 
outcomes. 


15.8.  Multinomial  Probit 

An alternative and obvious way to introduce correlation across choices in the
unobserved component is to work with normally distributed errors. However, ML
estimation is difficult as in the most general case an $(m - 1)$-fold integral needs to be
calculated.


15.8.1.  Multinomial  Probit  Model 

The multinomial probit (MNP) model is an $m$-choice multinomial model, with utility
of the $j$th choice given by

$$U_j = V_j + \varepsilon_j, \quad j = 1, 2, \ldots, m, \qquad (15.45)$$

where the errors are joint normally distributed, with

$$\varepsilon \sim \mathcal{N}[0, \Sigma], \qquad (15.46)$$

where the $m \times 1$ vector $\varepsilon = [\varepsilon_1 \ldots \varepsilon_m]'$. Usually $V_j = x_j'\beta$ or $V_j = x'\beta_j$.

Different MNP models arise from different specifications of the covariance matrix
$\Sigma$. Some of the off-diagonal entries are specified to be nonzero, to permit correlation
across the errors, though some restrictions need to be placed on $\Sigma$. Note that if the
errors are uncorrelated the MNP still yields no closed-form solution for the probabilities
and it is easier to assume instead that the errors are extreme value and use the CL
or MNL models.

Restrictions on $\Sigma$ are needed to ensure identification. It is clear from (15.23) that,
for any ARUM, choice is determined by the differences in utility or errors. Thus we
consider the difference $U_j - U_1$ between utility of alternative $j$ and that of alternative
1, chosen to be the benchmark alternative. Bunch (1991) demonstrated that all but
one of the parameters of the covariance matrix of the errors $\varepsilon_j - \varepsilon_1$ are identified; see
the discussion at the end of Section 15.5.1. One way to achieve this identification
is to normalize $\varepsilon_1 = 0$, say, and then restrict one covariance element. For example,
if $m = 2$, set $\varepsilon_1 = 0$ so $\sigma_{11} = 0$ and $\sigma_{12} = 0$ and additionally restrict $\sigma_{22} = 1$. Then
$\varepsilon_2 - \varepsilon_1 = \varepsilon_2 \sim \mathcal{N}[0, 1]$, which is the binary probit model.

Additional restrictions on $\Sigma$ or $\beta$ may be needed for successful application. Keane
(1992) demonstrated that even if assumptions on the error covariance are made to
ensure just-identification, in practice the parameters of the MNP model may be highly
imprecisely estimated in models with regressors that do not vary with the alternative.
Further restrictions on the MNP model are then needed. This estimation imprecision is
qualitatively similar to high multicollinearity among regressors in a linear regression.
Keane found that exclusion restrictions on the regressors (with one exclusion for each
utility index) work well. Alternatively, and more commonly, further restrictions may
be placed on the covariance parameters.

A popular parsimonious model for the errors is the factor model

$$\varepsilon_j = v_j + \sum_{l=1}^{L} c_{jl}\, \xi_l, \quad j = 1, \ldots, m,$$

where $v_j$ and $\xi_1, \ldots, \xi_L$ are iid standard normal and $c_{jl}$ are weights called factor
loadings to be estimated. This model can greatly reduce the number of covariance
parameters, from $m(m + 1)/2$ to $L$, and requires an $(L + 1)$-dimensional integral. Numerical
methods, usually Gaussian quadrature, can be used for low values of $L$, whereas
simulation methods need to be used for larger $L$. For panel data the random effects model
(see Section 21.2.1) can be viewed as a factor model with error $u_{it} = \alpha_i + \varepsilon_{it}$, and the
factor model may be especially appropriate in a panel probit setting.
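
To see how the factor structure induces correlation, note that with $v_j$ and $\xi_l$ iid standard normal the implied covariance is $\mathrm{Cov}[\varepsilon_j, \varepsilon_k] = \sum_l c_{jl} c_{kl} + 1[j = k]$. The sketch below computes this matrix; the function name and the loadings are hypothetical values chosen for illustration.

```python
def factor_covariance(C):
    """Covariance of eps_j = v_j + sum_l C[j][l] * xi_l with v_j, xi_l iid N[0, 1]."""
    m = len(C)
    return [[sum(cjl * ckl for cjl, ckl in zip(C[j], C[k])) + (1.0 if j == k else 0.0)
             for k in range(m)]
            for j in range(m)]

# Hypothetical loadings: m = 3 alternatives, L = 1 factor
C = [[0.5], [1.0], [-0.5]]
Sigma = factor_covariance(C)
for row in Sigma:
    print(row)
```

Alternatives whose loadings have the same sign are positively correlated, and those with opposite signs are negatively correlated.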

15.8.2.  Estimation  of  Multinomial  Probit 

The regression and error variance parameters are preferably estimated by ML with
log-likelihood given in Section 15.3.2. The challenge is that there is no closed-form
expression for the choice probabilities.

For a three-choice MNP model

$$p_1 = \Pr[y = 1] = \int_{-\infty}^{-V_{31}} \int_{-\infty}^{-V_{21}} f(\varepsilon_{21}, \varepsilon_{31})\, d\varepsilon_{21}\, d\varepsilon_{31}$$

(see (15.24)), where $f(\varepsilon_{21}, \varepsilon_{31})$ is a bivariate normal with as many as two free
covariance parameters and $V_{21}$ and $V_{31}$ depend on regressors and parameters $\beta$. This
bivariate normal integral can be quickly evaluated numerically. More generally, however,
an $m$-choice model requires numerical evaluation of an $(m - 1)$-variate integral.
A trivariate normal integral is the limit for numerical methods, limiting standard
numerical integration methods to a four-choice MNP model.

For larger models an alternative is to use simulation methods. For simplicity we
refer to the three-choice MNP model. One possibility is to use the frequency simulator
that approximates $p_1$ by the fraction of draws of $(\varepsilon_{21}, \varepsilon_{31})$ that are less than
$(-V_{21}, -V_{31})$. From Section 12.7.1 this simulator is not smooth and it can be very
inefficient (see Section 12.7.2). Furthermore, in the current setting it is possible that it
yields boundary values of $\hat{p}_1 = 0$ or 1. In general it is better to use importance
sampling, detailed in Section 12.7.2. For Monte Carlo integration over a region of the
multivariate normal a very popular importance sampler is the GHK simulator, due to
Geweke (1992), Hajivassiliou and McFadden (1994), and Keane (1994). This recursively
truncates the multivariate normal pdf. Compared to the frequency simulator it
is smooth, requires many fewer draws for alternatives with low probability of being
chosen, and is unlikely to have boundary problems. Train (2003) provides a detailed
account of this method.
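
The contrast between the two simulators can be illustrated for the bivariate case. The sketch below is a simplified illustration under stated assumptions: the error differences are $\mathcal{N}[0, \Omega]$ with a known $2 \times 2$ matrix $\Omega$, the upper limits are passed as `a = (-V21, -V31)`, and the GHK recursion has a single step (truncate the first standardized error, then weight by the conditional probability for the second). All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def cholesky_2x2(omega):
    """Lower-triangular factor (l11, l21, l22) with L L' = omega."""
    l11 = math.sqrt(omega[0][0])
    l21 = omega[1][0] / l11
    l22 = math.sqrt(omega[1][1] - l21 ** 2)
    return l11, l21, l22

def frequency_simulator(a, omega, S, rng):
    """Fraction of draws (e21, e31) ~ N[0, omega] with e21 < a[0] and e31 < a[1]."""
    l11, l21, l22 = cholesky_2x2(omega)
    hits = 0
    for _ in range(S):
        eta1, eta2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if l11 * eta1 < a[0] and l21 * eta1 + l22 * eta2 < a[1]:
            hits += 1
    return hits / S

def ghk_simulator(a, omega, S, rng):
    """GHK estimate of Pr[e21 < a[0], e31 < a[1]] by recursive truncation."""
    l11, l21, l22 = cholesky_2x2(omega)
    w1 = STD_NORMAL.cdf(a[0] / l11)                      # Pr[eta1 < a1 / l11]
    total = 0.0
    for _ in range(S):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep inv_cdf argument inside (0, 1)
        eta1 = STD_NORMAL.inv_cdf(u * w1)                # draw eta1 from the truncated normal
        w2 = STD_NORMAL.cdf((a[1] - l21 * eta1) / l22)   # conditional probability for eta2
        total += w1 * w2
    return total / S

# iid N[0, 1] utility errors imply differenced errors with variance 2 and covariance 1;
# with equal utility indexes (V21 = V31 = 0) the true value of p_1 is 1/3.
omega = [[2.0, 1.0], [1.0, 2.0]]
rng = random.Random(42)
print(frequency_simulator((0.0, 0.0), omega, 1000, rng))
print(ghk_simulator((0.0, 0.0), omega, 1000, rng))
```

Each GHK draw contributes a value strictly between zero and one, which is why the simulator is smooth in the parameters and avoids the boundary values that the frequency simulator can produce.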

The preceding discussion considers evaluation of MNP probabilities assuming
knowledge of $\beta$ and $\Sigma$. In fact we need to estimate $\beta$ and $\Sigma$. The maximum
simulated likelihood estimator maximizes

$$\ln \hat{L}_N(\beta, \Sigma) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \hat{p}_{ij},$$

where the $\hat{p}_{ij}$ are obtained using the GHK or other simulator. Consistency requires the
number of draws in the simulator $S \to \infty$ as well as $N \to \infty$. The method is very
burdensome. At the $r$th round of an iterative procedure (see Chapter 10) the estimates
are $\hat{\beta}^{(r)}$ and $\hat{\Sigma}^{(r)}$ and the update requires recalculating $\hat{p}_{ij}^{(r)}$, which requires $S$ draws for
each of $N$ individuals.

An alternative estimation procedure is the method of simulated moments
(see Section 12.5). From (15.8) a consistent method of moments estimator solves
$\sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - p_{ij}) z_i = 0$, where, for example, $z_i = x_i$. The corresponding MSM
estimator of $\beta$ and $\Sigma$ solves the estimating equations

$$\sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - \hat{p}_{ij}) z_i = 0,$$

where the $\hat{p}_{ij}$ are obtained using an unbiased simulator. Then $(y_{ij} - \hat{p}_{ij}) z_i$ is unbiased
for $(y_{ij} - p_{ij}) z_i$, so consistent estimation is possible even if $S = 1$. This can greatly
reduce computation. However, there is an efficiency loss for low $S$, and even for large
$S$ MSM is less efficient than MSL since in this example the method of moments is less
efficient than ML. A less-used related method that is as efficient as MSL is the method
of simulated scores (see Hajivassiliou and McFadden, 1998).

An alternative estimator uses Bayesian methods. Unlike RPL there is no closed-form
solution for the probabilities, which need to be derived from the utilities. The
latent utilities $U_i = (U_{i1}, \ldots, U_{im})$ are introduced as auxiliary variables and the data
augmentation approach (see Section 13.7) is used. Letting $U = (U_1, \ldots, U_N)$ and
$y = (y_1, \ldots, y_N)$ we have that the Gibbs sampler cycles among (1) the conditional
posterior for $\beta | y, U, \Sigma$, (2) the conditional posterior for $\Sigma | y, \beta, U$, and (3) the
posterior for $U_i | y, \beta, \Sigma$. Albert and Chib (1993) provide a quite general treatment for both
unordered and ordered multinomial models. McCulloch and Rossi (1994) provide a
substantive MNP application. Chib (2001) discusses the complication of imposing the
restrictions on $\Sigma$ needed for identification (see Section 15.8.1).

15.8.3.  Discussion 

Both MNP and RPL models lack a closed-form solution for $p_{ij}$. However, for RPL
there is at least a closed-form solution conditional on $\beta_i$ and the only problem is to
integrate out $\beta_i$. For the MNP model, which predates the RPL model, there is no such
conditional result and approximating $p_{ij}$ becomes more challenging, especially if $p_{ij}$
is close to zero or one. It appears to be easier to get model flexibility through nested
logit, RPL or mixture models than by use of MNP.


15.9.  Ordered,  Sequential,  and  Ranked  Outcomes 

In this section we present models with more structure than unordered models, such as
those with a natural ordering of alternatives or sequencing of decisions. Analysis is
straightforward as appropriate models are well established and estimation is again by
MLE based on (15.4), with different models leading to different specifications of the
probabilities $p_{ij}$.


15.9.1.  Ordered  Multinomial  Models 

Suppose that there is a natural ordering of alternatives. For example, self-rated health
status may be one of excellent, good, fair, or poor. Such data can be modeled by
an unordered multinomial model, but a much more parsimonious and sensible
model is one that takes account of this ordering.

The starting point is an index model, with single latent variable

$$y_i^* = x_i'\beta + u_i, \qquad (15.47)$$

where $x$ here does not include an intercept, a departure from Section 14.4.1. As $y^*$
crosses a series of increasing unknown thresholds we move up the ordering of
alternatives. For example, for very low $y^*$ health status is poor, for $y^* > \alpha_1$ health status
improves to fair, for $y^* > \alpha_2$ it improves further to good, and so on.

In general for an $m$-alternative ordered model we define

$$y_i = j \quad \text{if } \alpha_{j-1} < y_i^* \le \alpha_j, \qquad (15.48)$$

where $\alpha_0 = -\infty$ and $\alpha_m = \infty$. Then

$$\begin{aligned}
\Pr[y_i = j] &= \Pr[\alpha_{j-1} < y_i^* \le \alpha_j] \\
&= \Pr[\alpha_{j-1} < x_i'\beta + u_i \le \alpha_j] \\
&= \Pr[\alpha_{j-1} - x_i'\beta < u_i \le \alpha_j - x_i'\beta] \\
&= F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta), \qquad (15.49)
\end{aligned}$$

where $F$ is the cdf of $u_i$. The regression parameters $\beta$ and the $(m - 1)$ threshold
parameters $\alpha_1, \ldots, \alpha_{m-1}$ are obtained by maximizing the log-likelihood (15.5) with
$p_{ij}$ defined in (15.49). For the ordered logit model $u$ is logistic distributed with
$F(z) = e^z / (1 + e^z)$. For the ordered probit model $u$ is standard normal distributed
and $F(\cdot)$ is the standard normal cdf. Letting $K$ denote the number of regressors
excluding the intercept, an $m$-choice ordered model has $K + m - 1$ parameters whereas
an MNL model has $(m - 1)(K + 1)$ parameters.
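
As an illustration of (15.49), the sketch below computes ordered logit probabilities for a given index value and a set of thresholds. The function names and the numerical values are hypothetical; passing a different cdf (for example the standard normal cdf) would give the ordered probit probabilities.

```python
import math

def logistic_cdf(z):
    """Logistic cdf F(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_probs(xb, thresholds, cdf=logistic_cdf):
    """Pr[y = j], j = 1..m, for index xb = x'beta and increasing thresholds alpha_1..alpha_{m-1}."""
    cuts = [cdf(a - xb) for a in thresholds]   # F(alpha_j - x'beta), j = 1..m-1
    upper = cuts + [1.0]                       # alpha_m = +infinity gives F = 1
    lower = [0.0] + cuts                       # alpha_0 = -infinity gives F = 0
    return [u - l for u, l in zip(upper, lower)]

# Hypothetical four-category example (poor, fair, good, excellent)
probs = ordered_probs(xb=0.4, thresholds=[-1.0, 0.0, 1.5])
print(probs, sum(probs))
```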

The sign of the regression parameters $\beta$ can be immediately interpreted as
determining whether or not the latent variable $y^*$ increases with the regressor. For marginal
effects in the probabilities

$$\frac{\partial \Pr[y_i = j]}{\partial x_i} = \{F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta)\}\beta,$$

where $F'$ denotes the derivative of $F$. The term in braces can be positive or negative.

This model can also be applied to count data that take just a few values. Cameron
and Trivedi (1986) applied the ordered probit model to number of doctor consultations.
Hausman, Lo, and MacKinlay (1992) applied the ordered probit to data on changes
in a count, which can be negative, and additionally modeled the error term $u_i$ to be
heteroskedastic.


15.9.2.  Sequential  Multinomial  Models 

In some situations decisions are made sequentially. For example, one might first
decide whether or not to go to college. If no college is chosen then $y = 1$. If $y \neq 1$
then decide whether to go to a two-year college ($y = 2$) or four-year college ($y = 3$).
Given specification of this sequence the probabilities are easily obtained. For example,
model the first decision by a probit model and the second decision, if relevant,
by a probit model. Then $\Pr[y = 1] = \Phi(x_1'\beta_1)$ and $\Pr[y = 2 | y \neq 1] = \Phi(x_2'\beta_2)$. The
unconditional probability is

$$\Pr[y = 2] = \Pr[y = 2 | y \neq 1] \times \Pr[y \neq 1] = \Phi(x_2'\beta_2)(1 - \Phi(x_1'\beta_1)).$$

The parameters $\beta_1$ and $\beta_2$ can be estimated by maximizing the log-likelihood function
(15.5), where $p_{i1} = \Phi(x_{1i}'\beta_1)$, $p_{i2}$ is given in the preceding equation, and $p_{i3} = 1 -
p_{i1} - p_{i2}$.
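
The three probabilities of this sequential probit example can be computed as follows. This is a minimal sketch; the function name and the index values are hypothetical.

```python
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal cdf

def sequential_probit_probs(xb1, xb2):
    """Probabilities for y = 1 (no college), y = 2 (two-year), y = 3 (four-year)."""
    p1 = Phi(xb1)                 # Pr[y = 1]
    p2 = Phi(xb2) * (1.0 - p1)    # Pr[y = 2 | y != 1] * Pr[y != 1]
    p3 = 1.0 - p1 - p2            # remaining probability
    return p1, p2, p3

print(sequential_probit_probs(xb1=0.0, xb2=0.0))
```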

This approach relies on correct specification of the sequence of decision making. A
better model for this choice example may be a three-choice nested logit model where
the errors in the utilities for two-year college and four-year college are correlated with
each other and independent of the error in the utility for no college. These models can
be compared using the likelihood-based methods given in Section 8.5.


15.9.3.  Ranked  Data  Models 

The models discussed thus far have assumed that alternatives are mutually exclusive
and only one alternative is chosen. More generally, alternatives may be ranked,
especially with stated preference data. For example, both first and second choices may be
known.

The rank-ordered logit model is simple to estimate (see Beggs, Cardell, and
Hausman, 1981). Consider a four-alternative conditional logit model with alternative
2 the first choice and alternative 3 the second choice. Alternative 2 is chosen from all
four alternatives and then alternative 3 is chosen from the remaining alternatives 1, 3,
and 4. The joint probability of these first and second choices is

$$\frac{e^{x_2'\beta}}{e^{x_1'\beta} + e^{x_2'\beta} + e^{x_3'\beta} + e^{x_4'\beta}} \times \frac{e^{x_3'\beta}}{e^{x_1'\beta} + e^{x_3'\beta} + e^{x_4'\beta}}.$$

Estimation is by ML given similar expressions for the other 11 joint probabilities.
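
A short sketch makes the structure of the 12 joint probabilities concrete. It assumes the utility indexes $x_j'\beta$ have already been computed; the function name and the values in `v` are hypothetical, and indices are zero-based in the code.

```python
import math
from itertools import permutations

def rank_ordered_prob(v, first, second):
    """Pr[first choice, second choice] for utility indexes v[j] = x_j'beta."""
    e = [math.exp(vj) for vj in v]
    p_first = e[first] / sum(e)                    # chosen from all alternatives
    p_second = e[second] / (sum(e) - e[first])     # chosen from the remaining alternatives
    return p_first * p_second

v = [0.2, 1.0, 0.5, -0.3]                          # hypothetical indexes for alternatives 1..4
p_23 = rank_ordered_prob(v, first=1, second=2)     # alternative 2 first, alternative 3 second
total = sum(rank_ordered_prob(v, a, b) for a, b in permutations(range(4), 2))
print(p_23, total)
```

The 12 ordered pairs exhaust the possible first and second choices, so their probabilities sum to one.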

For the multinomial probit model there is no similar simplification. Hajivassiliou
and Ruud (1994) present a method to simulate the joint probabilities; they use the
rank-ordered probit model to illustrate a variety of simulation-based estimators.


15.10.  Multivariate  Discrete  Outcomes 

The  preceding  models,  aside  from  rank-ordered  models,  are  models  for  one  discrete 
dependent  variable  that  takes  one  of  m  mutually  exclusive  values.  Now  we  consider 
models  when  there  is  more  than  one  discrete  outcome.  The  log-likelihood  function 
is  similar  to  (15.5)  for  the  multinomial  model,  with  different  models  corresponding 
to  different  functional  forms  for  the  probabilities.  These  probabilities  may  need  to 
account  for  correlation  of  the  two  outcomes  and  possibly  simultaneity. 


15.10.1.  Bivariate  Discrete  Outcomes 

For simplicity consider bivariate discrete data $(y_{1i}, y_{2i})$. For example, in a joint
model of labor supply and fertility the dependent variables $(y_{1i}, y_{2i})$ for individual
$i$ may be $y_{1i} = 2$ if work and $y_{1i} = 1$ if do not work, and $y_{2i} = 2$ if have children and
$y_{2i} = 1$ if have no children.

More generally, $y_1$ may take values $1, \ldots, m_1$ and $y_2$ may take values $1, \ldots, m_2$.
For individual $i$ define

$$p_{ijk} = \Pr[y_{1i} = j, y_{2i} = k], \quad j = 1, \ldots, m_1, \quad k = 1, \ldots, m_2. \qquad (15.50)$$

Note that $p_{ijk}$ define probabilities of mutually exclusive events and $\sum_j \sum_k p_{ijk} = 1$.
Define $m_1 \times m_2$ corresponding binary indicator variables $y_{ijk} = 1$ if $(y_{1i} = j, y_{2i} = k)$
and $y_{ijk} = 0$ otherwise. Then the joint density for the $i$th observation is

$$f(y_{1i}, y_{2i}) = \prod_{j=1}^{m_1} \prod_{k=1}^{m_2} p_{ijk}^{y_{ijk}}.$$

The log-likelihood is then $\sum_{i=1}^{N} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} y_{ijk} \ln p_{ijk}$ and estimation is by ML as in
Section 15.4.2.

The  essential  difference  between  the  multivariate  and  multinomial  models  is  in  the 
specification  of  the  functional  form  for  the  probabilities. 

In the simplest case the two discrete dependent variables are independent and $p_{ijk} =
\Pr[y_{1i} = j] \times \Pr[y_{2i} = k]$. Then $y_1$ and $y_2$ can be modeled using separate multinomial
models.

If the two variables are instead viewed as interrelated, a simple approach is to use a
multinomial logit model for the probabilities $p_{ijk}$. Then the bivariate outcomes $(y_1, y_2)$
are essentially treated as $m_1 \times m_2$ univariate outcomes. For example, in the labor
supply and fertility example one of the four outcomes is then work and have children.

In  the  next  section  we  consider  models  between  these  two  extremes. 


15.10.2.  Bivariate  Probit 

The  bivariate  probit  model  is  a  joint  model  for  two  binary  outcomes  that  generalizes 
the  index  function  model  (see  Section  14.4.1)  from  one  latent  variable  to  two  latent 
variables  that  may  be  correlated. 

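Define the unobserved latent variables

$$\begin{aligned}
y_1^* &= x_1'\beta_1 + \varepsilon_1, \\
y_2^* &= x_2'\beta_2 + \varepsilon_2,
\end{aligned} \qquad (15.51)$$

where $\varepsilon_1$ and $\varepsilon_2$ are joint normal with means zero, variances one, and correlation
$\rho$. Then the bivariate probit model specifies the observed outcomes to be

$$y_1 = \begin{cases} 2 & \text{if } y_1^* > 0, \\ 1 & \text{if } y_1^* \le 0, \end{cases} \qquad
y_2 = \begin{cases} 2 & \text{if } y_2^* > 0, \\ 1 & \text{if } y_2^* \le 0, \end{cases}$$

where we use values (2, 1) rather than (1, 0) to be consistent with the notation of this
chapter. This model collapses to two separate probit models for $y_1$ and $y_2$ when the
error correlation $\rho = 0$.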

When $\rho \neq 0$ there is no closed-form solution for the probabilities. For example,

$$\begin{aligned}
p_{22} &= \Pr[y_1 = 2, y_2 = 2] \\
&= \Pr[y_1^* > 0, y_2^* > 0] \\
&= \Pr[-\varepsilon_1 < x_1'\beta_1, -\varepsilon_2 < x_2'\beta_2] \\
&= \Pr[\varepsilon_1 < x_1'\beta_1, \varepsilon_2 < x_2'\beta_2] \\
&= \int_{-\infty}^{x_1'\beta_1} \int_{-\infty}^{x_2'\beta_2} \phi(z_1, z_2, \rho)\, dz_2\, dz_1 \\
&= \Phi(x_1'\beta_1, x_2'\beta_2, \rho),
\end{aligned}$$

where $\phi(z_1, z_2, \rho)$ and $\Phi(z_1, z_2, \rho)$ are, respectively, the standardized bivariate normal
density and cdf for $(z_1, z_2)$ with zero means, unit variances, and correlation $\rho$, and the
fourth equality holds for the bivariate normal with mean zero.

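Performing similar algebra for the other possible outcomes yields

$$p_{jk} = \Pr[y_1 = j, y_2 = k] = \Phi(q_1 x_1'\beta_1,\; q_2 x_2'\beta_2,\; q_1 q_2 \rho),$$

where $q_l = 1$ if $y_l = 2$ and $q_l = -1$ if $y_l = 1$ for $l = 1, 2$; the correlation enters as
$q_1 q_2 \rho$ because reversing the sign of one error reverses the sign of the correlation. This
is the basis for ML estimation, detailed in Greene (2003), who also considers computation of marginal
effects.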
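
The four bivariate probit probabilities can be evaluated with a one-dimensional numerical integral, using the identity $\Phi(a, b, \rho) = \int_{-\infty}^{a} \phi(z)\, \Phi\big((b - \rho z)/\sqrt{1 - \rho^2}\big)\, dz$. The sketch below is an illustration only: it assumes $|\rho| < 1$, uses a simple midpoint rule rather than a library routine, applies the sign-adjusted correlation $q_1 q_2 \rho$ (which is needed for the four probabilities to sum to one), and all function names are hypothetical.

```python
import math
from statistics import NormalDist

N01 = NormalDist()

def bvn_cdf(a, b, rho, n=4000, lower=-8.0):
    """Pr[z1 < a, z2 < b] for a standard bivariate normal, by midpoint integration over z1."""
    if a <= lower:
        return 0.0
    h = (a - lower) / n
    s = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(n):
        z = lower + (i + 0.5) * h
        total += N01.pdf(z) * N01.cdf((b - rho * z) / s)   # density of z1 times Pr[z2 < b | z1]
    return total * h

def biprobit_probs(xb1, xb2, rho):
    """p_jk for j, k in {1, 2}, with q = +1 for outcome 2 and q = -1 for outcome 1."""
    q = {1: -1.0, 2: 1.0}
    return {(j, k): bvn_cdf(q[j] * xb1, q[k] * xb2, q[j] * q[k] * rho)
            for j in (1, 2) for k in (1, 2)}

probs = biprobit_probs(xb1=0.3, xb2=-0.5, rho=0.4)   # hypothetical index values
print(probs, sum(probs.values()))
```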

Implementation requires evaluation of a bivariate normal integral, which is
numerically feasible. Generalizations to multivariate probit are obvious though will
experience numerical challenges because of higher order integrals. If each outcome is
ordered then the model can be generalized to a bivariate ordered probit model.

One can also consider a simultaneous equations probit model that generalizes
(15.51) to allow the right-hand side variables to be endogenous. For example, the first
equation for $y_1^*$ may include $y_2^*$ and/or $y_2$ as regressors and similarly for $y_2^*$, with some
restrictions required to ensure the model is identified. This model is similar to the
simultaneous equations Tobit model discussed in Section 16.8.2.


15.11.  Semiparametric  Estimation 

Some studies have extended semiparametric estimation methods to models for
unordered multinomial data. Abe (1999) estimated the conditional logit model with $x_{ij}'\beta$
in (15.10) replaced by the additive model form $\sum_p \beta_p f_p(x_{ijp})$, where $p$ denotes the
$p$th component of $x_{ij}$ and the function $f_p(\cdot)$ is estimated from the data. L-F. Lee (1995)
extended the Klein and Spady (1993) estimator (see Section 14.7) from binary
outcomes to multinomial outcomes. Semiparametric methods for multiple-index models
can also be applied to the multinomial unordered model. The challenge is to ensure
that predicted probabilities lie between zero and one and sum to one.

Ordered models lend themselves well to semiparametric analysis since they involve
an index $x'\beta$ that crosses a number of thresholds. See, for example, Klein and Sherman
(2002), who present an estimator that is $\sqrt{N}$-consistent and asymptotically normal for
both regression and threshold points up to location and scale, under the assumption
that errors are independent of regressors.


15.12.  Derivations  for  MNL,  CL,  and  NL  Models 

We  consider  the  conditional  and  multinomial  logit  models,  deriving  first  and  second 
derivatives  of  the  log-likelihood  function  and  expressions  for  the  effect  of  changes  in 
regressors  on  the  probabilities.  Then  the  nested  logit  (NL)  model  is  derived  from  the 
GEV  model. 


15.12.1.  Conditional  Logit 

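The conditional logit probability is $p_{ij} = e^{x_{ij}'\beta} / \sum_l e^{x_{il}'\beta}$. Differentiation by parts
yields

$$\begin{aligned}
\frac{\partial p_{ij}}{\partial \beta}
&= \frac{e^{x_{ij}'\beta}}{\sum_l e^{x_{il}'\beta}}\, x_{ij} - \frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2} \sum_l e^{x_{il}'\beta} x_{il} \\
&= p_{ij} x_{ij} - p_{ij} \sum_l p_{il} x_{il} = p_{ij} x_{ij} - p_{ij} \bar{x}_i = p_{ij}(x_{ij} - \bar{x}_i),
\end{aligned}$$

where $\bar{x}_i = \sum_l p_{il} x_{il}$. Then

$$\frac{\partial \mathcal{L}}{\partial \beta} = \sum_i \sum_j \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta}
= \sum_i \sum_j \frac{y_{ij}}{p_{ij}}\, p_{ij}(x_{ij} - \bar{x}_i) = \sum_i \sum_j y_{ij}(x_{ij} - \bar{x}_i).$$

It follows that

$$\begin{aligned}
\frac{\partial^2 \mathcal{L}}{\partial \beta\, \partial \beta'}
&= -\sum_i \sum_j y_{ij} \frac{\partial \bar{x}_i}{\partial \beta'} \\
&= -\sum_i \sum_j y_{ij} \frac{\partial \sum_l p_{il} x_{il}}{\partial \beta'} \\
&= -\sum_i \sum_j y_{ij} \sum_l p_{il}(x_{il} - \bar{x}_i) x_{il}' \\
&= -\sum_i \sum_j p_{ij}(x_{ij} - \bar{x}_i) x_{ij}' \\
&= -\sum_i \sum_j p_{ij}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)',
\end{aligned}$$

which is (15.15). The second to last equality uses the fact that $y_{ij}$ equals one for
exactly one of the choices and zero otherwise, so that $\sum_j y_{ij} \sum_l a_{il} = \sum_l a_{il} =
\sum_j a_{ij}$, and the last equality uses $\sum_j p_{ij}(x_{ij} - \bar{x}_i)\bar{x}_i' = \sum_j (p_{ij} x_{ij} - p_{ij}\bar{x}_i)\bar{x}_i' =
(\bar{x}_i - \bar{x}_i)\bar{x}_i' = 0$ as $\sum_j p_{ij} = 1$.

Now consider the effect of changing regressors. For the conditional logit model

$$\frac{\partial p_{ij}}{\partial x_{ij}} = \frac{e^{x_{ij}'\beta}}{\sum_l e^{x_{il}'\beta}}\, \beta - \frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2}\, e^{x_{ij}'\beta} \beta = p_{ij}(1 - p_{ij})\beta,$$

whereas for $j \neq k$

$$\frac{\partial p_{ij}}{\partial x_{ik}} = -\frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2}\, e^{x_{ik}'\beta} \beta = -p_{ij} p_{ik} \beta.$$

Combining these two results yields (15.18).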
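
The analytical derivative $\partial p_{ij} / \partial \beta = p_{ij}(x_{ij} - \bar{x}_i)$ can be verified numerically. The sketch below compares it with a central finite difference for one individual; the function names, regressor values, and parameter values are hypothetical.

```python
import math

def cl_probs(X, beta):
    """Conditional logit probabilities; X[j] is the regressor vector of alternative j."""
    v = [sum(x * b for x, b in zip(xj, beta)) for xj in X]
    vmax = max(v)                                   # subtract the max for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def analytic_grad(X, beta, j):
    """dp_ij/dbeta = p_ij * (x_ij - xbar_i), where xbar_i = sum_l p_il * x_il."""
    p = cl_probs(X, beta)
    K = len(beta)
    xbar = [sum(p[l] * X[l][k] for l in range(len(X))) for k in range(K)]
    return [p[j] * (X[j][k] - xbar[k]) for k in range(K)]

def numeric_grad(X, beta, j, h=1e-6):
    """Central finite-difference approximation to dp_ij/dbeta."""
    grad = []
    for k in range(len(beta)):
        up, dn = list(beta), list(beta)
        up[k] += h
        dn[k] -= h
        grad.append((cl_probs(X, up)[j] - cl_probs(X, dn)[j]) / (2.0 * h))
    return grad

X = [[1.0, 0.5], [0.2, 1.5], [0.7, -0.3]]           # three alternatives, two regressors
beta = [0.8, -0.4]
print(analytic_grad(X, beta, 0))
print(numeric_grad(X, beta, 0))
```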


15.12.2.  Multinomial  Logit 

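The multinomial logit probability is $p_{ij} = e^{x_i'\beta_j} / \sum_l e^{x_i'\beta_l}$. Differentiation using the quotient rule yields

$$\frac{\partial p_{ij}}{\partial \beta_j} = \frac{e^{x_i'\beta_j}}{\sum_l e^{x_i'\beta_l}}\, x_i - \frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2}\, e^{x_i'\beta_j} x_i = p_{ij}x_i - p_{ij}p_{ij}x_i,$$

whereas for $k \neq j$

$$\frac{\partial p_{ij}}{\partial \beta_k} = -\frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2}\, e^{x_i'\beta_k} x_i = -p_{ij}p_{ik}x_i.$$

Combining we have

$$\frac{\partial p_{ij}}{\partial \beta_k} = \delta_{ijk}p_{ij}x_i - p_{ij}p_{ik}x_i = p_{ij}(\delta_{ijk} - p_{ik})x_i,$$

where the indicator variable $\delta_{ijk} = 1$ if $j = k$ and $\delta_{ijk} = 0$ otherwise, and

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \beta_k}
&= \sum_i \sum_j \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta_k} \\
&= \sum_i \sum_j \frac{y_{ij}}{p_{ij}} (\delta_{ijk}p_{ij} - p_{ij}p_{ik})x_i \\
&= \sum_i \Bigl(\sum_j y_{ij}\delta_{ijk} - \sum_j y_{ij}p_{ik}\Bigr)x_i \\
&= \sum_i (y_{ik} - p_{ik})x_i,
\end{aligned}$$

as stated in (15.16), where the last line uses the definition of $\delta_{ijk}$ and $\sum_j y_{ij} = 1$. For the second derivative we have

$$\frac{\partial^2 \mathcal{L}}{\partial \beta_j\, \partial \beta_k'} = -\sum_i x_i \frac{\partial p_{ij}}{\partial \beta_k'} = -\sum_i p_{ij}(\delta_{ijk} - p_{ik})x_i x_i',$$

which yields (15.17).

When regressors change

$$\frac{\partial p_{ij}}{\partial x_i} = \frac{e^{x_i'\beta_j}}{\sum_l e^{x_i'\beta_l}}\,\beta_j - \frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2} \sum_l e^{x_i'\beta_l}\beta_l = p_{ij}\beta_j - p_{ij}\sum_l p_{il}\beta_l = p_{ij}(\beta_j - \bar{\beta}_i),$$

where $\bar{\beta}_i = \sum_l p_{il}\beta_l$, as stated in (15.19).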
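As a hedged illustration of the score and marginal-effect formulas for the multinomial logit model, the sketch below evaluates $(y_{ik} - p_{ik})x_i$ and $p_{ij}(\beta_j - \bar{\beta}_i)$ for one observation. The function names and the numerical values of `x` and `betas` are assumptions made for the example (the first alternative's coefficients are normalized to zero).

```python
import math

def mnl_probs(x, betas):
    """Multinomial logit probabilities p_j = exp(x'b_j) / sum_l exp(x'b_l)."""
    v = [sum(xr * br for xr, br in zip(x, b)) for b in betas]
    vmax = max(v)  # stabilize the exponentials
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def mnl_score_contribution(x, y, betas):
    """One observation's contribution (y_k - p_k) x to the score for each beta_k."""
    p = mnl_probs(x, betas)
    return [[(y[k] - p[k]) * xr for xr in x] for k in range(len(betas))]

def mnl_marginal_effects(x, betas):
    """d p_j / d x = p_j (beta_j - beta_bar), with beta_bar = sum_l p_l beta_l."""
    p = mnl_probs(x, betas)
    m, dim = len(betas), len(x)
    beta_bar = [sum(p[l] * betas[l][r] for l in range(m)) for r in range(dim)]
    return [[p[j] * (betas[j][r] - beta_bar[r]) for r in range(dim)]
            for j in range(m)]

# Hypothetical observation: intercept plus one regressor, three alternatives.
x = [1.0, 2.5]
betas = [[0.0, 0.0], [0.5, -0.2], [-1.0, 0.3]]
y = [0, 1, 0]  # alternative 2 was chosen
score = mnl_score_contribution(x, y, betas)
effects = mnl_marginal_effects(x, betas)
```

Because the probabilities sum to one, the marginal effects for each regressor sum to zero across alternatives, which is a quick sanity check on any implementation.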
15.12.3. Nested Logit

We consider the two-level GEV model given by (15.32) and (15.33) with

$$G(Y) = G(Y_{11}, \ldots, Y_{1K_1}, \ldots, Y_{J1}, \ldots, Y_{JK_J}) = \sum_{j=1}^{J} a_j \Bigl(\sum_{k=1}^{K_j} Y_{jk}^{1/\rho_j}\Bigr)^{\rho_j},$$

which is a generalization of (15.34) owing to the coefficients $a_j$. The general GEV result (15.31) becomes $\Pr[y_{jk} = 1] = Y_{jk}G_{jk}/G(Y)$, where $G_{jk}$ is the derivative of $G(Y)$ with respect to $Y_{jk}$ and evaluation is at $Y_{jk} = e^{V_{jk}}$.

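Now

$$G_{jk} = \frac{\partial G(Y)}{\partial Y_{jk}} = a_j \Bigl(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\Bigr)^{\rho_j - 1} Y_{jk}^{1/\rho_j - 1},$$

which gives

$$Y_{jk}G_{jk} = a_j \Bigl(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\Bigr)^{\rho_j - 1} Y_{jk}^{1/\rho_j}$$

and

$$p_{jk} = \frac{Y_{jk}G_{jk}}{G(Y)} = \frac{a_j \bigl(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\bigr)^{\rho_j - 1} Y_{jk}^{1/\rho_j}}{\sum_{m=1}^{J} a_m \bigl(\sum_{l=1}^{K_m} Y_{ml}^{1/\rho_m}\bigr)^{\rho_m}}.$$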

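The probability of choosing limb $j$ is

$$p_j = \sum_{k=1}^{K_j} p_{jk} = \frac{a_j \bigl(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\bigr)^{\rho_j}}{\sum_{m=1}^{J} a_m \bigl(\sum_{l=1}^{K_m} Y_{ml}^{1/\rho_m}\bigr)^{\rho_m}}$$

after some simplification, and the conditional probability of choosing branch $k$ given limb $j$ is

$$p_{k|j} = \frac{p_{jk}}{p_j} = \frac{Y_{jk}^{1/\rho_j}}{\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}}.$$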

This  result  is  also  given  in  Maddala  (1983,  p.  72). 

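We need to evaluate these expressions at $Y_{jk} = \exp(V_{jk})$. Suppose

$$V_{jk} = z_j'\alpha + x_{jk}'\beta_j.$$

Then performing some algebra yields

$$\begin{aligned}
\bigl(e^{V_{jk}}\bigr)^{1/\rho_j} &= \exp(z_j'\alpha/\rho_j)\exp(x_{jk}'\beta_j/\rho_j), \\
\sum_{k=1}^{K_j} \bigl(e^{V_{jk}}\bigr)^{1/\rho_j} &= \exp(z_j'\alpha/\rho_j)\exp(I_j), \\
\Bigl(\sum_{k=1}^{K_j} \bigl(e^{V_{jk}}\bigr)^{1/\rho_j}\Bigr)^{\rho_j} &= \exp(z_j'\alpha + \rho_j I_j),
\end{aligned}$$

where

$$I_j = \ln\Bigl(\sum_{k=1}^{K_j} \exp(x_{jk}'\beta_j/\rho_j)\Bigr).$$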

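It follows that the probability of choosing limb $j$ becomes

$$p_j = \frac{a_j \bigl(\sum_{k=1}^{K_j} (e^{V_{jk}})^{1/\rho_j}\bigr)^{\rho_j}}{\sum_{m=1}^{J} a_m \bigl(\sum_{l=1}^{K_m} (e^{V_{ml}})^{1/\rho_m}\bigr)^{\rho_m}} = \frac{a_j \exp(z_j'\alpha + \rho_j I_j)}{\sum_{m=1}^{J} a_m \exp(z_m'\alpha + \rho_m I_m)},$$

as stated for the first term in (15.36). Note that the scalar $a_j$ can be absorbed into $z_j$ as a limb-specific dummy, as $a_j \exp(z_j'\alpha + \rho_j I_j) = \exp(\ln a_j + z_j'\alpha + \rho_j I_j)$. Without loss of generality we therefore set $a_j = 1$.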

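The probability of branch $k$ within limb $j$ is

$$\begin{aligned}
p_{k|j} &= \frac{(e^{V_{jk}})^{1/\rho_j}}{\sum_{l=1}^{K_j} (e^{V_{jl}})^{1/\rho_j}} \\
&= \frac{\exp(z_j'\alpha/\rho_j)\exp(x_{jk}'\beta_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(z_j'\alpha/\rho_j)\exp(x_{jl}'\beta_j/\rho_j)} \\
&= \frac{\exp(x_{jk}'\beta_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(x_{jl}'\beta_j/\rho_j)},
\end{aligned}$$

as stated for the second term in (15.36).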
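The algebra above says that the GEV expression $Y_{jk}G_{jk}/G(Y)$ factors into a limb probability times a within-limb branch probability. A minimal sketch of that equivalence follows, with $a_j = 1$ as in the text. The function names and the numerical inputs (`z_alpha` for $z_j'\alpha$, `xb` for $x_{jk}'\beta_j$, `rho` for $\rho_j$) are assumptions chosen for illustration.

```python
import math

def nested_logit_probs(z_alpha, xb, rho):
    """p_jk = p_j * p_{k|j}, using the inclusive value I_j."""
    n_limbs = len(xb)
    incl = [math.log(sum(math.exp(v / rho[j]) for v in xb[j]))
            for j in range(n_limbs)]
    limb_num = [math.exp(z_alpha[j] + rho[j] * incl[j]) for j in range(n_limbs)]
    limb_den = sum(limb_num)
    probs = []
    for j in range(n_limbs):
        p_limb = limb_num[j] / limb_den
        branch_den = sum(math.exp(v / rho[j]) for v in xb[j])
        probs.append([p_limb * math.exp(v / rho[j]) / branch_den for v in xb[j]])
    return probs

def gev_probs(z_alpha, xb, rho):
    """Direct evaluation of Y_jk G_jk / G(Y) at Y_jk = exp(V_jk), with a_j = 1."""
    n_limbs = len(xb)
    Y = [[math.exp(z_alpha[j] + v) for v in xb[j]] for j in range(n_limbs)]
    inner = [sum(y ** (1.0 / rho[j]) for y in Y[j]) for j in range(n_limbs)]
    G = sum(inner[j] ** rho[j] for j in range(n_limbs))
    return [[inner[j] ** (rho[j] - 1.0) * y ** (1.0 / rho[j]) / G for y in Y[j]]
            for j in range(n_limbs)]

# Hypothetical example: two limbs with two and three branches.
z_alpha = [0.2, -0.1]
xb = [[0.5, 1.0], [0.3, -0.4, 0.8]]
rho = [0.6, 0.9]
p_factored = nested_logit_probs(z_alpha, xb, rho)
p_direct = gev_probs(z_alpha, xb, rho)
```

Both routes should return the same probabilities, and these sum to one over all limb-branch pairs.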

15.13.  Practical  Considerations 

The multinomial logit model is adequate for describing data or estimating the marginal probabilities but is viewed as a poor model if a more structural interpretation of the parameters is required, owing to the independence of irrelevant alternatives assumption. Many packages estimate the multinomial logit model.

The nested logit model can be estimated in STATA and by using the NLOGIT add-on to LIMDEP, and it is easy to code in a language such as GAUSS. It is the obvious model to use if there is an obvious nesting structure, but usually there is no obvious structure.

The random parameters logit model requires special code in a language such as GAUSS and requires use of the simulation-based estimation methods given in Chapter 12. Ken Train provides code at his Web site elsa.berkeley.edu/~train.

The multinomial probit model is even more challenging to estimate, for more than four choices, and has met with relatively little empirical success. For these reasons the random parameters logit model is currently preferred.
15.14. Bibliographic Notes

15.3 Good basic references for multinomial models include Amemiya (1981, 1985), Maddala (1983), and Greene (2003). The books by Ben-Akiva and Lerman (1985), Train (1986), and Börsch-Supan (1987) provide extensive applications as well as a review of theory. Train (2003) presents an outstanding treatment of unordered multinomial models and of estimation using simulation methods.

15.5 The seminal article by McFadden (1981) provides an advanced treatment of discrete choice modeling, emphasizing the random utility model approach. For welfare analysis see Small and Rosen (1981), Train (2003, pp. 59-61), and Dagsvik and Karlström (2004).

15.6 Börsch-Supan (1987) gives an excellent exposition and application of the nested logit model.

15.7 The random parameters logit model and other recent advances are well covered in Train (2003). Revelt and Train (1998) provide an early application.

15.8 Bolduc (1999) presents MSL estimation of a nine-choice multinomial probit model.


- Exercises - 

15-1 Consider a latent variable modeled by $y^* = x'\beta + \varepsilon$, with $\varepsilon \sim \mathcal{N}[0, 1]$. Suppose we observe only $y = 2$ if $y^* < \alpha$, $y = 1$ if $\alpha < y^* < U$, and $y = 0$ if $y^* > U$, where the upper limit $U$ is a known constant for each individual (i.e., data) and may differ over individuals, but $\alpha$ is unknown.

(a) Obtain the conditional probabilities that $y = 0$, $y = 1$, and $y = 2$.

(b) Provide details on a method to consistently estimate $\beta$ and $\alpha$.

15-2 Use a 50% subsample of the fishing mode choice data of Section 15.2.

(a) Estimate the conditional logit model of Section 15.2.1.

(b) Comment on the statistical significance of parameter estimates.

(c) What is the effect of an increase in price on the various modes of fishing?

15-3 Use a 50% subsample of the fishing mode choice data of Section 15.2.

(a) Estimate the multinomial logit model of Section 15.2.2.

(b) Comment on the statistical significance of parameter estimates.

(c) What is the effect of an increase in income on the various modes of fishing?

15-4 Use a 50% subsample of the fishing mode choice data of Section 15.2. Suppose we collapse the model to three alternatives and order the alternatives, with $y = 0$ if fishing from a pier or beach, $y = 1$ if fishing from a private boat, and $y = 2$ if fishing from a charter boat.

(a) Estimate an ordered logit model with income as the only regressor.

(b) Provide an interpretation of the estimated coefficient.

(c) Compare the fit of this model with that from a three-choice multinomial model with income as the regressor.
16.1. Introduction

In this chapter we consider two closely related topics: regression when the dependent variable of interest is incompletely observed and regression when the dependent variable is completely observed but is observed in a selected sample that is not representative of the population. This includes limited dependent variable models, latent variable models, generalized Tobit models, and selection models.

All these models share the common feature that even in the simplest case of population conditional mean linear in regressors, OLS regression leads to inconsistent parameter estimates because the sample is not representative of the population. Alternative estimation procedures, most relying on strong distributional assumptions, are necessary to ensure consistent parameter estimation.

Leading causes of incompletely observed data are truncation and censoring. For truncated data some observations on both the dependent variable and regressors are lost. For example, income may be the dependent variable and only low-income people are included in the sample. For censored data information on the dependent variable is lost, but not data on the regressors. For example, people of all income levels may be included in the sample, but for confidentiality reasons the income of high-income people may be top-coded and reported only as exceeding, say, $100,000 per year. Truncation entails greater information loss than does censoring. A leading example of truncation and censoring is the Tobit model, named after Tobin (1958), who considered linear regression under normality. Similar issues arise for truncation and censoring in other models introduced in later chapters, most notably for censored duration data presented in Chapter 17. More generally, truncation and censoring are examples of missing data problems that are studied in Chapter 27.

The first-generation estimation methods require strong distributional assumptions. Even seemingly minor departures from assumptions, such as heteroskedastic errors when homoskedastic errors are assumed, can lead to inconsistent parameter estimates. For this reason the models presented in this chapter provide a leading econometrics application of semiparametric regression methods. Semiparametric methods for simple forms of censoring and truncation such as top-coding have been successfully applied. However, for more general models with selection on unobservables there is to date no widely accepted procedure.

Section 16.2 presents general theory for censored and truncated nonlinear regression models, with specialization to the Tobit model given in Section 16.3. An alternative model for censored data, the two-part model, is introduced in Section 16.4. The sample selection model is presented in Section 16.5. An application to health expenditures in Section 16.6 contrasts the two-part and sample selection models. The Roy model for unobserved counterfactuals is presented in Section 16.7. Section 16.8 considers fully structural models obtained by utility maximization with corner solutions or by extension of simultaneous equation models to selected samples. Semiparametric estimation is presented in Section 16.9.

16.2.  Censored  and  Truncated  Models 

We  present  general  methods  for  estimation  of  fully  parametric  models  when  data  are 
censored  or  truncated.  These  methods  can  be  applied  to  models  presented  in  later 
chapters  such  as  count  and  duration  models.  The  leading  example,  the  Tobit  model 
for  censoring  or  truncation  in  linear  models,  is  introduced  in  Section  16.2.1  and  given 
separate  treatment  in  Section  16.3. 

16.2.1.  Censoring  and  Truncation  Example 

Let  y*  denote  a  variable  that  is  incompletely  observed.  For  truncation  from  below,  y* 
is  only  observed  if  y*  exceeds  a  threshold.  For  simplicity,  let  that  threshold  be  zero. 
Then  we  observe  y  =  y*  if  y*  >  0.  Since  negative  values  do  not  appear  in  the  sample, 
the  truncated  mean  exceeds  the  mean  of  y*.  For  censoring  from  below  at  zero,  y*  is 
not  completely  observed  when  y*  <  0,  but  it  is  known  that  y*  <  0  and  for  simplicity 
y  is  then  set  to  0.  Since  negative  values  are  scaled  up  to  zero,  the  censored  mean 
also  exceeds  the  mean  of  y*.  Clearly,  sample  means  in  truncated  or  censored  samples 
cannot  be  used  without  adjustment  to  estimate  the  original  population  mean. 

This chapter studies similar issues for regression models. With luck, truncation and censoring might lead only to a shift up or down in the intercept, leaving slope coefficients unchanged; however, this is not the case. For example, if $E[y^*|x] = x'\beta$ in the original model then truncation or censoring leads to $E[y|x]$ being nonlinear in $x$ and $\beta$ so that OLS gives inconsistent estimates of $\beta$ and hence inconsistent estimates of marginal effects.

As an illustration we consider the following labor supply example with simulated data. The relationship between desired annual hours worked, $y^*$, and hourly wage, $w$, is specified to be of linear-log form with data-generation process

$$\begin{aligned}
y^* &= -2500 + 1000 \ln w + \varepsilon, \\
\varepsilon &\sim \mathcal{N}[0, 1000^2], \\
\ln w &\sim \mathcal{N}[2.75, 0.60^2].
\end{aligned} \tag{16.1}$$
[Figure panel title: Tobit: Censored and Truncated Means]

Figure 16.1: Tobit regression of hours on log wage: uncensored conditional mean (bottom), censored conditional mean (middle), and truncated conditional mean (top) for censoring/truncation from below at zero hours. Data are generated from a classical linear regression model.


This  is  a  Tobit  model,  studied  in  detail  in  Section  16.3.  The  model  implies  that  the 
wage  elasticity  is  1000/y*,  which  equals,  for  example,  0.5  for  full-time  work  (2,000 
hours).  For  each  1%  increase  in  wage,  annual  hours  increase  by  10  hours. 

Figure 16.1 presents a scatter plot of $y^*$ and $\ln w$ for a generated sample of 200 observations. The uncensored conditional mean of $y^*$, which is $-2500 + 1000 \ln w$, is given by the lowest curve, which is a straight line.

With  censoring  at  zero,  negative  values  of  y*  are  set  to  zero  because  people  with 
negative  desired  hours  of  work  choose  not  to  work.  For  this  particular  sample  this 
is  the  case  for  about  35%  of  the  observations.  This  pushes  up  the  mean  for  low 
wages,  since  the  many  negative  values  of  the  y*  are  shifted  up  to  zero.  It  has  little 
impact  for  high  wages,  since  then  few  observations  on  y*  are  zero.  The  middle  curve 
in  Figure  16.1  gives  the  resulting  censored  mean,  using  the  formula  given  later  in 
(16.23). 

With  truncation  at  zero  the  35%  of  the  population  with  negative  values  of  y*  are 
dropped  altogether.  This  increases  the  mean  above  the  censored  mean,  since  zero 
values  are  no  longer  included  in  the  data  used  to  form  the  mean.  The  upper  curve 
in  Figure  16.1  gives  the  resulting  truncated  mean,  using  the  formula  given  later  in 
(16.23). 

It is clear that censored and truncated conditional means are nonlinear in $x$ even if the underlying population mean is linear. OLS estimation using truncated or censored data will lead to inconsistent estimation of the slope parameter, since by visual inspection of Figure 16.1 a linear approximation to the nonlinear truncated and censored means will have flatter slope than that for the original untruncated mean. Analysis should instead be based on the formulas for the censored or truncated conditional mean. Unfortunately these are based on strong distributional assumptions, as we will see.
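The flattening of the OLS slope can be reproduced by simulating the data-generation process (16.1). The sketch below is illustrative only: it uses 20,000 draws rather than the 200 observations plotted in Figure 16.1, the seed is arbitrary, and the helper `ols_slope` is a hypothetical name for a simple bivariate least-squares slope.

```python
import random

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x with an intercept."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(12345)  # arbitrary seed, for reproducibility only
N = 20000           # assumption: larger than the 200 observations in the figure

# Latent desired hours from (16.1).
lnw = [random.gauss(2.75, 0.60) for _ in range(N)]
ystar = [-2500.0 + 1000.0 * v + random.gauss(0.0, 1000.0) for v in lnw]

# Censoring from below at zero: negative desired hours are recorded as zero.
y_cens = [max(v, 0.0) for v in ystar]

# Truncation from below at zero: observations with y* <= 0 are dropped entirely.
kept = [(v, y) for v, y in zip(lnw, ystar) if y > 0.0]
lnw_trunc = [v for v, _ in kept]
y_trunc = [y for _, y in kept]

mean_latent = sum(ystar) / N
mean_cens = sum(y_cens) / N
mean_trunc = sum(y_trunc) / len(y_trunc)

slope_latent = ols_slope(lnw, ystar)        # close to the true value 1000
slope_cens = ols_slope(lnw, y_cens)         # attenuated toward zero
slope_trunc = ols_slope(lnw_trunc, y_trunc) # attenuated toward zero
```

In such a simulation the sample means are ordered as latent, censored, truncated, and both the censored and truncated OLS slopes fall well below the slope of 1000 used to generate the data.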
16.2.2. Censoring and Truncation Mechanisms

As is customary for regression analysis, we let $y$ denote the observed value of the dependent variable. The departure from usual analysis is that $y$ is the incompletely observed value of a latent dependent variable $y^*$, where the observation rule is

$$y = g(y^*),$$

for some specified function $g(\cdot)$. Leading examples of $g(\cdot)$ immediately follow.


Censoring 


With censoring we always observe the regressors $x$, completely observe $y^*$ for a subset of the possible values of $y^*$, and incompletely observe $y^*$ for the remaining possible values of $y^*$. If censoring is from below (or from the left), we observe

$$y = \begin{cases} y^* & \text{if } y^* > L, \\ L & \text{if } y^* \le L. \end{cases} \tag{16.2}$$

For example, all consumers may be sampled with some having positive durable goods expenditures ($y^* > 0$) and others having zero expenditures ($y^* \le 0$). If censoring is from above (or from the right) we observe

$$y = \begin{cases} y^* & \text{if } y^* < U, \\ U & \text{if } y^* \ge U. \end{cases} \tag{16.3}$$


For  example,  annual  income  data  may  be  top-coded  at  U  =  $100,000.  This  form  of 
censoring  is  called  type  1  censoring  in  the  duration  literature  (see  Section  17.4.1). 

The  incompletely  observed  observations  on  y*  are  set  to  L  or  U  for  simplicity. 
More  generally,  we  require  that  for  incompletely  observed  observations  y*  is  known 
to  be  missing  (i.e.,  we  observe  that  y*  lies  outside  the  relevant  bound)  and  regressors 
x  continue  to  be  completely  observed. 


Truncation 

Truncation entails additional information loss as all data on observations at the bound are lost. With truncation from below we observe only

$$y = y^* \quad \text{if } y^* > L. \tag{16.4}$$

For example, only consumers who purchased durable goods may be sampled ($L = 0$). With truncation from above we observe only

$$y = y^* \quad \text{if } y^* < U. \tag{16.5}$$

For example, only low-income individuals may be sampled.


Interval  Data 

Interval data are data recorded in intervals. Survey data are often collected in this way to aid recall and to provide some greater anonymity in responses to more personal questions. For example, income may be reported in intervals of $10,000 and then top-coded at $100,000. Such data are censored at multiple points, with the observed data $y$ being the particular interval in which the unobserved $y^*$ lies.
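The observation rules $g(\cdot)$ described in this section can be written as small functions. This is a sketch with hypothetical function names; the interval rule assumes cells of the form $(a_j, a_{j+1}]$ with known cut points, which matches the interval notation used later for the interval data MLE.

```python
import bisect

def censor_below(ystar, L=0.0):
    """Rule (16.2): report y* if it exceeds L, otherwise report the bound L."""
    return ystar if ystar > L else L

def censor_above(ystar, U):
    """Rule (16.3): report y* if it is below U, otherwise report the bound U."""
    return ystar if ystar < U else U

def truncate_below(sample, L=0.0):
    """Rule (16.4): observations with y* <= L never enter the sample."""
    return [v for v in sample if v > L]

def truncate_above(sample, U):
    """Rule (16.5): observations with y* >= U never enter the sample."""
    return [v for v in sample if v < U]

def interval_index(ystar, cuts):
    """Index j of the cell (a_j, a_{j+1}] containing y*, where cuts = [a_1, ..., a_J]
    is sorted and cell 0 is (-inf, a_1]."""
    return bisect.bisect_left(cuts, ystar)

# Hypothetical incomes, top-coded at 100,000 and grouped in 10,000 bands.
incomes = [8500.0, 42000.0, 250000.0]
top_coded = [censor_above(v, 100000.0) for v in incomes]
cuts = [10000.0 * j for j in range(1, 11)]
cells = [interval_index(v, cuts) for v in incomes]
```

Note that censoring keeps every observation (and its regressors) in the sample, whereas truncation shortens the sample itself.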


16.2.3.  Censored  and  Truncated  MLE 

Censoring and truncation are easily dealt with if the researcher applies a fully parametric approach. This may be the case with interval data or top-coded data where, for example, it may be reasonable to assume a log-normal distribution for earnings or a negative binomial model for number of doctor visits.

If the conditional distribution of $y^*$ given regressors $x$ is specified, then the parameters of this distribution can be consistently and efficiently estimated by ML estimation based on the conditional distribution of the censored or truncated $y$. Specifically, let $f^*(y^*|x)$ and $F^*(y^*|x)$ denote the conditional probability density function (or probability mass function) and cumulative distribution function of the latent variable $y^*$. Then one can always obtain $f(y|x)$ and $F(y|x)$, the corresponding conditional pdf and cdf of the observed dependent variable $y$, since $y = g(y^*)$ is a transformation of $y^*$.

The  limitation  of  the  parametric  approach  is  its  reliance  on  strong  distributional 
assumptions.  For  example,  for  the  linear  regression  model  under  normality  the  MLE 
remains  consistent  even  if  the  errors  are  nonnormal,  but  the  censored  MLE  becomes 
inconsistent  if  the  errors  are  nonnormal  (see  Section  16.3.2).  More  flexible  models  and 
semiparametric  methods  are  presented  in  later  sections. 


Censored  MLE 


Censoring and truncation change both the conditional mean and the conditional density. We begin with the density.

Consider ML estimation given censoring from below. For $y > L$ the density of $y$ is the same as that for $y^*$, so $f(y|x) = f^*(y|x)$. For $y = L$, the lower bound, the density is discrete with mass equal to the probability of observing $y^* \le L$, or $F^*(L|x)$. Thus for censoring from below

$$f(y|x) = \begin{cases} f^*(y|x) & \text{if } y > L, \\ F^*(L|x) & \text{if } y = L. \end{cases}$$

As mentioned after (16.3), setting $y = L$ when $y^* \le L$ is not necessary. Even if no value of $y$ is observed when $y^* \le L$ the density is still $F^*(L|x)$.

The  density  is  a  hybrid  of  the  pdf  and  cdf  of  y*.  Similar  to  analysis  for  binary 
outcome  models,  it  is  notationally  convenient  to  introduce  an  indicator  variable 


$$d = \begin{cases} 1 & \text{if } y > L, \\ 0 & \text{if } y = L. \end{cases} \tag{16.6}$$

Then the conditional density given censoring from below can be written as

$$f(y|x) = f^*(y|x)^{d}\, F^*(L|x)^{1-d}. \tag{16.7}$$
For a sample of $N$ independent observations, the censored MLE maximizes

$$\ln L_N(\theta) = \sum_{i=1}^{N} \bigl\{ d_i \ln f^*(y_i|x_i, \theta) + (1 - d_i) \ln F^*(L_i|x_i, \theta) \bigr\}, \tag{16.8}$$

where $\theta$ are the parameters of the distribution of $y^*$. For generality the censoring lower bound $L_i$ is permitted to vary across individuals, though usually $L_i = L$. The censored MLE is consistent and asymptotically normal, provided the original density of the uncensored variable $f^*(y^*|x, \theta)$ is correctly specified.

When censoring is instead from above, the log-likelihood is similar to (16.8), except now $d = 1$ if $y < U$ and $d = 0$ otherwise, and $F^*(L|x, \theta)$ is replaced by $1 - F^*(U|x, \theta)$. A leading example is right-censored duration data (see Section 17.4).


Truncated  MLE 

For truncation from below at $L$, and suppressing dependence on $x$, the conditional density of the observed $y$ is

$$\begin{aligned}
f(y) &= f^*(y \mid y > L) \\
&= f^*(y) / \Pr[y^* > L] \\
&= f^*(y) / [1 - F^*(L)].
\end{aligned}$$

The truncated MLE therefore maximizes

$$\ln L_N(\theta) = \sum_{i=1}^{N} \bigl\{ \ln f^*(y_i|x_i, \theta) - \ln[1 - F^*(L_i|x_i, \theta)] \bigr\}. \tag{16.9}$$

If instead truncation is from above, the log-likelihood is (16.9), except that $1 - F^*(L|x, \theta)$ is replaced by $F^*(U|x, \theta)$.

Ignoring censoring or truncation leads to inconsistency. For example, if truncation is ignored the MLE maximizes $\sum_i \ln f^*(y_i|x_i, \theta)$, which is the wrong likelihood function as it drops the second term in (16.9). Consistency of the censored and truncated MLE requires correct specification of $f(\cdot)$, which in turn requires correct specification of the latent variable density $f^*(\cdot)$. Even if $f^*(\cdot)$ is an LEF density (see Section 5.7.3), the density, and not just the mean, must be correctly specified if censoring or truncation is present.
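The log-likelihoods (16.8) and (16.9) depend on the latent model only through its log-density and cdf, so they can be coded once and reused. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the latent model is passed in as two callables, and the normal example with a single regressor and no intercept is chosen only to keep the illustration short.

```python
import math

def censored_loglik(theta, y, x, L, logpdf, cdf):
    """Censored log-likelihood (16.8) for censoring from below at L_i."""
    total = 0.0
    for yi, xi, Li in zip(y, x, L):
        if yi > Li:   # d_i = 1: y* is fully observed
            total += logpdf(yi, xi, theta)
        else:         # d_i = 0: only y* <= L_i is known
            total += math.log(cdf(Li, xi, theta))
    return total

def truncated_loglik(theta, y, x, L, logpdf, cdf):
    """Truncated log-likelihood (16.9) for truncation from below at L_i."""
    return sum(logpdf(yi, xi, theta) - math.log(1.0 - cdf(Li, xi, theta))
               for yi, xi, Li in zip(y, x, L))

# Illustrative latent model (an assumption): y* ~ N[slope * x, sigma^2].
def normal_logpdf(y, x, theta):
    slope, sigma = theta
    z = (y - slope * x) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def normal_cdf(y, x, theta):
    slope, sigma = theta
    return 0.5 * (1.0 + math.erf((y - slope * x) / (sigma * math.sqrt(2.0))))

theta = (1.0, 1.0)
ll_cens = censored_loglik(theta, [0.0, 1.5], [0.0, 1.0], [0.0, 0.0],
                          normal_logpdf, normal_cdf)
ll_trunc = truncated_loglik(theta, [1.5], [1.0], [0.0],
                            normal_logpdf, normal_cdf)
```

Either function could be handed to a numerical optimizer; swapping in a different `logpdf` and `cdf` pair changes the latent model without touching the censoring or truncation logic.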


Interval  Data  MLE 

Suppose the latent variable $y^*$ is only observed to lie in the $(J + 1)$ mutually exclusive intervals $(-\infty, a_1], (a_1, a_2], \ldots, (a_J, \infty)$, where $a_1, a_2, \ldots, a_J$ are known. Then since

$$\begin{aligned}
\Pr[a_j < y^* \le a_{j+1}] &= \Pr[y^* \le a_{j+1}] - \Pr[y^* \le a_j] \\
&= F^*(a_{j+1}) - F^*(a_j),
\end{aligned}$$

the interval data MLE maximizes

$$\ln L_N(\theta) = \sum_{i=1}^{N} \sum_{j=0}^{J} d_{ij} \ln\bigl[F^*(a_{j+1}|x_i, \theta) - F^*(a_j|x_i, \theta)\bigr], \tag{16.10}$$

where the $d_{ij}$, $j = 0, \ldots, J$, are binary indicators equal to one if $y_i \in (a_j, a_{j+1}]$ and zero otherwise, with the conventions $a_0 = -\infty$ and $a_{J+1} = \infty$. This is similar to an ordered probit or logit model (see Section 15.9.1), except here the interval boundaries $a_1, \ldots, a_J$ are known.
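A hedged sketch of the interval data log-likelihood (16.10) follows, assuming for illustration a normal latent variable with mean `beta * x` for a scalar regressor and standard deviation `sigma`. These modeling choices and all names are assumptions made for the example, not part of the text.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, with the infinite end points handled explicitly."""
    if z == math.inf:
        return 1.0
    if z == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_loglik(beta, sigma, cells, x, cuts):
    """Log-likelihood (16.10): cells[i] = j means y*_i lies in (a_j, a_{j+1}],
    where cuts = [a_1, ..., a_J], a_0 = -inf and a_{J+1} = +inf."""
    bounds = [-math.inf] + list(cuts) + [math.inf]
    total = 0.0
    for j, xi in zip(cells, x):
        mean = beta * xi
        upper = std_normal_cdf((bounds[j + 1] - mean) / sigma)
        lower = std_normal_cdf((bounds[j] - mean) / sigma)
        total += math.log(upper - lower)
    return total

# Hypothetical data: three known cut points, hence four cells.
cuts = [10.0, 20.0, 30.0]
cells = [0, 1, 1, 3]
x = [3.0, 8.0, 9.0, 17.0]
ll = interval_loglik(2.0, 5.0, cells, x, cuts)
```

Unlike an ordered probit, the cut points here are data, so only `beta` and `sigma` would be estimated, and `sigma` is identified because the scale of the boundaries is known.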


16.2.4.  Poisson  Censored  and  Truncated  MLE  Example 

Assume that $y^*$ is Poisson distributed, so that $f^*(y) = e^{-\mu}\mu^{y}/y!$ and $\ln f^*(y) = -\mu + y \ln \mu - \ln y!$, with mean $\mu = \exp(x'\beta)$.

Suppose the number of visits to a health clinic is modeled, but data are only available for people who visited the health clinic. Then the data are truncated from below at zero and we only observe $y = y^*$ if $y^* > 0$. Then $F^*(0) = \Pr[y^* \le 0] = \Pr[y^* = 0] = e^{-\mu}$, and from (16.9) the truncated MLE for $\beta$ maximizes

$$\ln L_N(\beta) = \sum_{i=1}^{N} \bigl\{ -\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i! - \ln[1 - \exp(-\exp(x_i'\beta))] \bigr\}.$$


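Suppose instead that data are censored from above at 10 because of top-coding, so that we observe $y = y^*$ if $y^* < 10$ and that $y = 10$ if $y^* \ge 10$. Then $\Pr[y^* \ge 10] = 1 - \Pr[y^* < 10] = 1 - \sum_{k=0}^{9} f^*(k)$. From (16.8) the censored MLE for $\beta$ maximizes

$$\ln L_N(\beta) = \sum_{i=1}^{N} \Bigl\{ d_i \bigl[-\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!\bigr] + (1 - d_i) \ln\Bigl[1 - \sum_{k=0}^{9} e^{-\exp(x_i'\beta)} \bigl(\exp(x_i'\beta)\bigr)^{k} / k!\Bigr] \Bigr\},$$

where $d_i = 1$ if $y_i < 10$ and $d_i = 0$ otherwise.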


In both cases the resulting first-order conditions are considerably more complicated than those for the Poisson MLE without truncation or censoring. Also, in both cases ignoring the truncation or censoring and maximizing the original density leads to inconsistent parameter estimates.
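The two Poisson log-likelihoods above translate directly into code. This is a minimal sketch with hypothetical function names; `xb` holds the values of $x_i'\beta$, and the top-code of 10 is exposed as the parameter `cap` purely for illustration.

```python
import math

def poisson_logpmf(k, mu):
    """ln f*(k) = -mu + k ln(mu) - ln(k!)."""
    return -mu + k * math.log(mu) - math.lgamma(k + 1)

def truncated_poisson_loglik(y, xb):
    """Zero-truncated Poisson log-likelihood; every y_i must be at least 1."""
    total = 0.0
    for yi, v in zip(y, xb):
        mu = math.exp(v)
        total += poisson_logpmf(yi, mu) - math.log(1.0 - math.exp(-mu))
    return total

def censored_poisson_loglik(y, xb, cap=10):
    """Poisson log-likelihood with censoring from above: y_i = cap means y* >= cap."""
    total = 0.0
    for yi, v in zip(y, xb):
        mu = math.exp(v)
        if yi < cap:  # d_i = 1: the count is fully observed
            total += poisson_logpmf(yi, mu)
        else:         # d_i = 0: only y* >= cap is known
            below = sum(math.exp(poisson_logpmf(k, mu)) for k in range(cap))
            total += math.log(1.0 - below)
    return total

# Hypothetical values of x_i'beta and observed counts.
xb = [0.3, 1.1, 2.0]
ll_trunc = truncated_poisson_loglik([1, 2, 6], xb)
ll_cens = censored_poisson_loglik([0, 3, 10], xb)
```

A useful check on either function is that the implied probabilities sum to one over the support of the observed variable (1, 2, ... under truncation; 0, ..., 10 under top-coding).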


16.2.5.  Censored  and  Truncated  Conditional  Means 

Censoring and truncation change the conditional mean.

For example, consider the Poisson truncated from below at zero. The truncated density is $f^*(y)/[1 - F^*(0)]$, $y = 1, 2, \ldots$, so the truncated mean is $\sum_{k=1}^{\infty} k f^*(k)/[1 - F^*(0)] = \sum_{k=0}^{\infty} k f^*(k)/[1 - F^*(0)] = \mu/(1 - e^{-\mu})$. Thus

$$E[y|x] = \exp(x'\beta) / [1 - \exp(-\exp(x'\beta))],$$

rather than $\exp(x'\beta)$ if there were no truncation.

This  expression  for  E[y|x]  can  be  used  for  NLS  estimation.  There  is  little  advantage 
to  NLS  rather  than  ML  estimation,  however,  as  given  truncation  the  NLS  estimator 
relies  on  distributional  assumptions  that  are  essentially  as  strong  as  those  needed  for 
consistency  of  the  more  efficient  ML  estimator. 
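The closed form $\mu/(1 - e^{-\mu})$ for the zero-truncated Poisson mean can be confirmed by direct summation. The sketch below uses hypothetical function names and an arbitrary summation cutoff `kmax`, which is an assumption that is adequate only for moderate values of $\mu$.

```python
import math

def truncated_poisson_mean(mu):
    """Closed-form mean of a Poisson truncated from below at zero."""
    return mu / (1.0 - math.exp(-mu))

def truncated_mean_by_summation(mu, kmax=200):
    """Same mean computed as sum_k k f*(k) / [1 - F*(0)] over k = 1, ..., kmax."""
    def pmf(k):
        return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    return sum(k * pmf(k) for k in range(1, kmax + 1)) / (1.0 - pmf(0))

# Compare the two calculations for a few illustrative values of mu = exp(x'beta).
comparisons = [(mu, truncated_poisson_mean(mu), truncated_mean_by_summation(mu))
               for mu in (0.5, 2.0, 7.0)]
```

The truncated mean always exceeds $\mu$, and the gap is largest when $\mu$ is small, which is when zeros would have been most common.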
16.3. Tobit Model

Truncation  and  censoring  arise  most  often  in  econometrics  in  the  linear  regression 
model  with  normally  distributed  error,  when  only  positive  outcomes  are  completely 
observed.  This  model  is  called  the  Tobit  model  after  Tobin  (1958),  who  applied  it  to 
individual  expenditures  on  consumer  durable  goods.  The  model  in  practice  is  usually 
too  restrictive.  It  is  nonetheless  presented  in  some  detail,  as  it  provides  the  basis  for 
more  general  models  presented  in  subsequent  sections  of  this  chapter. 


16.3.1.  Tobit  Model 

The  censored  normal  regression  model,  or  Tobit  model,  is  one  with  censoring  from 
below  at  zero  where  the  latent  variable  is  linear  in  regressors  with  additive  error  that  is 
normally  distributed  and  homoskedastic.  Thus 

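$$y^* = x'\beta + \varepsilon, \tag{16.11}$$

where the error term

$$\varepsilon \sim \mathcal{N}[0, \sigma^2] \tag{16.12}$$

has variance $\sigma^2$ constant across observations. This implies that the latent variable $y^* \sim \mathcal{N}[x'\beta, \sigma^2]$. The observed $y$ is defined by (16.2) with $L = 0$, so

$$y = \begin{cases} y^* & \text{if } y^* > 0, \\ - & \text{if } y^* \le 0, \end{cases} \tag{16.13}$$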


where  -  means  that  y  is  observed  to  be  missing.  No  particular  value  of  y  is  necessarily 
observed  when  y*  <  0,  though  in  some  settings  such  as  durable  goods  expenditures 
we  observe  y  =  0. 

Equations (16.11)-(16.13) define the prototypical Tobit model analyzed by Tobin (1958). More generally, Tobit models begin with (16.11) and (16.12) for the latent variable but can have other censoring mechanisms including censoring from above, censoring from both below and above (the two-limit Tobit model), and interval-censored data. The results in this section are restricted to the censoring mechanism given in (16.13). The models of later sections are sometimes called generalized Tobit models.

The normalization $L = 0$ is not only natural in many settings, but some such normalization is necessary for a linear model with intercept and constant threshold parameter $L$. Then we observe $y$ if $y^* > L$, or equivalently if $\beta_1 + x_2'\beta_2 + \varepsilon > L$ or $(\beta_1 - L) + x_2'\beta_2 + \varepsilon > 0$. Thus only the difference $(\beta_1 - L)$ is identified. More generally, the latent model $y^* = x'\beta + \varepsilon$ with variable censoring threshold $L = x'\gamma$ is observationally equivalent to the latent model $y^* = x'(\beta - \gamma) + \varepsilon$ with fixed threshold $L = 0$. These results are a consequence of censoring arising in a linear model with additive error and do not carry over to nonlinear models, such as the preceding Poisson example.
Applying the general expression (16.7) for the censored density, here $f^*(y)$ is the $\mathcal{N}[x'\beta, \sigma^2]$ density and

$$\begin{aligned}
F^*(0) &= \Pr[y^* \le 0] \\
&= \Pr[x'\beta + \varepsilon \le 0] \\
&= \Phi(-x'\beta/\sigma) \\
&= 1 - \Phi(x'\beta/\sigma),
\end{aligned}$$

where $\Phi(\cdot)$ is the standard normal cdf and the last equality uses symmetry of the standard normal distribution. Thus the censored density can be expressed as

$$f(y) = \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y - x'\beta)^2\right\}\right]^{d} \left[1 - \Phi\left(\frac{x'\beta}{\sigma}\right)\right]^{1-d}, \tag{16.14}$$

where the binary indicator $d$ is defined in (16.6) with $L = 0$.

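The Tobit MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ maximizes the censored log-likelihood function (16.8). Given (16.14) this becomes

$$\ln L_N(\beta, \sigma^2) = \sum_{i=1}^{N} \left\{ d_i \left(-\frac{1}{2}\ln \sigma^2 - \frac{1}{2}\ln 2\pi - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2\right) + (1 - d_i) \ln\left(1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right) \right\}, \tag{16.15}$$

a mixture of discrete and continuous densities. The first-order conditions are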


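$$\begin{aligned}
\frac{\partial \ln L_N}{\partial \beta} &= \sum_{i=1}^{N} \frac{1}{\sigma^2} \left( d_i (y_i - x_i'\beta) - (1 - d_i) \frac{\sigma\phi_i}{1 - \Phi_i} \right) x_i = 0, \\
\frac{\partial \ln L_N}{\partial \sigma^2} &= \sum_{i=1}^{N} \left\{ d_i \left( -\frac{1}{2\sigma^2} + \frac{(y_i - x_i'\beta)^2}{2\sigma^4} \right) + (1 - d_i) \frac{\phi_i x_i'\beta}{1 - \Phi_i} \frac{1}{2\sigma^3} \right\} = 0,
\end{aligned} \tag{16.16}$$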


using $\partial \Phi(z)/\partial z = \phi(z)$, where $\phi(\cdot)$ is the standard normal pdf, and with the definitions $\phi_i = \phi(x_i'\beta/\sigma)$ and $\Phi_i = \Phi(x_i'\beta/\sigma)$. As usual $\hat{\theta}$ is consistent if the density is correctly specified, that is, if the dgp is (16.11) and (16.12) and the censoring mechanism is (16.13). The MLE is asymptotically normally distributed with variance matrix given in, for example, Maddala (1983, p. 155) and Amemiya (1985, p. 373).
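The Tobit log-likelihood (16.15) and its first-order conditions (16.16) can be coded and cross-checked against each other. The following is a sketch under stated assumptions: the function names, the tiny data set, and the parameter values are all hypothetical, and an observation is treated as censored when the recorded `y` is zero.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta, sigma2, X, y):
    """Censored normal log-likelihood (16.15) with censoring from below at zero."""
    sigma = math.sqrt(sigma2)
    total = 0.0
    for xi, yi in zip(X, y):
        xb = sum(a * b for a, b in zip(xi, beta))
        if yi > 0.0:  # d_i = 1
            total += (-0.5 * math.log(sigma2) - 0.5 * math.log(2.0 * math.pi)
                      - (yi - xb) ** 2 / (2.0 * sigma2))
        else:         # d_i = 0
            total += math.log(1.0 - Phi(xb / sigma))
    return total

def tobit_score(beta, sigma2, X, y):
    """Analytic derivatives in (16.16) with respect to beta and sigma^2."""
    sigma = math.sqrt(sigma2)
    g_beta = [0.0] * len(beta)
    g_sigma2 = 0.0
    for xi, yi in zip(X, y):
        xb = sum(a * b for a, b in zip(xi, beta))
        if yi > 0.0:
            weight = (yi - xb) / sigma2
            g_sigma2 += -1.0 / (2.0 * sigma2) + (yi - xb) ** 2 / (2.0 * sigma2 ** 2)
        else:
            ratio = phi(xb / sigma) / (1.0 - Phi(xb / sigma))
            weight = -ratio / sigma
            g_sigma2 += ratio * xb / (2.0 * sigma ** 3)
        for r, xr in enumerate(xi):
            g_beta[r] += weight * xr
    return g_beta, g_sigma2

# Hypothetical data: intercept plus one regressor, two censored observations.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]]
y = [1.2, 0.0, 3.1, 0.0]
beta = [0.3, 0.9]
sigma2 = 1.5
score_beta, score_sigma2 = tobit_score(beta, sigma2, X, y)
```

Comparing `tobit_score` with finite differences of `tobit_loglik` is a cheap way to catch sign or scaling slips before passing either function to an optimizer.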

Tobin  (1958)  proposed  ML  estimation  of  the  Tobit  model  and  asserted  that  the  usual 
ML  theory  applied.  Amemiya  (1973)  provided  a  formal  proof  that  the  usual  theory 
did  apply,  despite  the  mixed  discrete-continuous  nature  of  the  censored  density.  The 
appendix  of  this  classic  paper  of  Amemiya  details  the  asymptotic  theory  for  extremum 
estimators  presented  in  Section  5.3. 




TOBIT  AND  SELECTION  MODELS 


If data are truncated, rather than censored, from below at zero then the Tobit MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ maximizes the truncated normal log-likelihood function

$$\ln L_N(\beta, \sigma^2) = \sum_{i=1}^{N} \left\{ -\frac{1}{2}\ln \sigma^2 - \frac{1}{2}\ln 2\pi - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 - \ln \Phi(x_i'\beta/\sigma) \right\}, \tag{16.17}$$

obtained using (16.9) for $y^*$ distributed as in (16.11) and (16.12).


16.3.2.  Inconsistency  of  the  Tobit  MLE 

A major weakness of the Tobit MLE is its heavy reliance on distributional assumptions. If the error $\varepsilon$ is either heteroskedastic or nonnormal the MLE is inconsistent.

This can be seen from the ML first-order conditions (16.16), which are a quite complicated function of variables including $d_i$, $y_i$, $\phi_i$, and $\Phi_i$. The first equation in (16.16) satisfies $E[\partial \ln L_N/\partial \beta] = 0$, a necessary condition for consistency (see Section 5.3.7), if

$$\begin{aligned}
E[d_i] &= \Phi_i, \\
E[d_i y_i] &= \Phi_i x_i'\beta + \sigma\phi_i.
\end{aligned}$$

These moment conditions can be shown to hold if the dgp is (16.11) and (16.12) and the censoring mechanism is (16.13). However, they are unlikely to hold under any other specification of the dgp, as they rely heavily on both normality and homoskedasticity. For example, with heteroskedastic errors the estimator is inconsistent, since then $E[d_i] = \Phi(x_i'\beta/\sigma_i) \neq \Phi_i$ unless $\sigma_i^2 = \sigma^2$.

Consistent estimation with heteroskedastic normal errors is possible by specifying a model for heteroskedasticity, say $\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$. For censoring from below at zero the log-likelihood $\ln L_N(\boldsymbol{\beta}, \boldsymbol{\gamma})$ is that given in (16.15) with $\sigma^2$ replaced by $\exp(\mathbf{z}_i'\boldsymbol{\gamma})$. Consistency then requires normal errors and correct specification of the functional form of the heteroskedasticity.

Clearly,  with  censoring  or  truncation,  distributional  assumptions  become  important 
even  for  distributions  somewhat  robust  to  misspecification  in  the  uncensored  or  un¬ 
truncated  case.  Specification  tests  for  the  Tobit  model  are  discussed  in  Section  16.3.7. 
In  many  censored  data  applications  the  Tobit  model  is  not  appropriate.  More  general 
models  presented  in  subsequent  sections  of  this  chapter  are  instead  used. 


16.3.3.  Censored  and  Truncated  Means  in  Linear  Regression 

Censoring and truncation in the linear regression model (16.11) lead to an observed dependent variable $y$ that has a distribution with conditional mean other than $\mathbf{x}'\boldsymbol{\beta}$, conditional variance other than $\sigma^2$ even if $\varepsilon$ is homoskedastic, and a distribution that is nonnormal even if $\varepsilon$ is normally distributed. We present general results for linear regression in this section before specializing to normally distributed errors in Sections 16.3.4-16.3.7. The results provide additional insights regarding the consequences of truncation and censoring and form the basis for non-ML estimation methods presented in later sections.

We  begin  with  the  truncated  mean.  The  effects  of  truncation  are  intuitively  pre¬ 
dictable.  Left-truncation  excludes  small  values,  so  the  mean  should  increase,  whereas 
with  right-truncation  the  mean  should  decrease.  Since  truncation  reduces  the  range  of 
variation,  the  variance  should  decrease. 

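For left-truncation at zero we only observe $y$ if $y^* > 0$. If we suppress dependence of expectations on $\mathbf{x}$ for notational simplicity, the left-truncated mean becomes

$$\begin{aligned}
\mathrm{E}[y] &= \mathrm{E}[y^* \mid y^* > 0] \\
&= \mathrm{E}[\mathbf{x}'\boldsymbol{\beta} + \varepsilon \mid \mathbf{x}'\boldsymbol{\beta} + \varepsilon > 0] \\
&= \mathrm{E}[\mathbf{x}'\boldsymbol{\beta} \mid \mathbf{x}'\boldsymbol{\beta} + \varepsilon > 0] + \mathrm{E}[\varepsilon \mid \mathbf{x}'\boldsymbol{\beta} + \varepsilon > 0] \\
&= \mathbf{x}'\boldsymbol{\beta} + \mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}],
\end{aligned} \tag{16.18}$$

where the second equality uses (16.11), and the last equality assumes $\varepsilon$ is independent of $\mathbf{x}$. As expected the truncated mean exceeds $\mathbf{x}'\boldsymbol{\beta}$, since $\mathrm{E}[\varepsilon \mid \varepsilon > c]$ for any constant $c$ will exceed $\mathrm{E}[\varepsilon]$.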

For data left-censored at zero suppose we observe $y = 0$, rather than merely that $y^* \leq 0$. The censored mean is obtained by first conditioning the observable $y$ on the binary indicator $d$ defined in (16.6) with $L = 0$ and then unconditioning. Suppressing dependence on $\mathbf{x}$ for notational simplicity again, we have the left-censored mean

$$\begin{aligned}
\mathrm{E}[y] &= \mathrm{E}_d[\mathrm{E}_{y|d}[y \mid d]] \\
&= \Pr[d = 0] \times \mathrm{E}[y \mid d = 0] + \Pr[d = 1] \times \mathrm{E}[y \mid d = 1] \\
&= 0 \times \Pr[y^* \leq 0] + \Pr[y^* > 0] \times \mathrm{E}[y^* \mid y^* > 0] \\
&= \Pr[y^* > 0] \times \mathrm{E}[y^* \mid y^* > 0],
\end{aligned} \tag{16.19}$$

where $\Pr[y^* > 0] = 1 - \Pr[y^* \leq 0] = \Pr[\varepsilon > -\mathbf{x}'\boldsymbol{\beta}]$ is one minus the censoring probability and $\mathrm{E}[y^* \mid y^* > 0]$ is the truncated mean already derived in (16.18).

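In summary, for the linear regression model with censoring or truncation from below at zero, the conditional means are given by

$$\begin{array}{ll}
\text{latent variable:} & \mathrm{E}[y^* \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}, \\
\text{left-truncated (at 0):} & \mathrm{E}[y \mid \mathbf{x}, y > 0] = \mathbf{x}'\boldsymbol{\beta} + \mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}], \\
\text{left-censored (at 0):} & \mathrm{E}[y \mid \mathbf{x}] = \Pr[\varepsilon > -\mathbf{x}'\boldsymbol{\beta}]\,\{\mathbf{x}'\boldsymbol{\beta} + \mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}]\}.
\end{array} \tag{16.20}$$

It is clear that even though the original conditional mean is linear, censoring or truncation leads to conditional means that are nonlinear so that OLS estimates will be inconsistent.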
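
Because these results do not depend on the error distribution, they can be checked by simulation with deliberately nonnormal errors. The sketch below is illustrative only: the value of $\mathbf{x}'\boldsymbol{\beta}$, the uniform error distribution, the seed, and the sample size are all assumptions chosen for convenience.

```python
import random

random.seed(12345)

XB = 0.5            # hypothetical value of x'beta, held fixed
N = 100_000

# Errors are uniform on (-2, 2): mean zero, deliberately not normal.
y_star = [XB + random.uniform(-2.0, 2.0) for _ in range(N)]

positives = [v for v in y_star if v > 0]          # truncated sample
censored = [max(v, 0.0) for v in y_star]          # censored sample

latent_mean = sum(y_star) / N
truncated_mean = sum(positives) / len(positives)
censored_mean = sum(censored) / N
prob_positive = len(positives) / N

print(latent_mean, truncated_mean, censored_mean)
print(prob_positive * truncated_mean)   # matches the censored mean
```

With these settings the latent variable is uniform on (-1.5, 2.5), so the population truncated mean is 1.25 and the censored mean is 0.625 x 1.25 = 0.78125, both different from the latent mean of 0.5.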

One possible approach to take is a parametric one of assuming a distribution for $\varepsilon$. This leads to expressions for $\mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}]$ and $\Pr[\varepsilon > -\mathbf{x}'\boldsymbol{\beta}]$ and hence the truncated or censored conditional mean. We do this in the next section for normally distributed errors.

[Figure 16.2 plot not reproduced; title: Inverse Mills Ratio as Cutoff Varies; horizontal axis: cutoff point c.]

Figure 16.2: Inverse Mills ratio for the standard normal distribution as the censoring or cutoff point c increases. Standard normal cdf and density also plotted.


A second approach seeks to avoid or minimize such parametric assumptions. We consider this in a later section, but note here that regardless of the distribution for $\varepsilon$ the truncated mean is a single-index model with correction term decreasing in $\mathbf{x}'\boldsymbol{\beta}$ since $\mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}]$ is a monotonically decreasing function in $\mathbf{x}'\boldsymbol{\beta}$.

16.3.4.  Censored  and  Truncated  Means  in  the  Tobit  Model 

For the Tobit model the regression error $\varepsilon$ is normal and we use the following result, derived in Section 16.10.1.

Proposition 16.1 (Truncated Moments of the Standard Normal): Suppose $z \sim \mathcal{N}[0, 1]$. Then the left-truncated moments of $z$ are

(i) $\mathrm{E}[z \mid z > c] = \phi(c)/[1 - \Phi(c)]$, and $\mathrm{E}[z \mid z > -c] = \phi(c)/\Phi(c)$,

(ii) $\mathrm{E}[z^2 \mid z > c] = 1 + c\phi(c)/[1 - \Phi(c)]$, and

(iii) $\mathrm{V}[z \mid z > c] = 1 + c\phi(c)/[1 - \Phi(c)] - \phi(c)^2/[1 - \Phi(c)]^2$.
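
The expressions in Proposition 16.1 are easy to tabulate. The following sketch uses only Python's standard library; the function names are assumptions introduced for illustration, and the grid of cutoffs from -2 to 2 is chosen to match the range discussed for Figure 16.2.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_mean(c):
    """E[z | z > c] for z ~ N[0, 1] (Proposition 16.1, part i)."""
    return STD_NORMAL.pdf(c) / (1.0 - STD_NORMAL.cdf(c))


def truncated_variance(c):
    """V[z | z > c] for z ~ N[0, 1] (Proposition 16.1, part iii)."""
    m = truncated_mean(c)
    return 1.0 + c * m - m ** 2


# Columns: cutoff c, cdf Phi(c), truncated mean, truncated variance.
for c in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(c, STD_NORMAL.cdf(c), truncated_mean(c), truncated_variance(c))
```

For cutoffs far in the right tail the denominator `1 - Phi(c)` loses precision in floating point, so a more careful implementation would be needed there; the sketch is intended only for moderate values of c.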

Result (i) of Proposition 16.1 is shown in Figure 16.2. We consider truncation of $z \sim \mathcal{N}[0, 1]$ from below at $c$, where $c$ ranges from $-2$ to $2$. The lowest curve is the standard normal density $\phi(c)$ evaluated at $c$. The middle curve is the standard normal cdf $\Phi(c)$ evaluated at $c$ and gives the probability of truncation when truncation is at $c$. This probability is approximately 0.023 at $c = -2$ and 0.977 at $c = 2$. The upper curve gives the truncated mean $\mathrm{E}[z \mid z > c] = \phi(c)/[1 - \Phi(c)]$. As expected this is close to $\mathrm{E}[z] = 0$ for $c = -2$, since then there is little truncation, and $\mathrm{E}[z \mid z > c] > c$. What is not expected a priori is that $\phi(c)/[1 - \Phi(c)]$ is approximately linear, especially for $c > 0$. Moments when truncation is from above can be obtained using, for example, $\mathrm{E}[z \mid z < c] = -\mathrm{E}[-z \mid -z > -c] = -\phi(c)/\Phi(c)$.

Applying this result to (16.18), the error term has truncated mean

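$$\begin{aligned}
\mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}] &= \sigma\,\mathrm{E}\!\left[\frac{\varepsilon}{\sigma} \,\Big|\, \frac{\varepsilon}{\sigma} > \frac{-\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right] \\
&= \sigma\,\phi\!\left(-\frac{\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right) \Big/ \left[1 - \Phi\!\left(-\frac{\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right)\right] \\
&= \sigma\,\phi\!\left(\frac{\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right) \Big/ \Phi\!\left(\frac{\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right) \\
&= \sigma\,\lambda\!\left(\frac{\mathbf{x}'\boldsymbol{\beta}}{\sigma}\right),
\end{aligned} \tag{16.21}$$

where the second line uses Proposition 16.1, the third line uses symmetry about zero of $\phi(z)$, and we define

$$\lambda(z) = \frac{\phi(z)}{\Phi(z)}. \tag{16.22}$$

We follow the definition and terminology of Amemiya (1985) and many others in defining $\lambda(\cdot)$ as in (16.22) and calling it the inverse Mills ratio. From Johnson and Kotz (1970, p. 278), Mills actually tabulated the ratio $(1 - \Phi(z))/\phi(z)$ whose inverse $\phi(z)/[1 - \Phi(z)] = \phi(z)/\Phi(-z)$ is the hazard function of the normal distribution. Some authors therefore instead write (16.21) as $\mathrm{E}[\varepsilon \mid \varepsilon > -\mathbf{x}'\boldsymbol{\beta}] = \sigma\lambda^*(-\mathbf{x}'\boldsymbol{\beta}/\sigma)$, where $\lambda^*(z) = \phi(z)/\Phi(-z)$ is referred to as the inverse Mills ratio.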

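Also, $\Pr[\varepsilon > -\mathbf{x}'\boldsymbol{\beta}] = \Pr[-\varepsilon < \mathbf{x}'\boldsymbol{\beta}] = \Pr[-\varepsilon/\sigma < \mathbf{x}'\boldsymbol{\beta}/\sigma] = \Phi(\mathbf{x}'\boldsymbol{\beta}/\sigma)$. Then the conditional means in (16.20) specialize to

$$\begin{array}{ll}
\text{latent variable:} & \mathrm{E}[y^* \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}, \\
\text{left-truncated (at 0):} & \mathrm{E}[y \mid \mathbf{x}, y > 0] = \mathbf{x}'\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}'\boldsymbol{\beta}/\sigma), \\
\text{left-censored (at 0):} & \mathrm{E}[y \mid \mathbf{x}] = \Phi(\mathbf{x}'\boldsymbol{\beta}/\sigma)\,\mathbf{x}'\boldsymbol{\beta} + \sigma\phi(\mathbf{x}'\boldsymbol{\beta}/\sigma).
\end{array} \tag{16.23}$$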

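The variance is similarly obtained (see Exercise 16.1). Defining $w = \mathbf{x}'\boldsymbol{\beta}/\sigma$, we have

$$\begin{array}{ll}
\text{latent variable:} & \mathrm{V}[y^* \mid \mathbf{x}] = \sigma^2, \\
\text{left-truncated (at 0):} & \mathrm{V}[y \mid \mathbf{x}, y > 0] = \sigma^2\left[1 - w\lambda(w) - \lambda(w)^2\right], \\
\text{left-censored (at 0):} & \mathrm{V}[y \mid \mathbf{x}] = \sigma^2\Phi(w)\left\{w^2 + w\lambda(w) + 1 - \Phi(w)[w + \lambda(w)]^2\right\}.
\end{array} \tag{16.24}$$

Clearly truncation and censoring induce heteroskedasticity, and for truncation $\mathrm{V}[y \mid \mathbf{x}, y > 0] < \sigma^2$ so that truncation reduces variability, as expected.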
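
The mean and variance formulas in (16.23) and (16.24) can be compared with simulated moments. The sketch below is a check under stated assumptions rather than part of the text's derivation: the values $\mathbf{x}'\boldsymbol{\beta} = 1$ and $\sigma = 2$, the seed, the sample size, and the helper names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inv_mills(z):
    """lambda(z) = phi(z) / Phi(z), as in (16.22)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)


def tobit_moments(xb, sigma):
    """Means (16.23) and variances (16.24) given x'beta and sigma."""
    w = xb / sigma
    lam, Phi, phi = inv_mills(w), STD_NORMAL.cdf(w), STD_NORMAL.pdf(w)
    return {
        "truncated_mean": xb + sigma * lam,
        "censored_mean": Phi * xb + sigma * phi,
        "truncated_var": sigma**2 * (1 - w * lam - lam**2),
        "censored_var": sigma**2 * Phi * (w**2 + w * lam + 1 - Phi * (w + lam) ** 2),
    }


def mean_and_variance(values):
    """Sample mean and (population-style) variance of a list."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / n


# Monte Carlo check at hypothetical values x'beta = 1, sigma = 2.
random.seed(2024)
draws = [random.gauss(1.0, 2.0) for _ in range(200_000)]
trunc_m, trunc_v = mean_and_variance([v for v in draws if v > 0])
cens_m, cens_v = mean_and_variance([max(v, 0.0) for v in draws])

theory = tobit_moments(1.0, 2.0)
simulated = {
    "truncated_mean": trunc_m,
    "censored_mean": cens_m,
    "truncated_var": trunc_v,
    "censored_var": cens_v,
}
for key in theory:
    print(key, round(theory[key], 4), round(simulated[key], 4))
```

At these values the truncated variance is roughly half of $\sigma^2 = 4$, which illustrates the reduction in variability noted above.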

These  results  assume  normal  errors.  Maddala  (1983,  p.  369)  gives  results  similar 
to  Proposition  16.1  for  the  log-normal,  logistic,  uniform,  Laplace,  exponential,  and 
gamma  distributions. 


16.3.5.  Marginal  Effects  in  the  Tobit  Model 

The marginal effect is the effect on the conditional mean of the dependent variable of changes in the regressors. This effect varies according to whether interest lies in the latent variable mean $\mathbf{x}'\boldsymbol{\beta}$ or the truncated or censored means given in (16.23). Differentiating each with respect to $\mathbf{x}$ yields

$$\begin{array}{ll}
\text{latent variable:} & \partial\mathrm{E}[y^* \mid \mathbf{x}]/\partial\mathbf{x} = \boldsymbol{\beta}, \\
\text{left-truncated (at 0):} & \partial\mathrm{E}[y \mid \mathbf{x}, y > 0]/\partial\mathbf{x} = \{1 - w\lambda(w) - \lambda(w)^2\}\boldsymbol{\beta}, \\
\text{left-censored (at 0):} & \partial\mathrm{E}[y \mid \mathbf{x}]/\partial\mathbf{x} = \Phi(w)\boldsymbol{\beta},
\end{array} \tag{16.25}$$

where $w = \mathbf{x}'\boldsymbol{\beta}/\sigma$ and we use $\partial\Phi(z)/\partial z = \phi(z)$ and $\partial\phi(z)/\partial z = -z\phi(z)$. The simple expression for the censored mean is obtained after some manipulation. It can be decomposed into two effects, one for $y = 0$ and one for $y > 0$ (see McDonald and Moffitt, 1980).
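
The analytical marginal effects can be verified against numerical derivatives of the conditional means in (16.23). This is a minimal sketch for a single scalar regressor with no intercept; the parameter values and all function names are assumptions chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inv_mills(z):
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)


def truncated_mean(x, beta, sigma):
    w = x * beta / sigma
    return x * beta + sigma * inv_mills(w)


def censored_mean(x, beta, sigma):
    w = x * beta / sigma
    return STD_NORMAL.cdf(w) * x * beta + sigma * STD_NORMAL.pdf(w)


def marginal_effects(x, beta, sigma):
    """Analytical marginal effects from (16.25) for a scalar regressor."""
    w = x * beta / sigma
    lam = inv_mills(w)
    return {
        "latent": beta,
        "truncated": (1 - w * lam - lam**2) * beta,
        "censored": STD_NORMAL.cdf(w) * beta,
    }


def central_difference(f, x, h=1e-5):
    """Two-sided numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Hypothetical values: beta = 1.5, sigma = 2, evaluated at x = 0.8.
BETA, SIGMA, X0 = 1.5, 2.0, 0.8
analytic = marginal_effects(X0, BETA, SIGMA)
numeric_truncated = central_difference(lambda x: truncated_mean(x, BETA, SIGMA), X0)
numeric_censored = central_difference(lambda x: censored_mean(x, BETA, SIGMA), X0)
print(analytic, numeric_truncated, numeric_censored)
```

Both the truncated and censored effects are scaled-down versions of the latent effect, with scale factors that vary with x through w.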

In some cases truncation or censoring is just an artifact of data collection, so the truncated and censored means are of no intrinsic interest and we are interested in $\partial\mathrm{E}[y^* \mid \mathbf{x}]/\partial\mathbf{x} = \boldsymbol{\beta}$. For example, with top-coded earnings data we are clearly interested in measuring the effect of schooling on mean earnings rather than earnings of those not top-coded.

In other cases truncation or censoring has behavioral implications. In a model for hours worked, for example, the three marginal effects in (16.25) correspond to the effect of a change in a regressor on, respectively, (1) desired hours of work, (2) actual hours of work for workers, and (3) actual hours of work for workers and nonworkers. For (1) we clearly need an estimate of $\boldsymbol{\beta}$, but for (2) and (3) OLS slope coefficients, although inconsistent for $\boldsymbol{\beta}$, may actually provide a reasonable crude estimate of the marginal effect since the truncated and censored means are still fairly linear in $\mathbf{x}$.

16.3.6.  Alternative  Estimators  for  the  Tobit  Model 

In  addition  to  the  MLE,  consistent  estimation  is  possible  by  NLS  based  on  the  correct 
expression  for  the  truncated  or  censored  mean.  We  consider  the  NLS  estimator  and 
other  least-squares  estimators. 


NLS  Estimator 

The results in (16.23) can be used to permit consistent estimation of the Tobit model parameters by NLS. For example, with truncated data we minimize

$$S_N(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^{N} \left(y_i - \mathbf{x}_i'\boldsymbol{\beta} - \sigma\lambda(\mathbf{x}_i'\boldsymbol{\beta}/\sigma)\right)^2$$

with respect to both $\boldsymbol{\beta}$ and $\sigma^2$, but then perform inference controlling for the heteroskedasticity given in (16.24). A similar estimator can be obtained for censored data.

This estimator is not used in practice. Consistency requires correct specification of the truncated mean, which from (16.21) requires both normality and homoskedasticity of the errors. One might as well estimate by ML since this relies on assumptions just as strong and is fully efficient. Moreover, in practice the NLS estimator can be imprecise. From Figure 16.2 it is clear that $\lambda(\mathbf{x}'\boldsymbol{\beta}/\sigma)$ is approximately linear in $\mathbf{x}'\boldsymbol{\beta}/\sigma$, leading to near collinearity because $\mathbf{x}$ is also a regressor. In Section 16.5 we consider models that permit correction terms similar to $\sigma\lambda(\mathbf{x}'\boldsymbol{\beta}/\sigma)$ in (16.23) that have the advantage of depending in part on regressors other than those in $\mathbf{x}$.


Heckman Two-Step Estimator

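From (16.23) the truncated (at zero) mean is

$$\mathrm{E}[y \mid \mathbf{x}, y > 0] = \mathbf{x}'\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}'\boldsymbol{\beta}/\sigma). \tag{16.26}$$

Rather than use NLS, this can be estimated in the following two-step procedure if censored data are available. First, for the full sample do probit regression of $d$ on $\mathbf{x}$, where the binary variable $d$ equals one if $y > 0$ is observed, to give the consistent estimate $\hat{\boldsymbol{\alpha}}$, where $\boldsymbol{\alpha} = \boldsymbol{\beta}/\sigma$. Second, for the truncated sample do OLS regression of $y$ on $\mathbf{x}$ and $\lambda(\mathbf{x}'\hat{\boldsymbol{\alpha}})$ to give consistent estimates of $\boldsymbol{\beta}$ and $\sigma$.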
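
The two steps can be sketched on simulated data. Everything below is an illustrative assumption rather than the procedure as implemented in any particular package: the data-generating values, the seed, the hand-written Fisher-scoring probit, and the small linear solver are all hypothetical helpers written with the standard library only. The sketch reports point estimates and does not compute the corrected standard errors.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inv_mills(z):
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y_i for r, y_i in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)


def probit(rows, d, iterations=10):
    """Probit MLE by Fisher scoring, starting from zero."""
    k = len(rows[0])
    alpha = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, d_i in zip(rows, d):
            w = sum(a * v for a, v in zip(alpha, r))
            P = min(max(STD_NORMAL.cdf(w), 1e-10), 1 - 1e-10)
            f = STD_NORMAL.pdf(w)
            for i in range(k):
                score[i] += f * (d_i - P) / (P * (1 - P)) * r[i]
                for j in range(k):
                    info[i][j] += f * f / (P * (1 - P)) * r[i] * r[j]
        alpha = [a + s for a, s in zip(alpha, solve(info, score))]
    return alpha


# Simulated censored sample with hypothetical parameters.
random.seed(7)
B0, B1, SIGMA, N = 0.5, 1.0, 1.0, 20_000
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y_star = [B0 + B1 * x_i + random.gauss(0.0, SIGMA) for x_i in x]
d = [1 if v > 0 else 0 for v in y_star]
X = [[1.0, x_i] for x_i in x]

# Step 1: probit of d on x over the full sample estimates alpha = beta / sigma.
alpha_hat = probit(X, d)

# Step 2: OLS of y on x and lambda(x'alpha_hat) over the positive observations.
rows2 = [r + [inv_mills(alpha_hat[0] + alpha_hat[1] * r[1])]
         for r, d_i in zip(X, d) if d_i]
y2 = [v for v in y_star if v > 0]
b0_hat, b1_hat, sigma_hat = ols(rows2, y2)

# Naive OLS on the truncated sample, for comparison.
naive = ols([r for r, d_i in zip(X, d) if d_i], y2)
print(alpha_hat, (b0_hat, b1_hat, sigma_hat), naive)
```

The coefficient on the inverse Mills ratio term estimates $\sigma$. The naive truncated-sample slope is included only to show the attenuation that the correction term is meant to remove.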

This estimation procedure, due to Heckman (1976, 1979), is presented in Section 16.5.4 where it is applied to the more general sample selection model. Section 16.10.2 derives the standard error of $\hat{\boldsymbol{\beta}}$ that accounts for the regressor $\lambda(\mathbf{x}'\hat{\boldsymbol{\alpha}})$ depending on estimated parameters and for heteroskedasticity induced by truncation.


OLS  Estimation  of  the  Tobit  Model 

The OLS estimates using censored or truncated data are inconsistent for $\boldsymbol{\beta}$. This is because the censored and truncated means given in (16.23) are not equal to $\mathbf{x}'\boldsymbol{\beta}$, violating the essential condition for consistency of OLS.

For censored data, OLS provides a linear approximation to the nonlinear censored regression curve. It is clear from Figure 16.1 and (16.25) that this line is flatter than the regression line for uncensored data, which has slope equal to the true slope parameter. Goldberger (1981) showed analytically that if $y$ and $\mathbf{x}$ are jointly normally distributed and there is censoring from below at zero, then the OLS slope parameters converge to $p$ times the true slope parameter, where $p$ is the fraction of the sample with positive values of $y$. These conditions are restrictive but were relaxed somewhat by Ruud (1986). In practice this proportionality result provides a good empirical approximation to the inconsistency of OLS if a Tobit model is instead appropriate.
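
The proportionality result for censored data is straightforward to illustrate by simulation. In the sketch below the regressor and the error are both drawn as normal, so that the latent variable and the regressor are jointly normal; the intercept, slope, seed, and sample size are hypothetical values chosen for illustration.

```python
import random

random.seed(99)

A, B, N = 0.3, 1.2, 100_000   # hypothetical intercept, slope, sample size

x = [random.gauss(0.0, 1.0) for _ in range(N)]
y_star = [A + B * x_i + random.gauss(0.0, 1.0) for x_i in x]   # latent outcome
y = [max(v, 0.0) for v in y_star]                                # censored at zero


def ols_slope(x, y):
    """Slope from a simple regression of y on an intercept and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


p = sum(1 for v in y if v > 0) / N      # fraction of positive observations
slope_latent = ols_slope(x, y_star)     # infeasible benchmark using y*
slope_censored = ols_slope(x, y)        # OLS on censored data

print(p, slope_latent, slope_censored, p * B)
```

The censored-data slope should be close to p times the true slope, so dividing the OLS slope by the sample fraction of positive observations gives a rough correction in this jointly normal setting.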

Similarly, with truncation the regression line is flatter than the untruncated regression line. Goldberger (1981) obtained an analytical result similar to that for the censored case. If $y$ and $\mathbf{x}$ are jointly normally distributed and there is truncation from below at zero, then the OLS slope parameters converge to a multiple of the true slope parameter. The multiple, the expression for which is quite lengthy, lies between zero and one, and the shrinkage is the same for all slope coefficients. Truncated OLS therefore understates the absolute magnitude of the true slope parameters.


16.3.7.  Specification  Tests  for  the  Tobit  Model 

Given the fragility of the Tobit model it is good practice to test for distributional misspecification. There are four broad strategies.

The first approach is to nest the Tobit model within a richer parametric model and apply a Wald, LR, or LM test. Since the null hypothesis model, the Tobit model, is most easily estimated it is natural to use LM tests. This is particularly straightforward for testing against heteroskedasticity of the form $\sigma_i^2 = \exp(\mathbf{x}_i'\boldsymbol{\alpha})$ in the censored regression model. Using the OPG form of the LM test (see Section 7.3.5) we compute $N$ times the uncentered $R^2$ from auxiliary regression of 1 on $\tilde{\mathbf{s}}_{1i}$ and $\tilde{\mathbf{s}}_{2i}$, where $f_i = f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \boldsymbol{\alpha})$ is the density given in (16.14) with $\sigma^2$ replaced by $\exp(\mathbf{x}_i'\boldsymbol{\alpha})$, the expressions for $\mathbf{s}_{1i} = \partial \ln f_i/\partial\boldsymbol{\beta}$ and $\mathbf{s}_{2i} = \partial \ln f_i/\partial\boldsymbol{\alpha}$ are obtained by minor adaptation of the expressions in (16.16), and tilde denotes evaluation at the censored Tobit MLE with all components of $\boldsymbol{\alpha}$ except that for the intercept equal to zero. A similar approach for testing the assumption of normally distributed errors is more difficult as there is no standard generalization of the normal.

A second approach is to use conditional moment tests (see Section 8.2) that do not require specification of an alternative hypothesis model. In particular, the first-order conditions (16.16) for the censored Tobit MLE suggest conditional moment tests based on the generalized residual

$$e_i = d_i\,\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma^2} - (1 - d_i)\,\frac{\phi_i}{\sigma(1 - \Phi_i)}.$$

If the Tobit model is correctly specified then $\mathrm{E}[e_i \mid \mathbf{x}_i] = 0$ since the regularity conditions imply that $\mathrm{E}[\partial \ln f(y_i)/\partial\boldsymbol{\beta}] = \mathbf{0}$. Then we can implement an m-test of $H_0: \mathrm{E}[e\mathbf{z}] = \mathbf{0}$ against $H_a: \mathrm{E}[e\mathbf{z}] \neq \mathbf{0}$ using $N^{-1}\sum_i \hat{e}_i\mathbf{z}_i$, where $\hat{e}_i = e_i$ evaluated at the Tobit MLE $(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)$. From Section 8.2.2 this test can be implemented by computing $N$ times the uncentered $R^2$ from auxiliary regression of 1 on $\hat{e}_i\mathbf{z}_i$, $\hat{\mathbf{s}}_{1i}$, and $\hat{s}_{2i}$, where $f_i = f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2)$ is the density given in (16.14) and $\mathbf{s}_{1i} = \partial \ln f_i/\partial\boldsymbol{\beta}$ and $s_{2i} = \partial \ln f_i/\partial\sigma^2$ given in (16.16) are evaluated at $(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)$. The variables $\mathbf{z}_i$ may be variables other than $\mathbf{x}_i$, in which case the test can be interpreted as a test of omitted regressors, or powers of the components of $\mathbf{x}_i$. Conditional moment tests based on higher order moments have also been developed. For details see Chesher and Irish (1987) and Pagan and Vella (1989).
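
The zero-mean property of the generalized residual can be illustrated by simulation. The sketch below is a simplification and not the m-test itself: it evaluates the residual at assumed true parameter values $(\beta, \sigma) = (1, 1)$ rather than at the Tobit MLE, uses a single hypothetical regressor, and takes the first term of the residual to be scaled by $\sigma^2$ as in the score for $\boldsymbol{\beta}$. The heteroskedastic design is likewise an assumption chosen only to show the moment moving away from zero.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def generalized_residual(y, xb, sigma):
    """Generalized residual for the Tobit model censored from below at zero."""
    w = xb / sigma
    if y > 0:                                   # d_i = 1: uncensored
        return (y - xb) / sigma**2
    # d_i = 0: censored observation
    return -STD_NORMAL.pdf(w) / (sigma * (1.0 - STD_NORMAL.cdf(w)))


def mean_residual(heteroskedastic, n=100_000, seed=11):
    """Average generalized residual, evaluated at (beta, sigma) = (1, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.uniform(0.0, 2.0)
        scale = (0.5 + x) if heteroskedastic else 1.0    # true error std. dev.
        y = max(1.0 * x + rng.gauss(0.0, scale), 0.0)
        total += generalized_residual(y, 1.0 * x, 1.0)   # model assumes sigma = 1
    return total / n


mean_hom = mean_residual(False)   # correctly specified: close to zero
mean_het = mean_residual(True)    # heteroskedastic errors: away from zero
print(mean_hom, mean_het)
```

In an actual application the residual would be interacted with the chosen variables $\mathbf{z}_i$ and the test statistic obtained from the auxiliary regression described above.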

A  third  approach  is  to  adapt  some  of  the  diagnostic  and  testing  methods  developed 
for  right-censored  duration  data  (see  Chapter  19)  to  left-censored  normally  distributed 
data. 

A final approach contrasts the Tobit MLE $\hat{\boldsymbol{\beta}}$ with alternative estimates of $\boldsymbol{\beta}$, notably the semiparametric estimates presented in Section 16.9, that are consistent under weaker distributional assumptions.

For  further  details  see  Pagan  and  Vella  (1989),  who  present  theory  with  some  ap¬ 
plication,  and  Melenberg  and  Van  Soest  (1996),  who  provide  a  more  complete  appli¬ 
cation.  Both  papers  consider  specification  tests  for  the  richer  sample  selection  model 
(see  Section  16.5)  in  addition  to  those  for  the  Tobit  model. 


16.4.  Two-Part  Model 

The preceding models for censored data restrict the censoring mechanism to be from the same model as that generating the outcome variable. More generally, the censoring mechanism and outcome may be modeled using separate processes. For example, in explaining individual annual hospital expenses one process may determine hospitalization and a second process may explain consequent hospital expenses. The case for postulating two separate mechanisms is strong if there is compelling reason to believe that certain realized values occur with a frequency too large or too small to be consistent with a simpler model. For example, one might observe many more zeros than is consistent with the Poisson distribution. A two-part model that permits the zeros and non-zeros to be generated by different densities adds flexibility. Indeed it is a specific type of mixture model.

There  are  two  approaches  to  such  generalization.  The  two-part  model,  given  in  this 
section,  specifies  a  model  for  the  censoring  mechanism  and  a  model  for  the  outcome 
conditional  on  the  outcome  being  observed.  The  sample  selection  model,  presented  in 
the  subsequent  section,  instead  specifies  a  joint  distribution  for  the  censoring  mecha¬ 
nism  and  outcome,  and  then  finds  the  implied  distribution  conditional  on  the  outcome 
observed.  These  approaches  are  contrasted  in  Section  16.5.7. 


16.4.1.  Two-Part  Model 


Let an individual with fully observed outcome be called a participant in the activity being studied. Define a binary indicator variable $d = 1$ for participants and $d = 0$ for nonparticipants. Suppose that $y > 0$ is observed for participants and $y = 0$ is observed for nonparticipants. For nonparticipants we observe only $\Pr[d = 0]$. For participants the conditional density of $y$ given $y > 0$ is specified to be $f(y \mid d = 1)$, for some choice of density $f(\cdot)$. The two-part model for $y$ is then given by

$$f(y \mid \mathbf{x}) = \begin{cases} \Pr[d = 0 \mid \mathbf{x}] & \text{if } y = 0, \\ \Pr[d = 1 \mid \mathbf{x}]\, f(y \mid d = 1, \mathbf{x}) & \text{if } y > 0. \end{cases} \tag{16.27}$$


This model was presented in detail by Cragg (1971) as a generalization of the Tobit model, which can be presented as a special case of (16.27). An obvious model for the participation decision $d$ is a probit or logit model. A latent variable formulation is that $d = 1$ if $I = \mathbf{x}'\boldsymbol{\beta} + \varepsilon$ exceeds zero, and the model is then viewed as a hurdle model since crossing a hurdle or threshold leads to participation. To ensure positive values for the participants, the density $f(y \mid d = 1, \mathbf{x})$ should be that for a positive-valued random variable, such as the log-normal, or an appropriate density such as the normal truncated from below at zero.

For  simplicity  the  same  regressors  usually  appear  in  both  parts  of  the  model,  but 
this  can  be  relaxed  and  should  be  if  there  are  obvious  exclusion  restrictions.  Maximum 
likelihood  estimation  is  straightforward  as  it  separates  into  estimation  of  a  discrete 
choice  model  using  all  observations  and  estimation  of  the  parameters  of  the  density 
$f(y \mid d = 1, \mathbf{x})$ using only observations with $y > 0$.


16.4.2.  Two-Part  Model  Examples 

Duan et al. (1983) present a leading application of this model to forecasting medical expenses using data from the Rand Health Insurance Experiment. They specified a probit model for whether or not any medical expenses were incurred during the year, so $\Pr[d = 1 \mid \mathbf{x}] = \Phi(\mathbf{x}_1'\boldsymbol{\beta}_1)$, and a log-normal model for medical expenses given that some expenses were incurred, so $\ln y \mid d = 1, \mathbf{x} \sim \mathcal{N}[\mathbf{x}_2'\boldsymbol{\beta}_2, \sigma_2^2]$. Then expected medical expenses over the entire population are given by

$$\mathrm{E}[y \mid \mathbf{x}] = \Phi(\mathbf{x}_1'\boldsymbol{\beta}_1)\exp[\sigma_2^2/2 + \mathbf{x}_2'\boldsymbol{\beta}_2], \tag{16.28}$$

where the second term uses the result that if $\ln y \sim \mathcal{N}[\mu, \sigma^2]$ then $\mathrm{E}[y] = \exp(\mu + \sigma^2/2)$. Mullahy (1998) considers such retransformation in further detail.
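
The prediction in (16.28) is a product of two pieces and is simple to compute once both parts have been estimated. The sketch below takes the two index values and the log-scale variance as given; the function name and the numerical inputs are hypothetical and are not estimates from the Rand data.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_part_mean(index1, index2, sigma2_sq):
    """E[y|x] in (16.28): probit participation times log-normal mean.

    index1    : x_1'beta_1, the probit index
    index2    : x_2'beta_2, the mean of ln y for participants
    sigma2_sq : variance of ln y for participants
    """
    prob_positive = STD_NORMAL.cdf(index1)
    mean_if_positive = math.exp(index2 + sigma2_sq / 2.0)
    return prob_positive * mean_if_positive


# Hypothetical index values, chosen only for illustration.
full = two_part_mean(0.0, 1.0, 1.0)
naive = STD_NORMAL.cdf(0.0) * math.exp(1.0)   # omits the sigma^2 / 2 term
print(full, naive)
```

Omitting the variance term in the exponent understates the conditional mean of expenses, which is the retransformation issue referred to above.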

Two-part  models  are  especially  popular  for  modeling  count  data.  For  example,  in 
modeling  the  number  of  doctor  visits  there  is  one  model  to  determine  whether  or  not 
a  patient  visits  a  physician  at  all  and  a  second  model  to  determine  the  consequent 
number of visits for those with at least one visit. Then $\Pr[d = 1]$ is specified to be the probability that a Poisson or negative binomial variable exceeds zero, whereas the density $f(y \mid d = 1)$ is specified to be a Poisson or negative binomial density truncated from below at zero. This model, due to Mullahy (1986), is called a hurdle model in the count literature and is detailed in Section 20.4.5.

For  continuous  data  two-part  models  are  used  for  expenditure  models  with  excess 
zeros  (Cragg’s  original  motivation).  An  alternative,  a  sample  selection  model,  is  pre¬ 
sented  next. 


16.5.  Sample  Selection  Models 

Sample selection can arise in many settings and so there are many sample selection
models.  This  section  begins  with  a  general  discussion  of  sample  selection  before  focus¬ 
ing  on  a  leading  example,  the  bivariate  sample  selection  model  studied  by  Heckman 
(1979).  Another  leading  example,  the  Roy  model,  is  treated  separately  in  Section  16.7. 


16.5.1.  Sample  Selection  Models 

Observational  studies  are  rarely  based  on  pure  random  samples.  Most  often  exogenous 
sampling  is  used  (see  Section  3.2.4)  and  the  usual  estimators  can  be  applied.  If  instead 
a  sample,  intentionally  or  unintentionally,  is  based  in  part  on  values  taken  by  a  depen¬ 
dent  variable,  parameter  estimates  may  be  inconsistent  unless  corrective  measures  are 
taken.  Such  samples  can  be  broadly  defined  as  selected  samples. 

There  are  many  selection  models,  since  there  are  many  ways  that  a  selected  sample 
may  be  generated.  Indeed  it  is  very  easy  to  be  unaware  that  a  selected  sample  is  being 
used.  For  example,  consider  interpretation  of  average  scores  over  time  on  an  achieve¬ 
ment  test  such  as  the  Scholastic  Aptitude  Test,  when  test  taking  is  voluntary.  A  decline 
over  time  may  be  due  to  real  deterioration  in  student  knowledge.  However,  it  may  just 
reflect  the  selection  effect  that  relatively  more  students  have  been  taking  the  test  over 
time  and  the  new  test  takers  are  the  relatively  weaker  students. 

Selection  may  be  due  to  self-selection,  with  the  outcome  of  interest  determined  in 
part  by  individual  choice  of  whether  or  not  to  participate  in  the  activity  of  interest. 
It  can  also  result  from  sample  selection,  with  those  who  participate  in  the  activity  of 
interest  deliberately  oversampled  -  an  extreme  case  being  sampling  only  participants. 
In either case, similar issues arise and selection models are usually called sample selection models.

This chapter presents only three of the many selection models in the literature. The
simplest  model  is  the  Tobit  model  already  presented  in  Section  16.3.  A  prototypical 
commonly  used  model  that  we  call  the  bivariate  sample  selection  model  is  presented  in 
the  remainder  of  this  section.  This  model  generalizes  the  Tobit  model  by  introducing 
a  censoring  latent  variable  that  differs  from  the  latent  variable  generating  the  outcome 
of  interest.  Another  popular  model  called  the  Roy  model  is  presented  in  Section  16.7. 
This  model  considers  an  outcome  that  takes  one  of  two  values  depending  on  the  value 
taken  by  a  censoring  random  variable.  These  models  correspond  to,  respectively,  the 
Tobit  model  types  1,  2,  and  5  in  the  terminology  of  Amemiya  (1985,  p.  384). 

Consistent  estimation  in  the  presence  of  sample  selection  on  unobservables  relies 
on  relatively  strong  distributional  assumptions,  even  in  the  case  of  semiparametric  es¬ 
timation.  Experimental  data  studies  provide  an  attractive  alternative  as  selection  prob¬ 
lems  can  then  be  avoided  by  random  assignment.  However,  experiments  can  be  diffi¬ 
cult  to  implement  in  economics  applications  for  cost  and  ethical  reasons.  The  treatment 
effects  approach,  detailed  in  Chapter  25,  seeks  to  apply  the  experimental  approach  to 
observational  data. 


16.5.2.  A  Bivariate  Sample  Selection  Model  (Type  2  Tobit) 


Let $y_2^*$ denote the outcome of interest. In the standard truncated Tobit model this outcome is observed if $y_2^* > 0$. A more general model introduces a different latent variable, $y_1^*$, and the outcome $y_2^*$ is observed if $y_1^* > 0$. For example, $y_1^*$ determines whether or not to work and $y_2^*$ determines how much to work, and $y_1^* \neq y_2^*$ since there are fixed costs to work such as commuting costs that are more important in determining participation than hours of work once working.

The bivariate sample selection model comprises a participation equation

$$y_1 = \begin{cases} 1 & \text{if } y_1^* > 0, \\ 0 & \text{if } y_1^* \leq 0, \end{cases} \tag{16.29}$$

and a resultant outcome equation

$$y_2 = \begin{cases} y_2^* & \text{if } y_1^* > 0, \\ - & \text{if } y_1^* \leq 0. \end{cases} \tag{16.30}$$

This model specifies that $y_2$ is observed when $y_1^* > 0$, whereas $y_2$ need not take on any meaningful value when $y_1^* \leq 0$. The standard model specifies a linear model with additive errors for the latent variables, so


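$$\begin{aligned} y_1^* &= \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\ y_2^* &= \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2, \end{aligned} \tag{16.31}$$

with problems arising in estimating $\boldsymbol{\beta}_2$ if $\varepsilon_1$ and $\varepsilon_2$ are correlated. The Tobit model is clearly the special case where $y_1^* = y_2^*$.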

There is no generally accepted name for this model. Heckman (1979) used it to illustrate estimation given sample selection. The model is equivalent to a Tobit model with stochastic threshold (Nelson, 1977). Suppose we observe $y_2^*$ if $y_2^* > L^*$, where $y_2^*$ is defined as in (16.31) and the threshold is $L^* = \mathbf{z}'\boldsymbol{\gamma} + v$ rather than $L^* = 0$ in Section 16.3. Then, equivalently, we observe $y_2^*$ if $y_1^* > 0$, where $y_1^* = y_2^* - L^* = (\mathbf{x}_2'\boldsymbol{\beta}_2 - \mathbf{z}'\boldsymbol{\gamma}) + (\varepsilon_2 - v) = \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1$ and where $\mathbf{x}_1$ denotes the union of $\mathbf{x}_2$ and $\mathbf{z}$, and $\boldsymbol{\beta}_1$ and $\varepsilon_1$ are defined in an obvious manner. Amemiya (1985, p. 384) calls the model a type 2 Tobit model. Wooldridge (2002, p. 506) calls the model one with a probit selection equation. Others call this model the generalized Tobit model or the sample selection model, though there are many such models.

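Estimation by ML is straightforward given the additional assumption that the correlated errors are jointly normally distributed and homoskedastic, with

$$\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \sim \mathcal{N}\left[ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right]. \tag{16.32}$$

As for the probit model in Section 14.4.1, the normalization $\sigma_1^2 = 1$ is used since only the sign of $y_1^*$ is observed.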

Given (16.29) and (16.30), for $y_1^* > 0$ we observe $y_2^*$, with probability equal to the probability that $y_1^* > 0$ times the conditional probability of $y_2^*$ given that $y_1^* > 0$. Thus for positive $y_1^*$ the density of the observables is $f^*(y_2^* \mid y_1^* > 0) \times \Pr[y_1^* > 0]$. For $y_1^* \leq 0$ all that is observed is that this event has occurred, and the density is the probability of this event occurring. The bivariate sample selection model therefore has likelihood function

$$L = \prod_{i=1}^{n} \left\{\Pr[y_{1i}^* \leq 0]\right\}^{1 - y_{1i}} \left\{ f(y_{2i} \mid y_{1i}^* > 0) \times \Pr[y_{1i}^* > 0] \right\}^{y_{1i}}, \tag{16.33}$$

where the first term is the discrete contribution when $y_{1i}^* \leq 0$, since then $y_{1i} = 0$, and the second term is the continuous contribution when $y_{1i}^* > 0$. This likelihood function is applicable to quite general models, not just linear models with joint normal errors.

Specializing to linear models with joint normal errors gives a bivariate density $f^*(y_1^*, y_2^*)$ that is normal, leading to a conditional density in the second term that is univariate normal and easily handled. Amemiya (1985, pp. 385-387) provides details, including the exact form of the likelihood function.
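
Under joint normality with the variance of the first error normalized to one, the second term in (16.33) has a closed form: it equals the marginal normal density of the outcome times the normal probability of selection conditional on the outcome error. The sketch below codes this standard expression; it is offered as an assumption-laden illustration rather than a transcription of Amemiya's formula, and the function names and parameter values are hypothetical. A numerical integral is included as a check that the selected contribution integrates to the selection probability.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def selected_contribution(y2, xb1, xb2, sigma2, sigma12):
    """f(y2 | y1* > 0) x Pr[y1* > 0] under joint normality with var(e1) = 1."""
    resid = y2 - xb2
    rho = sigma12 / sigma2                          # corr(e1, e2)
    density_y2 = STD_NORMAL.pdf(resid / sigma2) / sigma2
    # e1 | e2 is normal with mean (sigma12 / sigma2^2) e2 and variance 1 - rho^2.
    cond_index = (xb1 + sigma12 / sigma2**2 * resid) / math.sqrt(1.0 - rho**2)
    return density_y2 * STD_NORMAL.cdf(cond_index)


def loglik(y1, y2, xb1, xb2, sigma2, sigma12):
    """Log of (16.33); y2[i] is ignored whenever y1[i] == 0."""
    total = 0.0
    for d, v, a, b in zip(y1, y2, xb1, xb2):
        if d == 1:
            total += math.log(selected_contribution(v, a, b, sigma2, sigma12))
        else:
            total += math.log(STD_NORMAL.cdf(-a))   # Pr[y1* <= 0]
    return total


def integrate_contribution(xb1, xb2, sigma2, sigma12, steps=4000):
    """Trapezoid-rule integral of the selected contribution over y2."""
    lo, hi = xb2 - 10 * sigma2, xb2 + 10 * sigma2
    h = (hi - lo) / steps
    values = [selected_contribution(lo + i * h, xb1, xb2, sigma2, sigma12)
              for i in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical parameter values.
mass = integrate_contribution(xb1=0.4, xb2=1.0, sigma2=2.0, sigma12=1.2)
print(mass, STD_NORMAL.cdf(0.4))
```

When the covariance is zero the contribution factors into the outcome density times the selection probability, which is the case in which the two equations can be estimated separately.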

The classic early application of this model was to labor supply, where $y_1^*$ is the
unobserved desire or propensity to work, whereas $y_2$ is actual hours worked. The model
is also conceptually more appealing for labor supply than the Tobit model in Section
14.2.1 which required the artifice of “desired” hours of work. This prototypical
application does have the complication that data on a key regressor, the offered wage,
is missing for those individuals who do not work. This complication is handled by
adding an equation for the offered wage and substituting this in, though the model is
then strictly speaking not just a bivariate sample selection model. See Mroz (1987) for
an excellent application to labor supply.


16.5.3.  Conditional  Means  in  the  Bivariate  Sample  Selection  Model 

In this section we obtain the conditional truncated mean in the bivariate sample
selection model. It differs from $x_2'\beta_2$, so that OLS regression of $y_2$ on $x_2$ leads to
inconsistent parameter estimates. Nonetheless, the expression for the conditional mean can be
used to motivate an alternative estimation procedure given in the subsequent section
that relies on weaker distributional assumptions than those of the MLE.

We consider the truncated mean in the sample selectivity model where only positive
values of $y_2$ are used. In general this is

$$\begin{aligned} E[y_2 \mid x, y_1^* > 0] &= E[x_2'\beta_2 + \varepsilon_2 \mid x_1'\beta_1 + \varepsilon_1 > 0] \\ &= x_2'\beta_2 + E[\varepsilon_2 \mid \varepsilon_1 > -x_1'\beta_1], \end{aligned} \qquad (16.34)$$

where $x$ denotes the union of $x_1$ and $x_2$. If the errors $\varepsilon_1$ and $\varepsilon_2$ are independent then
the last term simplifies to $E[\varepsilon_2] = 0$, and OLS regression of $y_2$ on $x_2$ will give a
consistent estimate of $\beta_2$. However, any correlation between the two errors means that the
truncated mean is no longer $x_2'\beta_2$ and we need to account for selection.

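To obtain $E[\varepsilon_2 \mid \varepsilon_1 > -x_1'\beta_1]$ when $\varepsilon_1$ and $\varepsilon_2$ are correlated, Heckman (1979) noted
that if the errors $(\varepsilon_1, \varepsilon_2)$ in (16.31) are joint normal as in (16.32) then Equation (16.36)
in the following implies that

$$\varepsilon_2 = \sigma_{12}\varepsilon_1 + \xi, \qquad (16.35)$$

where the random variable $\xi$ is independent of $\varepsilon_1$. To obtain this result, note that in
general the joint normal distribution

$$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \sim \mathcal{N}\left[ \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right]$$

implies the conditional normal distribution

$$z_2 \mid z_1 \sim \mathcal{N}\left[ \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(z_1 - \mu_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \right],$$

a result that implies that

$$z_2 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(z_1 - \mu_1) + \xi, \qquad (16.36)$$

where $\xi \sim \mathcal{N}[0, \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}]$ is independent of $z_1$. For the joint density given
in (16.32) we have scalars and $\mu_1 = \mu_2 = 0$ and $\sigma_1^2 = 1$, so (16.36) specializes to
(16.35).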

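By using (16.35), the truncated mean (16.34) becomes

$$\begin{aligned} E[y_2 \mid x, y_1^* > 0] &= x_2'\beta_2 + E[(\sigma_{12}\varepsilon_1 + \xi) \mid \varepsilon_1 > -x_1'\beta_1] \\ &= x_2'\beta_2 + \sigma_{12} E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1], \end{aligned}$$

where we use independence of $\xi$ and $\varepsilon_1$. The selection term is similar to that in the
simpler Tobit model and again using the expression for $E[z \mid z > -c]$ in Proposition
16.1 we obtain

$$E[y_2 \mid x, y_1^* > 0] = x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1), \qquad (16.37)$$

where $\lambda(z) = \phi(z)/\Phi(z)$ and we have used $\sigma_1^2 = 1$. Similarly, Proposition 16.1 (iii)
yields the truncated variance

$$V[y_2 \mid x, y_1^* > 0] = \sigma_2^2 - \sigma_{12}^2 \lambda(x_1'\beta_1)\left(x_1'\beta_1 + \lambda(x_1'\beta_1)\right). \qquad (16.38)$$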
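
The truncated moments (16.37) and (16.38) can be checked by simulation. The following is a minimal sketch, assuming fixed scalar values for the two indices $x_1'\beta_1$ and $x_2'\beta_2$; the function names, the seed, and the parameter values in the usage note are illustrative assumptions, and the errors are generated through the decomposition (16.35).

```python
import random
from statistics import NormalDist, fmean, pvariance

STD_NORMAL = NormalDist()

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)

def truncated_moments(index1, index2, sigma12, sigma2):
    """Truncated mean (16.37) and variance (16.38) of y2 given y1* > 0."""
    lam = inverse_mills(index1)
    mean = index2 + sigma12 * lam
    variance = sigma2 ** 2 - sigma12 ** 2 * lam * (index1 + lam)
    return mean, variance

def simulate_selected_y2(index1, index2, sigma12, sigma2, n, seed=0):
    """Draw y2 for observations with y1* > 0, with eps2 = sigma12 * eps1 + xi."""
    rng = random.Random(seed)
    # xi is independent of eps1 with variance sigma2^2 - sigma12^2 (sigma1^2 = 1)
    xi_sd = (sigma2 ** 2 - sigma12 ** 2) ** 0.5
    selected = []
    for _ in range(n):
        eps1 = rng.gauss(0.0, 1.0)
        eps2 = sigma12 * eps1 + rng.gauss(0.0, xi_sd)
        if index1 + eps1 > 0:  # participation: y1* > 0
            selected.append(index2 + eps2)
    return selected
```

For example, `truncated_moments(0.3, 1.0, 0.8, 1.5)` can be compared with `fmean` and `pvariance` of `simulate_selected_y2(0.3, 1.0, 0.8, 1.5, 200000)`; the two should agree up to simulation noise.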

The preceding analysis specifies no value for $y_2$ when $y_1^* \le 0$. In some applications
$y_2$ may equal zero when $y_1^* \le 0$. Then it is meaningful to consider the censored mean.
Conditioning the observable $y_2$ on the unobservables $y_1^*$ and $y_2^*$ and then
unconditioning yields

$$\begin{aligned} E[y_2 \mid x] &= E_{y_1^*}\left[ E[y_2 \mid x, y_1^*] \right] \\ &= \Pr[y_1^* \le 0 \mid x] \times 0 + \Pr[y_1^* > 0 \mid x] \times E[y_2^* \mid x, y_1^* > 0] \\ &= 0 + \Phi(x_1'\beta_1)\left\{ x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1) \right\} \\ &= \Phi(x_1'\beta_1)\, x_2'\beta_2 + \sigma_{12}\phi(x_1'\beta_1), \end{aligned} \qquad (16.39)$$

where the third line uses (16.37) and the last line uses $\lambda(z) = \phi(z)/\Phi(z)$. The censored
variance can be shown to be heteroskedastic.
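
As a hedged numerical illustration of the censored mean, the sketch below evaluates the closed form in the last line of the derivation and compares it with a simulated average in which $y_2$ is set to zero whenever $y_1^* \le 0$. The function names and all parameter values are assumptions made for the example.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def censored_mean(index1, index2, sigma12):
    """E[y2 | x] = Phi(x1'b1) * x2'b2 + sigma12 * phi(x1'b1)."""
    return STD_NORMAL.cdf(index1) * index2 + sigma12 * STD_NORMAL.pdf(index1)

def simulated_censored_mean(index1, index2, sigma12, sigma2, n, seed=0):
    """Average of y2 over all draws, with y2 = 0 when y1* <= 0."""
    rng = random.Random(seed)
    xi_sd = (sigma2 ** 2 - sigma12 ** 2) ** 0.5
    total = 0.0
    for _ in range(n):
        eps1 = rng.gauss(0.0, 1.0)
        if index1 + eps1 > 0:
            # Observed outcome y2* = x2'b2 + eps2, with eps2 = sigma12 * eps1 + xi
            total += index2 + sigma12 * eps1 + rng.gauss(0.0, xi_sd)
        # Otherwise y2 = 0 contributes nothing to the sum
    return total / n
```

Note that $\sigma_2$ affects only the simulation noise, not the censored mean itself, which depends on the errors only through $\sigma_{12}$.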


16.5.4.  Heckman  Two-Step  Estimator 

An important result is that OLS regression of $y_2$ on $x_2$ alone using just the observed
positive values of $y_2$ leads to inconsistent estimation of $\beta_2$ unless the errors are
uncorrelated so that $\sigma_{12} = 0$. This is clear from the truncated mean formula (16.37), which
additionally includes the “regressor” $\lambda(x_1'\beta_1)$.

Heckman’s two-step procedure, sometimes called the Heckit estimator, augments the OLS
regression by an estimate of the omitted regressor $\lambda(x_1'\beta_1)$. Thus using
positive values of $y_2$ estimate by OLS the model

$$y_{2i} = x_{2i}'\beta_2 + \sigma_{12}\lambda(x_{1i}'\hat{\beta}_1) + v_i, \qquad (16.40)$$

where $v$ is an error term, $\hat{\beta}_1$ is obtained by first-step probit regression of $y_1$ on $x_1$ since
$\Pr[y_1^* > 0] = \Phi(x_1'\beta_1)$, and $\lambda(x_1'\hat{\beta}_1) = \phi(x_1'\hat{\beta}_1)/\Phi(x_1'\hat{\beta}_1)$ is the estimated inverse
Mills ratio. This regression does not directly provide an estimate of $\sigma_2^2$, but the
truncated variance formula (16.38) leads to the estimate
$\hat{\sigma}_2^2 = N^{-1} \sum_i [\hat{v}_i^2 + \hat{\sigma}_{12}^2 \hat{\lambda}_i (x_{1i}'\hat{\beta}_1 + \hat{\lambda}_i)]$,
where $\hat{v}_i$ is the OLS residual from (16.40) and $\hat{\lambda}_i = \lambda(x_{1i}'\hat{\beta}_1)$. The correlation
between the two errors in (16.32) can then be estimated by $\hat{\rho} = \hat{\sigma}_{12}/\hat{\sigma}_2$.
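
The two steps can be sketched end to end on simulated data. This is a minimal illustration using only the Python standard library, not a production implementation: the helper names (`solve`, `ols`, `probit`, `heckman_two_step`, `simulate_data`) and the data-generating values are assumptions, the probit is fit by a plain Newton-Raphson iteration, and no standard errors are computed, since the usual OLS ones would be incorrect here.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def probit(rows, d, max_iter=50, tol=1e-10):
    """Probit MLE by Newton-Raphson from a zero starting value."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        neg_hess = [[0.0] * k for _ in range(k)]
        for r, di in zip(rows, d):
            z, q = dot(beta, r), 2 * di - 1
            lam = q * STD_NORMAL.pdf(z) / max(STD_NORMAL.cdf(q * z), 1e-300)
            for i in range(k):
                grad[i] += lam * r[i]
                for j in range(k):
                    neg_hess[i][j] += lam * (lam + z) * r[i] * r[j]
        step = solve(neg_hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def heckman_two_step(y1, y2, x1_rows, x2_rows):
    """First-step probit, then OLS of y2 on x2 and the fitted inverse Mills ratio."""
    beta1 = probit(x1_rows, y1)
    selected = [i for i, d in enumerate(y1) if d == 1]
    index = [dot(beta1, x1_rows[i]) for i in selected]
    lam = [STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z) for z in index]
    rows = [list(x2_rows[i]) + [l] for i, l in zip(selected, lam)]
    ys = [y2[i] for i in selected]
    coef = ols(rows, ys)
    sigma12 = coef[-1]
    resid = [yi - dot(coef, r) for yi, r in zip(ys, rows)]
    # Variance estimate implied by the truncated variance formula (16.38)
    sigma2_sq = sum(v * v + sigma12 ** 2 * l * (z + l)
                    for v, l, z in zip(resid, lam, index)) / len(selected)
    return {"beta1": beta1, "beta2": coef[:-1], "sigma12": sigma12,
            "sigma2": math.sqrt(sigma2_sq), "rho": sigma12 / math.sqrt(sigma2_sq)}

def simulate_data(n, seed=0):
    """Illustrative design: beta1 = (0.5, 1, 1), beta2 = (1, 2), sigma12 = 0.5, sigma2 = 1."""
    rng = random.Random(seed)
    y1, y2, x1_rows, x2_rows = [], [], [], []
    for _ in range(n):
        x, z = rng.gauss(0, 1), rng.gauss(0, 1)  # z is excluded from the outcome equation
        eps1 = rng.gauss(0, 1)
        eps2 = 0.5 * eps1 + rng.gauss(0, math.sqrt(0.75))
        d = 1 if 0.5 + x + z + eps1 > 0 else 0
        y1.append(d)
        y2.append(1.0 + 2.0 * x + eps2 if d else None)
        x1_rows.append([1.0, x, z])
        x2_rows.append([1.0, x])
    return y1, y2, x1_rows, x2_rows
```

Calling `heckman_two_step(*simulate_data(4000, seed=1))` returns estimates that should lie near the assumed true values. The simulated design includes a regressor `z` that enters only the participation equation, which keeps the inverse Mills ratio from being nearly collinear with the outcome regressors.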

A test of whether or not $\sigma_{12} = 0$ or $\rho = 0$ is a test of whether or not the errors are
correlated and sample selection correction is needed. One such test is a Wald test based
on $\hat{\sigma}_{12}$, the estimated coefficient of the inverse Mills ratio.

It is important to note that both the usual OLS standard errors and
heteroskedasticity-robust standard errors reported from the regression (16.40) are
incorrect. Correct formulas for the standard errors take account of two complications
in the second-stage regression. First, even if $\beta_1$ were known, the error in (16.40) is
heteroskedastic from (16.38). Second, in fact $\beta_1$ is replaced by an estimate, a
complication studied in Section 6.6. Formulas for the correct standard errors are given
in Heckman (1979); see also Greene (1981). Section 16.10.2 derives these formulas for
the simpler Tobit model. Implementation is not simple so it is best to use a package
that automatically handles this complication or to use the bootstrap.

The resulting estimator of $\beta_2$ is consistent. Despite an efficiency loss compared to
the MLE under joint normality of the errors that can be quite large, the estimator is
very popular for the following reasons: (1) It is simple to implement; (2) the approach
is applicable to a range of selection models including those given in Section 16.7;
(3) the estimator requires distributional assumptions weaker than joint normality of $\varepsilon_1$
and $\varepsilon_2$; and (4) these distributional assumptions can be weakened even further to permit
semiparametric estimation as in Section 16.9.

The key assumption needed is (16.35), essentially that

$$\varepsilon_2 = \delta\varepsilon_1 + \xi, \qquad (16.41)$$

where $\xi$ is independent of $\varepsilon_1$. This seems to be a quite sensible model. In the case of
expenditures on a durable good, say, this says that the error in the expenditure equation
is a multiple of the error in the purchase decision equation, plus some noise that is
independent of the purchase decision; essentially a linear regression model for the
errors. Given assumption (16.41) the conditional mean (16.34) becomes

$$E[y_2 \mid y_1^* > 0] = x_2'\beta_2 + \delta E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1]. \qquad (16.42)$$

If $\varepsilon_1$ is standard normal distributed this leads to (16.37), the basis for the OLS
regression (16.40).

More generally, Heckman’s two-step method can be applied to (16.42) with
distributions for $\varepsilon_1$ other than normal; see, for example, Olsen (1980). One can also use
semiparametric methods that do not impose a functional form for
$E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1]$ (see Section 16.9).


16.5.5.  Identification  Considerations 

The bivariate sample selection model with normal errors is theoretically identified
without any restriction on the regressors. In particular, exactly the same regressors
can appear in the equations for $y_1^*$ and $y_2^*$.

The model with normally distributed errors is close to unidentified, however, if
exactly the same regressors are used. If $x_1 = x_2$ then $E[y_2 \mid y_1^* > 0] \simeq x_2'\beta_2 + a + b\,x_2'\beta_1$,
using (16.37) and the observation from Section 16.3.2 that the inverse Mills ratio term
$\lambda(\cdot)$ is approximately linear over a wide range of its argument. This leads to
obvious multicollinearity problems, discussed in many articles including those by Nawata
(1993), Nawata and Nagase (1996), and Leung and Yu (1996). Multicollinearity can
be detected using the condition number given in Section 10.4.2, where from (16.40)
the regressors are $x_2$ and $\lambda(x_1'\hat{\beta}_1)$. The problem is less severe the greater the variation
in $x_1'\hat{\beta}_1$ across observations, that is, the better a probit model can discriminate between
participants and nonparticipants.
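
The near-linearity of the inverse Mills ratio, and the way it weakens as the index varies more, can be illustrated numerically. The sketch below is only an illustration under assumed settings: it measures linearity by the correlation between $z$ and $\lambda(z)$ on an evenly spaced grid, and the function names and grid widths are hypothetical choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)

def pearson(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def mills_linearity(half_width, points=121):
    """Correlation between z and lambda(z) on a grid over [-half_width, half_width]."""
    grid = [-half_width + 2 * half_width * i / (points - 1) for i in range(points)]
    return pearson(grid, [inverse_mills(z) for z in grid])
```

Comparing `mills_linearity(0.5)`, `mills_linearity(1.5)`, and `mills_linearity(3.0)` shows the correlation moving away from $-1$ as the range of the index widens, which is the sense in which greater variation in the probit index eases the multicollinearity problem.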

Semiparametric  variants  of  the  Heckman  two-step  method  (see  Section  16.9.3)  do 
require  an  exclusion  restriction.  So  identification  in  the  bivariate  sample  selection 
model  with  normal  errors  is  being  achieved  by  functional  form  assumptions. 

For practical purposes therefore, estimation of the bivariate sample selection model
may require that at least one regressor in the participation equation ($y_1^*$) be excluded
from the outcome equation ($y_2^*$). For example, fixed costs of working unrelated to
hours worked will affect the decision to work but not hours worked. This can be a
great limitation as in many applications, such as that in Section 16.6, it can be very
difficult to make defensible exclusion restrictions.


16.5.6.  Marginal  Effects

The marginal effects in the bivariate sample selection model vary according to whether
we consider the latent variable mean or the truncated mean given in (16.37) or the
censored mean (if it is appropriate).

It is convenient to define $x$ to be the vector formed by union of $x_1$ and $x_2$ and
rewrite $x_1'\beta_1$ as $x'\gamma_1$, and $x_2'\beta_2$ as $x'\gamma_2$. For example, the truncated mean becomes
$E[y_2 \mid x, y_1^* > 0] = x'\gamma_2 + \sigma_{12}\lambda(x'\gamma_1)$. Note that $\gamma_1$ and/or $\gamma_2$ will have some zero entries if
$x_1 \ne x_2$. Differentiating with respect to $x$ yields the marginal effects

$$\begin{aligned} \text{uncensored:} \quad & \partial E[y_2^* \mid x]/\partial x = \gamma_2, \\ \text{truncated (at 0):} \quad & \partial E[y_2 \mid x, y_1 = 1]/\partial x = \gamma_2 - \sigma_{12}\lambda(x'\gamma_1)\left(x'\gamma_1 + \lambda(x'\gamma_1)\right)\gamma_1, \\ \text{censored (at 0):} \quad & \partial E[y_2 \mid x]/\partial x = \gamma_1\phi(x'\gamma_1)\,x'\gamma_2 + \Phi(x'\gamma_1)\gamma_2 - \sigma_{12}\,x'\gamma_1\,\phi(x'\gamma_1)\gamma_1, \end{aligned} \qquad (16.43)$$

where $\lambda(z) = \phi(z)/\Phi(z)$, and we use $\partial\phi(z)/\partial z = -z\phi(z)$ and
$\partial\lambda(z)/\partial z = -z\phi(z)/\Phi(z) - \phi(z)^2/\Phi(z)^2 = -\lambda(z)(z + \lambda(z))$. Interpretation of these three
derivatives is similar to that discussed in some detail in Section 16.3.5. As already noted,
analysis of the censored mean is appropriate only if $y_2$ takes the value of zero when
$y_1 = 0$. In applications such as the log-normal health expenditures example discussed
later there is no censored mean.
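
The three marginal effects can be computed directly and compared with numerical derivatives of the corresponding conditional means. The sketch below is illustrative: the function names and the parameter values used to exercise it are assumptions, and the truncated effect is written with the chain-rule factor $\gamma_1$ that differentiating $\lambda(x'\gamma_1)$ with respect to $x$ produces.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)

def truncated_mean(x, gamma1, gamma2, sigma12):
    """E[y2 | x, y1* > 0] = x'gamma2 + sigma12 * lambda(x'gamma1)."""
    return dot(x, gamma2) + sigma12 * inverse_mills(dot(x, gamma1))

def censored_mean(x, gamma1, gamma2, sigma12):
    """E[y2 | x] = Phi(x'gamma1) * x'gamma2 + sigma12 * phi(x'gamma1)."""
    z = dot(x, gamma1)
    return STD_NORMAL.cdf(z) * dot(x, gamma2) + sigma12 * STD_NORMAL.pdf(z)

def marginal_effects(x, gamma1, gamma2, sigma12):
    """Vectors of uncensored, truncated, and censored marginal effects."""
    z = dot(x, gamma1)
    lam, pdf, cdf = inverse_mills(z), STD_NORMAL.pdf(z), STD_NORMAL.cdf(z)
    uncensored = list(gamma2)
    truncated = [g2 - sigma12 * lam * (z + lam) * g1
                 for g1, g2 in zip(gamma1, gamma2)]
    censored = [g1 * pdf * dot(x, gamma2) + cdf * g2 - sigma12 * z * pdf * g1
                for g1, g2 in zip(gamma1, gamma2)]
    return uncensored, truncated, censored
```

A regressor that appears only in the participation equation has a zero entry in $\gamma_2$ but can still have nonzero truncated and censored marginal effects through $\gamma_1$.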


16.5.7.  Selection  on  Observables  and  on  Unobservables 

There are many modeling situations that can be considered a two-part decision
problem of first engaging in an activity and then determining the level of the activity. These
decisions are intertwined and can be expected to depend on common factors. The
natural model for such data is the bivariate selection model (16.29)-(16.31).

After inclusion of regressors any remaining error ($\varepsilon_1$ and $\varepsilon_2$) in the two processes
may in some cases be uncorrelated. For example, for models of hospitalization it is
possible that, after controlling for observed individual characteristics such as health
status, there is no correlation between the error in the equation determining hospital
admission and the error in the equation determining length of hospital stay. In that
case analysis is straightforward as selection is only based on observables since, for
example, (16.37) simplifies when $\sigma_{12} = 0$. The two pieces can be modeled separately
and the simpler two-part model of Section 16.4 can be used.

In  other  cases  the  errors  may  be  correlated  even  after  inclusion  of  the  regressors. 
For  example,  in  labor  supply  unobserved  factors  that  make  someone  more  likely  to 
work  may  also  make  them  more  likely  to  work  longer  hours  than  would  be  predicted 
by  the  observable  regressors.  One  can  test  whether  there  is  such  correlation  between 
the  errors.  If  there  is  correlation,  then  selection  is  on  unobservables  and  the  methods  of 
this  chapter  come  into  play.  Relatively  strong  distributional  assumptions  are  needed, 
even  with  the  Heckman  two-step  method. 

The study by Duan et al. (1983) summarized in Section 16.4.2 was criticized for
using the two-part model, which is more restrictive than the sample selection model.
This led to considerable debate, with many of the relevant articles referenced in Leung
and Yu (1996), who emphasize the important role of potential correlation of the inverse
Mills ratio term with the remaining regressors.

More generally, selection models such as the bivariate selection model permit
selection on both observables and unobservables, as they permit selection on both
observed regressors and unobserved errors. Such a model is often more simply referred to
as a model of selection on unobservables, with selection on observables implicit. This
chapter emphasizes selection on unobservables.

If  instead  we  have  only  selection  on  observables,  analysis  becomes  much  simpler. 
The  two-part  model  of  this  chapter  is  an  example.  Chapter  25  on  treatment  evaluation 
emphasizes  selection  on  observables  (see  the  discussion  in  Section  25.3.3)  and  details 
methods  such  as  propensity  score  matching. 


16.6.  Selection  Example:  Health  Expenditures 

For  illustration  we  use  data  from  the  RAND  Health  Insurance  Experiment  (RHIE). 
The  data  extract  comes  from  Deb  and  Trivedi  (2002),  who  modeled  the  number  of 
outpatient  visits  to  a  medical  doctor  and  to  all  providers  using  count  data  models. 
Section  20.3  summarizes  the  data  and  Section  20.7  presents  estimates  of  some  standard 
count  models. 

Here instead we model annual health expenditures. The regressors are the same
regressors as defined in detail in Table 20.4. They can be broken down into health
insurance variables (LC, IDP, LPI, and FMDE), socioeconomic characteristics (LINC,
LFAM, AGE, FEMALE, CHILD, FEMCHILD, BLACK, and EDUCDEC) and health
status variables (PHYSLIM, NDISEASE, HLTHG, HLTHF, and HLTHP). The
analysis in Chapter 20 uses four years of data whereas here we use only the second year of
data, yielding 5,574 observations with summary statistics similar to but not exactly the
same as those given in Table 20.4.

The dependent variable y is annual individual health expenditures. An econometric
model needs to take account of two complications: (1) health expenditures are zero
for 23.2% of the sample and (2) the positive health expenditures are very right-skewed,
with a mean of $221 that is much larger than the median of $53. The logarithmic
transformation essentially eliminates this skewness: the mean of 4.07 is close to the
median of 3.96, and the skewness statistic falls from 24.0 to 0.3. The kurtosis is 3.29,
close to the normal value of 3.

We focus on modeling $\ln y$ for those with positive medical expenditures. Possible
models include a two-part model, exposited for log medical expenditures in Section
16.4.2, and a bivariate sample selection model (see Section 16.5.2), where $y_1$ in (16.29)
is an indicator for positive expenditures and $y_2$ in (16.30) is $\ln y$. Note that it is not
meaningful to consider the value of $y_2$ when $y_1 = 0$ because $\ln 0$ is not defined. The
two-part model is a special case of the bivariate sample selection model with $\sigma_{12} = 0$
in (16.32).

Table 16.1 presents results for the health insurance variables and health status
regressors. Socioeconomic variables also included in the regression are omitted from the
table for brevity.

Table 16.1. Health Expenditure Data: Estimates from Two-Part and Selection Models^a

                    Two-Part       Selection Two-Step     Selection MLE
Variable            DMED     LNMED      DMED     LNMED      DMED     LNMED
LC                -0.119    -0.016    -0.119    -0.028    -0.107    -0.076
                 (-4.41)   (-0.52)   (-4.41)   (-0.70)   (-4.03)   (-2.25)
IDP               -0.128    -0.079    -0.128    -0.028    -0.109    -0.150
                 (-2.45)   (-1.28)   (-2.45)   (-0.70)   (-2.13)   (-2.26)
LPI                0.028     0.003     0.028     0.005     0.029     0.015
                  (3.19)    (0.28)    (3.19)    (0.47)    (3.42)    (1.42)
FMDE               0.008    -0.031     0.008    -0.030     0.001    -0.024
                  (0.47)   (-1.69)    (0.47)   (-1.62)    (0.05)   (-1.21)
PHYSLIM            0.273     0.262     0.273     0.281     0.285     0.355
                  (3.67)    (3.81)    (3.67)    (3.50)    (3.94)    (4.70)
NDISEASE           0.022     0.020     0.022     0.022     0.021     0.029
                  (6.25)    (5.78)    (6.25)    (4.29)    (6.03)    (7.54)
HLTHG              0.039     0.144     0.039     0.147     0.058     0.156
                  (0.88)    (2.97)    (0.88)    (3.01)    (1.35)    (2.99)
HLTHF              0.192     0.364     0.192     0.382     0.224     0.445
                  (2.29)    (4.13)    (2.29)    (3.98)    (2.75)    (4.66)
HLTHP              0.640     0.787     0.640     0.833     0.798     0.999
                  (3.01)    (4.63)    (3.01)    (4.22)    (3.90)    (5.32)
ρ                            0.000               0.168               0.736
σ₂                                               1.401               1.570
σ₁₂ = ρσ₂                    0.000               0.236               1.155
                                                (0.47)             (16.43)
−ln L                      10184.1                                 10170.1

a The t-statistics are in parentheses. Regressors also include eight socioeconomic
characteristics. DMED is an indicator for whether or not medical expenditures are positive
and LNMED is the natural logarithm of expenditures if positive. The t-statistics for the
second step of the two-step selection model are based on standard errors that correct for
the first-step estimation used to obtain the fitted inverse Mills ratio term.


We first compare the two-part model estimates with the two-step estimates of the
bivariate sample selection model. The DMED equation estimates are identical as they
are obtained by probit regression of DMED on the same regressors. The LNMED
equation estimates differ because for two-step sample selection the second-step OLS
regression for LNMED additionally includes as a regressor the fitted value of the
inverse Mills ratio term. This additional term is statistically insignificant ($t = 0.47$) and
low in magnitude with implied $\hat{\rho} = 0.168$ that is close to zero. As a result the two
models lead to similar coefficient estimates in the LNMED equation.
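
As a small arithmetic check, the reported correlations can be recovered from the relation $\hat{\rho} = \hat{\sigma}_{12}/\hat{\sigma}_2$. The sketch below assumes that the values 1.401 and 1.570 in Table 16.1 are the estimates of $\sigma_2$ for the two-step and ML estimators, respectively; the function name `implied_rho` is hypothetical.

```python
def implied_rho(sigma12, sigma2):
    """Correlation implied by the covariance sigma12 and sigma2, given sigma1 = 1."""
    return sigma12 / sigma2

# Values read from Table 16.1 (sigma12 = rho * sigma2 row and the assumed sigma2 row)
rho_two_step = implied_rho(0.236, 1.401)
rho_mle = implied_rho(1.155, 1.570)

print(round(rho_two_step, 3), round(rho_mle, 3))
```

The rounded results match the tabulated correlations of 0.168 for the two-step estimator and 0.736 for the MLE.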

As noted in Section 16.4.4 the two-step estimator can perform poorly if the inverse
Mills ratio term is highly correlated with the other regressors. Here this does not appear
to be the case as there is considerable range in the probit model predicted
probabilities, from 0.15 to 0.99, and the condition number (see Section 10.4.4) of the
second-stage regressors, although somewhat high, only doubles from 37
to 82 upon inclusion of the inverse Mills ratio. Although it is still preferable to have
some exclusion restrictions, it is not clear in this application which regressors in the
DMED equation might be reasonably excluded on a priori grounds from the LNMED
equation.

The ML estimates of the bivariate sample selection model differ considerably from
the previous estimates, in both DMED and LNMED equations. The errors in the
latent variable models for DMED and LNMED are highly correlated with estimate
$\hat{\rho} = 0.736$ that is highly statistically significant ($t = 16.43$). The big difference
between the two-step estimates and the ML estimates of $\sigma_{12}$ (or of $\rho$) is best viewed
as signifying a problem with the bivariate sample selection model. Rejection of the
null hypothesis that the estimates have the same probability limit, a Hausman test
given in Section 8.4, can be interpreted as rejection of the additional joint normality
assumption needed to go from two-step estimation to ML estimation of the bivariate
selection model. However, there may be a more fundamental problem that the bivariate
sample selection model with the weaker assumption (16.41) and $\varepsilon_1$ iid normal is also
not reasonable. Such fragility of the bivariate sample selection model is not unusual,
especially if the same regressors are being used in both parts of the model so that
identification is being secured through model specification assumptions. It is compounded
here by use of health expenditure data, which can have quite large outliers so that
errors may not be normal. Even though LNMED has skewness close to 0 and kurtosis
close to 3, as already noted, standard tests of heteroskedasticity, skewness, and kurtosis
resoundingly reject (with p-value 0.000) the null hypothesis that LNMED is normally
distributed.

The regressor of most interest is LC, the natural logarithm of the coinsurance rate,
where the coinsurance rate equals the percentage of health costs paid by the insured
patient. The most statistically significant effect is in determining whether
or not expenditures are positive, rather than on the size of positive expenditures. If all
observations were positive then the coefficient of LC in the regression for LNMED would
equal the price elasticity of demand for health care. In fact in predicting the effect of
changes in price on the conditional truncated mean of log expenditure we need to control
for the effect of those with zero expenditure, as in the second line of (16.43).

In  some  applications  interest  lies  in  prediction  rather  than  estimation  of  marginal 
effects.  This  is  complicated  in  this  example  by  a  desire  to  predict  the  level  rather  than 
the  log  of  expenditure.  Assuming  log-normality,  the  expression  for  the  two-part  model 
is  given  in  (16.28).  Duan  et  al.  (1983)  present  a  method  to  make  predictions  without 
the  log-normality  assumption  that  can  be  viewed  as  a  variant  of  a  bootstrap.  See  also 
Mullahy  (1998). 


16.7.  Roy  Model 

In the bivariate sample selection model the dependent variable for an individual might
not be observed. Thus we observe $y_2$ for an individual if $y_1 = 1$ but may not observe
$y_2$ at all if $y_1 = 0$. In this section we consider a model in which $y_2$ is observed for all
individuals, but in only one of the two possible states. This important model
emphasizes counterfactuals and connects with the program evaluation literature presented
in Chapter 25.


16.7.1.  Roy  Model 

An often-cited article by Roy (1951) considered the consequences for the
occupational distribution of earnings (both mean and variance) when there is individual
heterogeneity in skills and individuals self-select into occupations. The treatment was
relatively general and nonmathematical, though it did assume that individual worker
output in an occupation is log-normally distributed in the absence of selection, and it
did not consider at all estimation of a formal model. During the 1970s a number of
authors independently proposed models for similar situations that were estimable with
cross-section data and considered selection on both observables and unobservables.
Such models have become known as Roy models.

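We define the prototypical Roy model as follows. A latent variable $y_1^*$ determines
whether the outcome observed is $y_2^*$ or $y_3^*$. Specifically, we observe whether $y_1^*$ is
positive or negative,

$$y_1 = \begin{cases} 1 & \text{if } y_1^* > 0, \\ 0 & \text{if } y_1^* \le 0, \end{cases} \qquad (16.44)$$

and observe exactly one of $y_2^*$ and $y_3^*$ according to

$$y = \begin{cases} y_2^* & \text{if } y_1^* > 0, \\ y_3^* & \text{if } y_1^* \le 0. \end{cases} \qquad (16.45)$$

It is customary to specify a linear model with additive errors for the latent variables,
with

$$\begin{aligned} y_1^* &= x_1'\beta_1 + \varepsilon_1, \\ y_2^* &= x_2'\beta_2 + \varepsilon_2, \\ y_3^* &= x_3'\beta_3 + \varepsilon_3. \end{aligned} \qquad (16.46)$$

A model with additive effect is the specialization $x_3'\beta_3 = x_2'\beta_2 + \alpha$. The simplest
parametric model for correlated errors is the joint normal, with

$$\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix} \sim \mathcal{N}\left[ \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix} \right], \qquad (16.47)$$

where as usual the normalization $\sigma_1^2 = 1$ is used as only the sign of $y_1^*$ is observed.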

The log-likelihood function is similar to that for the bivariate sample selection
model of Section 16.5, except that now $y_3^*$ is observed if $y_1^* \le 0$, so the term
$\Pr[y_{1i}^* \le 0]$ in (16.33) is replaced by $f(y_{3i} \mid y_{1i}^* \le 0) \times \Pr[y_{1i}^* \le 0]$.

It is more common to estimate the model using Heckman’s two-step method applied
to the truncated means

$$\begin{aligned} E[y \mid x, y_1^* > 0] &= x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1), \\ E[y \mid x, y_1^* \le 0] &= x_3'\beta_3 - \sigma_{13}\lambda(-x_1'\beta_1), \end{aligned}$$

where $\lambda(z) = \phi(z)/\Phi(z)$ and we have used $\sigma_1^2 = 1$. First-stage probit estimation of
whether or not $y_1^* > 0$ yields an estimate of $\beta_1$ and hence $\lambda(x_1'\hat{\beta}_1)$. Two separate OLS
regressions then lead to direct estimates of $(\beta_2, \sigma_{12})$ and $(\beta_3, \sigma_{13})$. Estimates of $\sigma_2^2$
and $\sigma_3^2$ can then be obtained using the squared residuals from the regressions, similar
to the technique used for the bivariate sample selection model after (16.40). Maddala
(1983, p. 225) provides complete details for this model, which he calls a switching
regression model with endogenous switching. This is also the Tobit type 5 model
presented in Amemiya (1985, p. 399).
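
The two truncated means of the Roy model can be checked by simulation. The following is a minimal sketch under stated assumptions: the indices are fixed scalars, the function names and parameter values are illustrative, and the errors are generated as $\varepsilon_j = \sigma_{1j}\varepsilon_1 + \xi_j$ with independent $\xi_2$ and $\xi_3$, which imposes $\sigma_{23} = \sigma_{12}\sigma_{13}$ purely for convenience.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)

def roy_truncated_means(index1, index2, index3, sigma12, sigma13):
    """E[y | y1* > 0] and E[y | y1* <= 0] under joint normal errors."""
    upper = index2 + sigma12 * inverse_mills(index1)
    lower = index3 - sigma13 * inverse_mills(-index1)
    return upper, lower

def simulate_roy(index1, index2, index3, sigma12, sigma13, sigma2, sigma3, n, seed=0):
    """Return observed outcomes split by regime: (y2* draws, y3* draws)."""
    rng = random.Random(seed)
    xi2_sd = (sigma2 ** 2 - sigma12 ** 2) ** 0.5
    xi3_sd = (sigma3 ** 2 - sigma13 ** 2) ** 0.5
    regime2, regime3 = [], []
    for _ in range(n):
        eps1 = rng.gauss(0.0, 1.0)
        if index1 + eps1 > 0:
            # y1* > 0: the observed outcome is y2*
            regime2.append(index2 + sigma12 * eps1 + rng.gauss(0.0, xi2_sd))
        else:
            # y1* <= 0: the observed outcome is y3*
            regime3.append(index3 + sigma13 * eps1 + rng.gauss(0.0, xi3_sd))
    return regime2, regime3
```

Comparing `fmean` of each simulated regime with `roy_truncated_means` shows the sign pattern of the two correction terms: with positive $\sigma_{12}$ the selected mean of $y_2^*$ lies above $x_2'\beta_2$, whereas with negative $\sigma_{13}$ the selected mean of $y_3^*$ lies above $x_3'\beta_3$.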


16.7.2.  Variations  of  the  Roy  Model 

Many models fall into the class of Roy models. Maddala (1983, Chapter 9) gives
numerous references to what he calls models with self-selectivity. See also Amemiya
(1985, Chapter 10). Here we present a few leading examples.

The bivariate sample selection model can be viewed as a special case where $y_3^*$
is ignored and we only model the truncated moment $E[y_2^* \mid y_1^* > 0]$. Bivariate sample
selection models where $y = 0$ when $y_1^* \le 0$, such as in labor supply applications, can
more directly be viewed as Roy models where we observe either $y = y_2^*$ or $y = 0$, so
$y_3^* = 0$.

In the study of L.-F. Lee (1978), $y_2^*$ and $y_3^*$ denote, respectively, union and nonunion
wage and $y_1^*$ denotes tendency to be a union member. This adds the additional structure
that

$$y_1^* = y_2^* - y_3^* + z'\gamma + \zeta,$$

where $z'\gamma + \zeta$ reflects costs of union membership and is very much in the spirit of Roy
(1951). Substituting for $y_2^*$ and $y_3^*$ yields a reduced form for $y_1^*$:

$$y_1^* = (x_2'\beta_2 - x_3'\beta_3 + z'\gamma) + (\varepsilon_2 - \varepsilon_3 + \zeta).$$

This model is now the same as the earlier model, with correction term $\lambda(x_1'\hat{\beta}_1)$ obtained
by first-step probit regression of $y_1$ on $x_1$, where $x_1$ denotes the unique regressors in
$x_2$, $x_3$, and $z$.

If only the intercept varies across the two possible outcomes, by an amount $\alpha$ say,
then the Roy model reduces to two latent variables

$$\begin{aligned} y_1^* &= x_1'\beta_1 + \varepsilon_1, \\ y^* &= x'\beta + \alpha y_1 + \varepsilon, \end{aligned}$$

where $y = y^*$ is always observed and we also observe the binary variable $y_1$ equal to
one if $y_1^* > 0$ and equal to zero otherwise. This model for $y$ can be viewed as one with
dummy endogenous variable ($y_1$). It can be estimated using the Heckman two-step
estimator applied to the expression for $E[y^* \mid x]$. Alternatively, instrumental variables
estimation can be used, provided an instrument for $y_1$ is available. This requires a
regressor that does not determine the level of the outcome of interest but does determine
which outcome is chosen.

These Roy models are similar to the models studied in the treatment effects
literature. There are two potential outcomes, here $y_2^*$ and $y_3^*$, but we can only observe one
of them. The approach in this chapter has been to create the counterfactual by
making strong distributional assumptions on the distribution of unobservables. Chapter 25
presents alternative methods. See especially Section 25.3 for connections between the
different approaches.


16.8.  Structural  Models

Regression models for selected samples have the feature that the outcome of
interest depends in part on a participation decision that will in turn depend on expected
outcomes. The participation decision and outcomes are simultaneous decisions. The
preceding presentations simplified this interdependence by giving a reduced-form
version of the participation equation. In particular, see the exposition of Lee (1978)
in Section 16.7.2. This is a valid approach, though it is less efficient than working with a
fully structural version.

In this section we explicitly model the interdependence using structural economic
models based on utility maximization, and using structural statistical models that
extend linear simultaneous equations to cover censoring and truncation, including binary
outcomes.


16.8.1.  Structural  Models  Based  on  Utility  Maximization 

Initial  structural  model  research  considered  female  labor  supply.  The  textbook 
model  has  consumers  maximizing  utility,  a  function  of  goods  consumption  and  leisure 
time,  subject  to  a  budget  constraint  and  a  time  constraint  that  available  discretionary 
time  be  allocated  between  leisure  time  and  working  time.  At  an  interior  solution  the 
marginal  rate  of  substitution  (MRS)  between  leisure  and  goods  consumption  equals  the 
wage  rate.  However,  a  corner  solution  where  the  woman  chooses  not  to  work  can  arise 
if  the  MRS  exceeds  the  offered  wage.  Gronau  (1973)  and  Heckman  (1974)  presented 
econometric  models  consistent  with  utility  maximization  that  led  to  Tobit-like  models, 
accounting  for  the  additional  complication  that  the  offered  wage  is  not  observed  for 
women  who  do  not  work.  Subsequent  advances  include  incorporation  of  fixed  costs 
of  work,  leading  to  sample  selection  models,  and  use  of  panel  data,  leading  to  panel 
Tobit  models.  Killingsworth  and  Heckman  (1986)  and  Blundell  and  MaCurdy  (2001) 
provide  surveys  and  Mroz  (1987)  provides  an  application. 

To illustrate the structural approach we summarize the following example. Dubin and McFadden (1984) modeled household consumption of energy (electricity or natural gas) and choice of appliances (such as electric heater or natural gas heater) as interrelated decisions coming from the same utility function. Specifically, it is assumed that for the $j$th of $m$ appliance portfolios household indirect utility is given by

$$V_j = \{\alpha_{0j} + \alpha_1/\beta + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta(y - r_j) + \eta\}e^{-\beta p_1} + \varepsilon_j, \qquad (16.49)$$

where $p_1$ and $p_2$ denote the prices of electricity and gas, $y$ denotes income, and $r_j$ denotes the annualized total life-cycle cost of portfolio $j$ with

$$r_j = p_1 q_{1j} + p_2 q_{2j} + \rho c_j,$$

where $q_{1j}$ and $q_{2j}$ denote the typical electricity and gas consumption by a household with appliance portfolio $j$, $c_j$ is the cost of appliance portfolio $j$, and $\rho$ is the discount rate. Tastes differ across households owing to observable characteristics $\mathbf{w}$, unobservable error $\eta$, and an appliance portfolio specific error $\varepsilon_j$, which is assumed to be independent over $j$ but correlated with $\eta$. In addition, there is a common appliance specific taste factor $\alpha_{0j}$.

Electricity demand $x_1$ given appliance portfolio $j$ equals $-(\partial V_j/\partial p_1)/(\partial V_j/\partial y)$, by Roy's identity, yielding

$$x_1 - q_{1j} = \alpha_{0j} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta(y - r_j) + \eta.$$

To emphasize that choice of appliance portfolio $j$ is endogenous, introduce $m$ mutually exclusive indicator variables $\delta_{jk}$, $k = 1, \dots, m$, where

$$\delta_{jk} = \begin{cases} 1 & \text{if } k = j \\ 0 & \text{if } k \neq j. \end{cases}$$

Then electricity demand $x_1$ given appliance portfolio $j$ is given by

$$x_1 - q_{1j} = \sum_{k=1}^{m} \alpha_{0k}\delta_{jk} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta\Big(y - \sum_{k=1}^{m} r_j\delta_{jk}\Big) + \eta. \qquad (16.50)$$

Even though the model (16.50) is linear, OLS regression yields inconsistent estimates as the result of endogeneity of $\delta_{jk}$. Dubin and McFadden (1984) present two alternative estimation procedures.

An IV approach estimates (16.50) using $\hat{p}_k$ and $r_j\hat{p}_k$ as instruments for $\delta_{jk}$ and $r_j\delta_{jk}$, $k = 1, \dots, m$, where $\hat{p}_k$ are the predicted probabilities of choosing the various appliance portfolios. Here $V_j$ is being used to denote the indirect utility function. It includes both deterministic and stochastic components of utility and corresponds to $U_j$ in the Section 15.5.1 presentation of the ARUM. A similar approach yields

$$\begin{aligned}
p_k &= \Pr[V_k > V_l,\ l \neq k,\ l = 1, \dots, m] \\
&= \Pr[\varepsilon_l - \varepsilon_k < \{(\alpha_{0k} - \alpha_{0l}) - \beta(r_k - r_l)\}e^{-\beta p_1},\ \text{all } l \neq k] \\
&= \frac{\exp[(\alpha_{0k} - \beta r_k)e^{-\beta p_1}\pi/(\lambda\sqrt{3})]}{\sum_{l=1}^{m}\exp[(\alpha_{0l} - \beta r_l)e^{-\beta p_1}\pi/(\lambda\sqrt{3})]}
\end{aligned}$$

under the assumption that the $\varepsilon_j$, $j = 1, \dots, m$, are iid type I extreme value with cdf $F(\varepsilon) = \exp(-\exp(-\gamma - \varepsilon\pi/(\lambda\sqrt{3})))$, where $\gamma \simeq 0.5772$ is Euler's constant. Note that here $\varepsilon_j$ has mean zero and variance $\lambda^2/2$, which differ from those for the parameterization of the type I extreme value distribution used in Chapters 14 and 15. Estimation of a nonlinear multinomial logit model gives predicted probabilities $\hat{p}_k$.
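
The choice probabilities have the familiar multinomial logit (softmax) form, which can be sketched numerically. This is only an illustration: the function name `portfolio_probs` and every numeric value below are hypothetical and are not taken from Dubin and McFadden (1984).

```python
import numpy as np

def portfolio_probs(alpha0, r, beta, p1, lam):
    """Logit probabilities with index (alpha0_k - beta*r_k) * exp(-beta*p1) * pi / (lam*sqrt(3))."""
    scale = np.exp(-beta * p1) * np.pi / (lam * np.sqrt(3.0))
    index = (np.asarray(alpha0, dtype=float) - beta * np.asarray(r, dtype=float)) * scale
    index = index - index.max()  # subtract the maximum for numerical stability
    weights = np.exp(index)
    return weights / weights.sum()

# Hypothetical inputs for m = 2 appliance portfolios
probs = portfolio_probs(alpha0=[0.2, 0.0], r=[1.5, 1.0], beta=0.1, p1=0.5, lam=1.0)
```

In an application the parameters would be estimated by maximum likelihood and the fitted probabilities would then serve as the $\hat{p}_k$ used in the second stage.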

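An alternative sample selection approach notes that $\mathrm{E}[\eta\,|\,\text{portfolio } j \text{ chosen}] \neq 0$ and uses assumptions on the distribution of $\eta$ and $\varepsilon_1, \dots, \varepsilon_m$ to obtain this expectation. Specifically, assume that $\eta|\varepsilon_1, \dots, \varepsilon_m$ is iid with mean $(\sqrt{2}\sigma/\lambda)\sum_{k=1}^{m} R_k\varepsilon_k$ and variance $\sigma^2(1 - \sum_{k=1}^{m} R_k^2)$, where $\sum_{k=1}^{m} R_k = 0$ and $\sum_{k=1}^{m} R_k^2 < 1$ and the distribution of $\varepsilon_k$ has already been given. Then performing some algebra given in Dubin and McFadden yields

$$\mathrm{E}[\eta\,|\,\text{portfolio } j \text{ chosen}] = \sum_{k \neq j} (\sigma\sqrt{6}R_k/\pi)\left[\frac{p_k \ln p_k}{1 - p_k} + \ln p_j\right].$$

A Heckman two-step procedure then estimates by OLS

$$x_1 - q_{1j} = \sum_{k=1}^{m} \alpha_{0k}\delta_{jk} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta\Big(y - \sum_{k=1}^{m} r_j\delta_{jk}\Big) + \sum_{k \neq j} \gamma_k\left[\frac{\hat{p}_k \ln \hat{p}_k}{1 - \hat{p}_k} + \ln \hat{p}_j\right] + \xi,$$

where $\hat{p}_k$ are predicted probabilities from the preceding model for $p_k$, and $\xi$ is an error with asymptotic mean zero.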

Dubin  and  McFadden  estimated  these  models  using  data  on  3,249  households  with 
two  possible  appliance  portfolios:  electric  for  water  and  space  heating  and  gas  for 
water  and  space  heating. 

Related examples include those of Hanemann (1984), who modeled the consumption level of a branded good where consumers consume only one of the possible branded goods in the choice set, and of Cameron et al. (1988), who modeled health service demand conditional on choice of one of a number of mutually exclusive health insurance policies.

Much creativity, evident in the Dubin and McFadden example, can be required to specify a model that yields analytical solutions for both choice probabilities and demand conditional on choice. The advances in computational methods detailed in Chapters 12 and 13 permit estimation of such models even when analytical solutions are not obtained. Nonetheless, results will still be dependent on the assumed utility function and distribution of unobservables.


16.8.2.  Simultaneous  Equations  Tobit  and  Probit  Models 

To illustrate the issues involved in extending the linear SEM approach of Section 2.4 we consider a selection model that depends on two latent variables and introduce simultaneity into the models for the latent variables. A quite general model is

$$\begin{aligned}
y_1^* &= \alpha_1 y_2^* + \gamma_1 y_1 + \delta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \qquad (16.51) \\
y_2^* &= \alpha_2 y_1^* + \gamma_2 y_1 + \delta_2 y_2 + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2,
\end{aligned}$$

where $y_1^*$ and $y_2^*$ are not completely observed but do determine the observed variables $y_1$ and $y_2$, and the errors are assumed to be joint normally distributed. For example, we may observe the binary indicator $y_1 = 1$ if $y_1^* > 0$ and observe $y_2 = y_2^*$ if $y_1^* > 0$. Note that in principle either latent variables or observed outcomes or both may appear as regressors, though identification requires restrictions such as those given in the following.


Endogenous  Latent  Variables 

It is simplest to permit only the latent variables to be regressors in (16.51). Then

$$\begin{aligned}
y_1^* &= \alpha_1 y_2^* + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \qquad (16.52) \\
y_2^* &= \alpha_2 y_1^* + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2.
\end{aligned}$$

The bivariate sample selection model (16.31) is an example that additionally specifies $\alpha_2 = 0$ and directly specifies a reduced form rather than a structural form for the $y_1^*$ equation. Model (16.52) is easily estimated because the reduced form for $y_1^*$ and $y_2^*$ can be obtained in exactly the same way as for regular linear simultaneous equations. This reduced form can then be estimated using methods such as probit or Tobit depending on the way that $y_1$ and $y_2$ are determined given $y_1^*$ and $y_2^*$. The parameters of the structural model (16.52) can then be estimated by replacing the regressors $y_1^*$ and $y_2^*$ by the reduced-form predictions $\hat{y}_1^*$ and $\hat{y}_2^*$.
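
As a worked step, and assuming $\alpha_1\alpha_2 \neq 1$ (an assumption not stated explicitly in the text), substituting the second equation of (16.52) into the first and solving gives the reduced form

$$\begin{aligned}
y_1^* &= \frac{\mathbf{x}_1'\boldsymbol{\beta}_1 + \alpha_1\mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_1 + \alpha_1\varepsilon_2}{1 - \alpha_1\alpha_2}, \\
y_2^* &= \frac{\mathbf{x}_2'\boldsymbol{\beta}_2 + \alpha_2\mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_2 + \alpha_2\varepsilon_1}{1 - \alpha_1\alpha_2}.
\end{aligned}$$

Each latent variable is then a linear function of all exogenous regressors plus a normally distributed composite error, which is why standard probit or Tobit routines can be applied equation by equation.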

Models such as (16.52) are called simultaneous equations Tobit models. A simultaneous equations probit model arises if the observed dependent variables $y_1$ and $y_2$ are binary. Estimators are presented by Nelson and Olson (1978), Amemiya (1979), and Lee, Maddala, and Trost (1980) and a very general treatment for a range of models is given in L.-F. Lee (1981). The standard errors of the estimators can be obtained using the results on sequential two-step m-estimators in Section 6.6. However, it is much simpler to obtain them using the bootstrap pairs procedure presented in Section 11.2. Identification requires exclusion restrictions in (16.51) similar to those for linear simultaneous equations.


Endogenous  Regressors 

A common specialization of the model (16.52) is to a Tobit model with an endogenous regressor that is completely observed. Then $y_2^*$ is fully observed, so $y_2 = y_2^*$, whereas we observe $y_1 = y_1^*$ if $y_1^* > 0$ and $y_1 = 0$ otherwise. The model becomes

$$\begin{aligned}
y_1^* &= \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \qquad (16.53) \\
y_2 &= \mathbf{x}'\boldsymbol{\pi} + v,
\end{aligned}$$

where the first equation is the structural equation of interest and the second equation is the reduced form for the endogenous regressor $y_2$. Again note that here $y_2$ is continuous, not discrete. For joint normal errors $\varepsilon_1 = \gamma v + \xi$, where $\xi$ is an independent normal error (see Section 5.1), so $y_1^* = \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \gamma v + \xi$.

A two-step estimation procedure calculates predicted residuals $\hat{v} = y_2 - \mathbf{x}'\hat{\boldsymbol{\pi}}$ from OLS regression of $y_2$ on $\mathbf{x}$ and then obtains Tobit estimates from the model

$$y_1^* = \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \gamma\hat{v} + e_1,$$

where the error $e_1$ is normally distributed. A test for endogeneity of $y_2$ can be implemented as a Wald test of $\gamma = 0$ using the usual standard errors from a Tobit package. This test is an extension of the auxiliary regression to implement the Hausman endogeneity test in the linear model (see Section 8.4.3). If the null hypothesis is rejected then the aforementioned second-step Tobit regression yields consistent estimates of $\alpha_1$ and $\boldsymbol{\beta}_1$, but standard errors then need to be adjusted because of first-step estimation of the additional regressor $\hat{v}$. See Smith and Blundell (1986) for details for the Tobit model and Rivers and Vuong (1988) for a similar procedure that estimates a probit model at the second step.
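
The two-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Smith and Blundell (1986): the helper names (`tobit_negloglik`, `fit_tobit`, `two_step_tobit`), the simulated design, and all parameter values are hypothetical, and the second-step standard errors are deliberately not computed because they would need the adjustment described above.

```python
import numpy as np
from scipy import optimize, stats

def tobit_negloglik(params, y, X):
    """Negative log-likelihood of a Tobit model left-censored at zero."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    ll = np.where(
        y <= 0,
        stats.norm.logcdf(-xb / sigma),                   # censored observations
        stats.norm.logpdf((y - xb) / sigma) - log_sigma,  # uncensored observations
    )
    return -ll.sum()

def fit_tobit(y, X):
    """Tobit MLE by BFGS, started from OLS on the censored data."""
    start = np.append(np.linalg.lstsq(X, y, rcond=None)[0], np.log(y.std()))
    res = optimize.minimize(tobit_negloglik, start, args=(y, X), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])

def two_step_tobit(y1, y2, X1, X):
    # Step 1: OLS of the endogenous regressor y2 on all exogenous regressors X
    pi_hat = np.linalg.lstsq(X, y2, rcond=None)[0]
    v_hat = y2 - X @ pi_hat
    # Step 2: Tobit of y1 on (y2, X1, v_hat); the last coefficient estimates gamma
    Z = np.column_stack([y2, X1, v_hat])
    return fit_tobit(y1, Z)

# Simulated data (hypothetical design): z is an instrument excluded from X1
rng = np.random.default_rng(0)
n = 5000
x1, z, v, xi = (rng.normal(size=n) for _ in range(4))
X1 = np.column_stack([np.ones(n), x1])
X = np.column_stack([np.ones(n), x1, z])
y2 = X @ np.array([0.5, 0.5, 1.0]) + v
eps1 = 0.8 * v + xi                                   # gamma = 0.8 makes y2 endogenous
y1 = np.maximum(1.0 * y2 + X1 @ np.array([0.5, 1.0]) + eps1, 0.0)

coef, sigma = two_step_tobit(y1, y2, X1, X)           # coef = [alpha1, const, beta_x1, gamma]
```

A coefficient on `v_hat` that is clearly different from zero is the evidence of endogeneity that the Wald test of $\gamma = 0$ formalizes.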

Endogenous Censored or Binary Variables

Analysis is more complicated if the observed censored or binary endogenous variables $y_1$ or $y_2$ appear as regressors in (16.51). Heckman (1978) considered the following model:

$$\begin{aligned}
y_1^* &= \gamma_1 y_1 + \delta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \qquad (16.54) \\
y_2^* &= \alpha_2 y_1^* + \gamma_2 y_1 + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2,
\end{aligned}$$

where we observe $y_1 = 1$ if $y_1^* > 0$ and $y_1 = 0$ if $y_1^* \leq 0$, and we observe $y_2 = y_2^*$ all the time. The complication here is that $y_1$ appears as a regressor. A meaningful reduced form for $y_1^*$ can depend only on $\mathbf{x}_1$ and $\mathbf{x}_2$ and not $y_1$. This imposes the restriction that $\delta_1\gamma_2 + \gamma_1 = 0$, an example of what is called a coherency condition in this literature. Then the reduced form of the model becomes

$$\begin{aligned}
y_1^* &= \mathbf{x}'\boldsymbol{\pi}_1 + v_1, \\
y_2 &= \gamma_2 y_1 + \mathbf{x}'\boldsymbol{\pi}_2 + v_2.
\end{aligned}$$

This is a special case of the Roy model where participation ($y_1 = 1$) leads to only an intercept shift (via $\gamma_2$) in the outcome. In general, models with regressors that include censored or truncated endogenous variables are difficult to estimate. See, for example, Blundell and Smith (1989).


Example 

Brooks, Cameron, and Carter (1998) applied a simultaneous equations Tobit model to explain the vote by congressional representatives on a pro-sugar amendment. The three observed outcomes $y_1$, $y_2$, and $y_3$ were, respectively, the vote (yes or no) and contributions to their campaign funds from sugar interests and (opposing) sweetener-user interests. The first outcome is a binary outcome and the other two outcomes are censored at zero. A simultaneous equations model for the associated latent variables $y_1^*$, $y_2^*$, and $y_3^*$ was specified, so the structural model is of the simpler form (16.52).

How reasonable is this specification? Here campaign contributions $y_2^*$ and $y_3^*$ should depend on the latent variable $y_1^*$ since the actual vote $y_1$ was made at a later date. For $y_1^*$ however, an alternative and more difficult model is that $y_1^*$, the latent variable for the vote, depends on actual contributions received ($y_2$ and $y_3$) rather than on the latent contributions. However, if this is viewed as a game likely to be repeated in the future, a case can be made for using $y_2^*$ and $y_3^*$. Clearly, the reasonableness of such assumptions will vary with the application. Parameter identification was secured by exclusion restrictions on the exogenous regressors. Consistent estimation relies on errors being joint normally distributed.

16.9.  Semiparametric  Estimation 

Censoring, truncation, and sample selection lead to a sample that differs from the population. This is essentially a missing data problem, one that is complicated because data are missing on the dependent variable(s) rather than on exogenous regressors.


The preceding methods solved this missing data problem by making distributional assumptions to obtain either a likelihood function for the sample data or an appropriate censored, truncated, or selected conditional mean.

These methods are fragile to even very minor misspecification of error distributions. For example, both the MLE and the Heckman two-step estimator in the standard Tobit model are inconsistent if errors are normal but heteroskedastic, or if they are homoskedastic but nonnormal. See, for example, Paarsch (1982) and the references therein.

Considerable efforts have been devoted to developing semiparametric estimators that are consistent under weaker distributional assumptions. Before presenting leading examples, however, we note that an alternative is to continue to take a fully parametric approach that is based on richer, more flexible distributional assumptions.

16.9.1.  Flexible  Parametric  Models 

For simplicity begin with the classical Tobit model $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$. The assumption that $\varepsilon_i \sim \mathcal{N}[0, \sigma^2]$ can be relaxed in two ways. First, heteroskedasticity can be incorporated through an explicit model $\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$, where now both $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ need to be estimated. Second, more flexible distributions than the normal distribution might be used. For example, one might use a squared polynomial expansion of the normal (see Section 9.7.7).
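
The first relaxation changes the Tobit log-likelihood only through the observation-specific scale. A minimal sketch, assuming left-censoring at zero and the multiplicative form $\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$, is given below; the function name `hetero_tobit_negloglik` and the tiny data set are hypothetical.

```python
import numpy as np
from scipy import stats

def hetero_tobit_negloglik(beta, gamma, y, X, Z):
    """Negative log-likelihood of a Tobit model, left-censored at zero,
    with sigma_i^2 = exp(z_i' gamma)."""
    sigma = np.exp(0.5 * (Z @ gamma))                     # sigma_i, not sigma_i^2
    xb = X @ beta
    ll_cens = stats.norm.logcdf(-xb / sigma)              # Pr[y* <= 0]
    ll_pos = stats.norm.logpdf((y - xb) / sigma) - np.log(sigma)
    return -np.sum(np.where(y <= 0, ll_cens, ll_pos))

# Two hypothetical observations: one censored, one positive
y = np.array([0.0, 1.0])
X = np.ones((2, 1))
Z = np.ones((2, 1))
nll_unit = hetero_tobit_negloglik(np.array([0.5]), np.array([0.0]), y, X, Z)        # sigma = 1
nll_two = hetero_tobit_negloglik(np.array([0.5]), np.array([np.log(4.0)]), y, X, Z)  # sigma = 2
```

Passing this function to a numerical optimizer over the stacked vector $(\boldsymbol{\beta}', \boldsymbol{\gamma}')'$ would give the ML estimates; setting $\mathbf{z}_i$ to a constant recovers the homoskedastic Tobit model.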

For the bivariate sample selection model a similar approach may be taken, where now a more flexible joint distribution for $(\varepsilon_1, \varepsilon_2)$ is used. Lee (1983) proposed working with transformations $(\varepsilon_1^*, \varepsilon_2^*)$ of $(\varepsilon_1, \varepsilon_2)$ for which the bivariate normality assumption may be more reasonable.

Bayesian methods can also be applied to such models. Chib (1992) considered the censored Tobit model. The latent variables $\mathbf{y}^*$ are introduced as auxiliary variables and the data augmentation approach (see Section 13.7) is used. The Gibbs sampler cycles among (1) the conditional posterior for $\boldsymbol{\beta}|\mathbf{y}, \mathbf{y}^*, \sigma^2$, (2) the conditional posterior for $\sigma^2|\mathbf{y}, \mathbf{y}^*, \boldsymbol{\beta}$, and (3) the posterior for $\mathbf{y}^*|\mathbf{y}, \boldsymbol{\beta}, \sigma^2$.
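
The three-block cycle can be sketched as below. This is an illustrative sampler in the spirit of the data augmentation approach, not a reproduction of Chib (1992): it assumes a flat prior on $\boldsymbol{\beta}$ and the prior $p(\sigma^2) \propto 1/\sigma^2$ (both assumptions made here for simplicity), left-censoring at zero, and a hypothetical simulated data set. The function name `gibbs_tobit` is likewise hypothetical.

```python
import numpy as np
from scipy import stats

def gibbs_tobit(y, X, n_iter=1500, burn=500, seed=0):
    """Gibbs sampler with data augmentation for a Tobit model censored at zero."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    cens = y <= 0
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    y_star = y.astype(float).copy()
    beta = XtX_inv @ X.T @ y_star
    sigma2 = np.var(y_star - X @ beta)
    draws = np.empty((n_iter - burn, k + 1))
    for it in range(n_iter):
        sd = np.sqrt(sigma2)
        # Block (3): y* | y, beta, sigma2 for censored observations is
        # N(x'beta, sigma2) truncated to (-inf, 0], drawn by inverse cdf
        mu = X[cens] @ beta
        u = rng.uniform(1e-12, 1.0, size=mu.size)
        y_star[cens] = mu + sd * stats.norm.ppf(u * stats.norm.cdf(-mu / sd))
        # Block (1): beta | y*, sigma2 ~ N(OLS estimate, sigma2 (X'X)^{-1})
        b_ols = XtX_inv @ X.T @ y_star
        beta = b_ols + sd * (chol @ rng.standard_normal(k))
        # Block (2): sigma2 | y*, beta ~ SSR / chi-square(n)
        resid = y_star - X @ beta
        sigma2 = resid @ resid / rng.chisquare(n)
        if it >= burn:
            draws[it - burn] = np.append(beta, sigma2)
    return draws

# Hypothetical simulated data: intercept 0.5, slope 1.0, sigma2 = 1
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.0]) + rng.normal(size=n), 0.0)
draws = gibbs_tobit(y, X)
post_mean = draws.mean(axis=0)   # posterior means of (beta_0, beta_1, sigma2)
```

Once the latent $\mathbf{y}^*$ are imputed, blocks (1) and (2) are just the standard updates for a linear regression with normal errors, which is what makes the augmentation convenient.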

A flexible parametric approach is particularly advantageous for handling censoring, truncation, and sample selection in nonlinear models such as those for counts and for duration data or mixed types of data, as semiparametric methods are less likely to be available then.


16.9.2.  Semiparametric  Estimation  for  Censored  Models 

We now move on to semiparametric estimation. We consider a linear model for the latent variable $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, which is left-censored at zero so that we observe $y_i = y_i^*$ if $y_i^* > 0$ and $y_i = 0$ if $y_i^* \leq 0$. The semiparametric literature usually expresses the model as

$$y_i = \max(\mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, 0). \qquad (16.55)$$

This is the Tobit model (16.11)-(16.13), except that the distribution of $\varepsilon$ is unspecified. With some adaptation this model also covers left-censoring at a known fixed point other than zero and right-censoring such as for top-coded data. For example, if $y = \min(\mathbf{x}'\boldsymbol{\beta} + \varepsilon, U)$ then $U - y = \max(U - \mathbf{x}'\boldsymbol{\beta} - \varepsilon, 0)$. The goal is to consistently estimate $\boldsymbol{\beta}$ without specifying a complete parametric distribution for $\varepsilon_i$. The estimators are called semiparametric as the uncensored mean $\mathbf{x}_i'\boldsymbol{\beta}$ is parameterized but the error distribution is not. The methods presented in the following differ in the assumptions made on the distribution of $\varepsilon$.
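
The top-coding transformation is easy to verify numerically. The short sketch below uses simulated values; the top-code `U` and the distribution of the latent variable are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(loc=1.0, scale=2.0, size=1000)  # stands in for x'beta + eps
U = 2.5                                             # hypothetical top-code
y = np.minimum(latent, U)                           # right-censored (top-coded) outcome
y_flipped = U - y                                   # now left-censored at zero
```

Estimators written for model (16.55) can then be applied to `y_flipped`, with the signs of the regressors and intercept adjusted accordingly.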

From (16.8) ML estimation is possible given knowledge of the cdf of $y^*$ and hence of $\varepsilon$. The cdf of $\varepsilon$ can be nonparametrically estimated using the Kaplan-Meier product limit estimator for the cdf presented in Chapter 17 for the case of right-censored duration data. Alternatively, the distribution of $\varepsilon$ can be nonparametrically determined using the series expansion of Gallant and Nychka (1987); see Section 9.7.7. These semiparametric ML estimation methods are rarely implemented.

Instead, the literature focuses on estimation based on conditional moments. From (16.20) the conditional censored mean $\mathrm{E}[y|\mathbf{x}]$ is clearly a single-index model with $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}'\boldsymbol{\beta})$, where the function $g(\cdot)$ is unknown if the distribution of $\varepsilon$ is not specified. The single-index methods of Section 9.7.4 can therefore be applied, though as noted there $\boldsymbol{\beta}$ can be estimated only up to location and scale.

A more popular approach considers alternative conditional censored moments that are less altered by censoring. Powell (1984) proposed using the conditional median. The key distributional assumption is that $\varepsilon|\mathbf{x}$ has median zero, in which case the conditional median of the latent variable $y^*|\mathbf{x}$ equals $\mathbf{x}'\boldsymbol{\beta}$ and the conditional median of the censored variable $y|\mathbf{x}$ equals $\max(\mathbf{x}'\boldsymbol{\beta}, 0)$. The intuition for Powell's estimator is most easily obtained by supposing $y$ is iid. If less than half the sample is censored, so that less than half of the observations are zero and more than half are positive, then the censored sample median provides a consistent estimate of the population median. Powell (1984) extended this idea to the regression case, where the same logic follows for those observations for which less than half the observations on $\varepsilon|\mathbf{x}$ are censored, where $\varepsilon = y - \mathbf{x}'\boldsymbol{\beta}$ depends on $\boldsymbol{\beta}$, which needs to be estimated. The regression analogue of median estimation is LAD estimation (see Section 4.6). This leads to the censored least absolute deviations (CLAD) estimator $\hat{\boldsymbol{\beta}}_{\mathrm{CLAD}}$, which minimizes

$$Q_N(\boldsymbol{\beta}) = N^{-1}\sum_{i=1}^{N} |y_i - \max(\mathbf{x}_i'\boldsymbol{\beta}, 0)|. \qquad (16.56)$$

The essential assumption for consistency of this estimator is that $\varepsilon|\mathbf{x}$ has median zero. Given this assumption the estimator is consistent even if errors are conditionally heteroskedastic. The estimator for $\boldsymbol{\beta}$ is $\sqrt{N}$-consistent and asymptotically normal. More efficient estimators can be obtained by weighting the terms in sums by $f(0|\mathbf{x}_i)$, the conditional density of $\varepsilon_i|\mathbf{x}_i$ evaluated at zero. The method can also be extended to conditional quantiles.
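
The objective (16.56) is simple to code, although it is not differentiable, so a derivative-free optimizer is used in the sketch below. This is one convenient choice made here for illustration, not the algorithm of Powell (1984); the function names, the simulated design with heteroskedastic Student-$t$ errors, and the OLS starting value are all assumptions.

```python
import numpy as np
from scipy import optimize

def clad_objective(beta, y, X):
    """CLAD objective (16.56): mean absolute deviation from max(x'beta, 0)."""
    return np.mean(np.abs(y - np.maximum(X @ beta, 0.0)))

def fit_clad(y, X, start):
    """Minimize the CLAD objective by Nelder-Mead (derivative-free)."""
    res = optimize.minimize(
        clad_objective, start, args=(y, X), method="Nelder-Mead",
        options={"xatol": 1e-6, "fatol": 1e-9, "maxiter": 5000},
    )
    return res.x

# Hypothetical data: median-zero errors that are nonnormal and heteroskedastic
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.standard_t(df=3, size=n) * (1.0 + 0.5 * np.abs(X[:, 1]))
y = np.maximum(X @ np.array([1.0, 1.0]) + eps, 0.0)

start = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS on the censored data as a start
beta_clad = fit_clad(y, X, start)
```

Because the objective can have several local minima, trying more than one starting value is advisable in practice.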

An alternative procedure uses a symmetrically trimmed mean, rather than the median, that is also unaffected by censoring. Assume that the distribution of $\varepsilon|\mathbf{x}$ is symmetric. This implies that for observations with positive mean (i.e., $\mathbf{x}'\boldsymbol{\beta} > 0$) $y|\mathbf{x}$ is symmetrically distributed on the interval $(0, 2\mathbf{x}'\boldsymbol{\beta})$. Then either $\mathbf{x}'\boldsymbol{\beta} + \varepsilon < 0$ and $y = 0$ is observed or, with equal probability, $\mathbf{x}'\boldsymbol{\beta} + \varepsilon > 2\mathbf{x}'\boldsymbol{\beta}$ and the data are artificially set to $2\mathbf{x}'\boldsymbol{\beta}$ to preserve the symmetry about $\mathbf{x}'\boldsymbol{\beta}$. We have shown that

$$\mathrm{E}\big[\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)[\min(y, 2\mathbf{x}'\boldsymbol{\beta}) - \mathbf{x}'\boldsymbol{\beta}]\mathbf{x}\big] = \mathbf{0}, \qquad (16.57)$$

where $\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)$ restricts attention to observations with positive mean, and the new dependent variable $\min(y, 2\mathbf{x}'\boldsymbol{\beta})$ equals $0$ if $y = 0$, equals $y$ if $0 < y < 2\mathbf{x}'\boldsymbol{\beta}$, and equals $2\mathbf{x}'\boldsymbol{\beta}$ if $y > 2\mathbf{x}'\boldsymbol{\beta}$. The moment estimator based on (16.57) does not have a unique solution for $\boldsymbol{\beta}$. Powell (1986b) proposed the symmetrically censored least squares (SCLS) estimator that minimizes

$$Q_N(\boldsymbol{\beta}) = N^{-1}\sum_{i=1}^{N}\Big\{[y_i - \max(y_i/2, \mathbf{x}_i'\boldsymbol{\beta})]^2 + \mathbf{1}(y_i > 2\mathbf{x}_i'\boldsymbol{\beta})\big[y_i^2/4 - \{\max(0, \mathbf{x}_i'\boldsymbol{\beta})\}^2\big]\Big\}, \qquad (16.58)$$

which with some algebra can be shown to yield first-order conditions that are the sample analogue of moment condition (16.57). Chay and Honoré (1998) provide a graphical exposition of the trimming for the SCLS estimator, as well as for the related pairwise difference estimators of Honoré and Powell (1994).
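
The moment condition (16.57) and objective (16.58) can be coded directly. The sketch below also includes a simple fixed-point iteration that repeatedly runs least squares on the symmetrically trimmed data; this iteration is an assumption introduced here as one way to search for a solution of the sample analogue of (16.57), it is not described in the text, and it is not guaranteed to converge or to find the global minimizer of (16.58). All function names and the simulated design are hypothetical.

```python
import numpy as np

def scls_moment(beta, y, X):
    """Sample analogue of moment condition (16.57)."""
    xb = X @ beta
    keep = xb > 0
    return ((np.minimum(y, 2 * xb) - xb) * keep) @ X / len(y)

def scls_objective(beta, y, X):
    """SCLS objective (16.58)."""
    xb = X @ beta
    term1 = (y - np.maximum(y / 2, xb)) ** 2
    term2 = (y > 2 * xb) * (y ** 2 / 4 - np.maximum(0.0, xb) ** 2)
    return np.mean(term1 + term2)

def scls_iterate(y, X, start, max_iter=500, tol=1e-8):
    """Fixed-point iteration: OLS of min(y, 2x'b) on x using observations with x'b > 0."""
    beta = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        xb = X @ beta
        keep = xb > 0
        Xk = X[keep]
        y_trim = np.minimum(y[keep], 2 * xb[keep])
        beta_new = np.linalg.solve(Xk.T @ Xk, Xk.T @ y_trim)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical data with symmetric errors
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([1.0, 1.0]) + rng.normal(size=n), 0.0)
beta_scls = scls_iterate(y, X, start=np.linalg.lstsq(X, y, rcond=None)[0])
```

Evaluating `scls_moment` at the returned value is a quick check on whether the iteration has settled on a root of the sample moment condition.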

Melenberg and Van Soest (1996), Chay and Honoré (1998), and Chay and Powell (2001) provide applications of some of these estimators. Pagan and Ullah (1999) provide additional methods and theory.

As an empirical example we applied CLAD estimation to the Section 16.2.1 data that were generated from a Tobit model with normal errors. The slope parameter (set to 1000) was estimated to be 956 (standard error 117) using ML compared to 838 (standard error 165) using CLAD. As expected the CLAD robustness to nonnormality comes at the expense of some loss in efficiency.

16.9.3.  Semiparametric  Estimation  for  Selection  Models 

Semiparametric estimation of sample selection models is more challenging. We consider the most commonly studied model, the bivariate sample selection model defined in Section 16.5.2, where now we relax the assumption that the errors $(\varepsilon_1, \varepsilon_2)$ are joint normally distributed.

Semiparametric ML estimation is possible. In particular Gallant and Nychka (1987) explicitly considered the bivariate sample selection model as a suitable candidate for their series expansion estimator presented in Section 9.7.7.

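The literature instead uses as starting point the expression for the truncated conditional mean, which from (16.34) is given by

$$\begin{aligned}
\mathrm{E}[y_{2i}|\mathbf{x}_i, y_{1i}^* > 0] &= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + \mathrm{E}[\varepsilon_{2i}|\varepsilon_{1i} > -\mathbf{x}_{1i}'\boldsymbol{\beta}_1] \\
&= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + g(\mathbf{x}_{1i}'\boldsymbol{\beta}_1), \qquad (16.59)
\end{aligned}$$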

where the second equality assumes that $\varepsilon_{2i}|\mathbf{x}_i, \varepsilon_{1i}$ has distribution that depends on just $\varepsilon_{1i}$, similar to assumption (16.41). The distribution of $(\varepsilon_1, \varepsilon_2)$ is left unspecified so the
function $g(\cdot)$ is unknown, leading to a semiparametric estimation problem. Since it is possible that $g(\mathbf{x}_1'\boldsymbol{\beta}_1) = \mathbf{x}_1'\boldsymbol{\beta}_1$, identification in this model with $g(\cdot)$ unspecified requires an exclusion restriction that at least one component of $\mathbf{x}_1$ does not appear in $\mathbf{x}_2$. Moreover, the more uncorrelated $\mathbf{x}_1'\boldsymbol{\beta}_1$ is with $\mathbf{x}_2$ the better $\boldsymbol{\beta}_2$ and $g(\cdot)$ can be distinguished. The model (16.59) is a partially linear model, which can be estimated using methods presented in Section 9.7.3. Popular methods include the Robinson (1988a) differencing estimator and using a series expansion for $g(\mathbf{x}_1'\boldsymbol{\beta}_1)$. Since $\boldsymbol{\beta}_1$ is unknown the regression is of $y_{2i}$ on $\mathbf{x}_{2i}'\boldsymbol{\beta}_2 + g(\mathbf{x}_{1i}'\hat{\boldsymbol{\beta}}_1)$, where $\hat{\boldsymbol{\beta}}_1$ can be obtained by regression of the binary outcome $y_{1i}$ on $\mathbf{x}_{1i}$, using one of the semiparametric binary model estimators given in Section 14.7. These methods provide consistent estimates of the slope parameters $\boldsymbol{\beta}_2$. To additionally estimate the intercept, necessary for analysis of the levels rather than changes in $y_2$, see Andrews and Schafgans (1998).

Newey, Powell, and Walker (1990) applied this approach to female labor supply. The participation indicator model was estimated using several different methods and the equation for the outcome $y_2$ was estimated using the method of Robinson (1988a). Melenberg and Van Soest (1996) modeled vacation expenditures using a wide range of semiparametric methods for both the bivariate sample selection and censored regression models. A richer model is provided by Das, Newey, and Vella (2003).

Manski (1989) considered identification in the bivariate sample selection model under relatively minimal assumptions and provided bounds for the mean and for marginal effects, conditional on both regressors and selection.


16.10.  Derivations  for  the  Tobit  Model 

16.10.1.  Truncated  Moments  of  Standard  Normal 

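Consider $z \sim \mathcal{N}[0, 1]$, with density $\phi(z) = (1/\sqrt{2\pi})\exp(-z^2/2)$ and cdf $\Phi(z)$. Since $\Pr[z > c] = 1 - \Phi(c)$, the conditional density of $z|z > c$ is $\phi(z)/(1 - \Phi(c))$. It follows that

$$\begin{aligned}
\mathrm{E}[z|z > c] &= \int_c^{\infty} z\,\big(\phi(z)/[1 - \Phi(c)]\big)\,dz \\
&= \int_c^{\infty} z\,(1/\sqrt{2\pi})\exp(-z^2/2)\,dz \Big/ [1 - \Phi(c)] \\
&= \int_c^{\infty} \frac{\partial}{\partial z}\big(-(1/\sqrt{2\pi})\exp(-z^2/2)\big)\,dz \Big/ [1 - \Phi(c)] \\
&= \big[-(1/\sqrt{2\pi})\exp(-z^2/2)\big]_c^{\infty} \Big/ [1 - \Phi(c)] \\
&= \phi(c)/[1 - \Phi(c)].
\end{aligned}$$

Similarly,

$$\begin{aligned}
\mathrm{E}[z^2|z > c] &= \int_c^{\infty} z^2\,\big(\phi(z)/[1 - \Phi(c)]\big)\,dz \\
&= \int_c^{\infty} z \times z \times (1/\sqrt{2\pi})\exp(-z^2/2)\,dz \Big/ [1 - \Phi(c)] \\
&= \int_c^{\infty} z \times \frac{\partial}{\partial z}\big(-(1/\sqrt{2\pi})\exp(-z^2/2)\big)\,dz \Big/ [1 - \Phi(c)] \\
&= \big[z \times (-1/\sqrt{2\pi})\exp(-z^2/2)\big]_c^{\infty} \Big/ [1 - \Phi(c)] \\
&\quad - \int_c^{\infty} \frac{\partial z}{\partial z} \times \big(-(1/\sqrt{2\pi})\exp(-z^2/2)\big)\,dz \Big/ [1 - \Phi(c)] \\
&= c\phi(c)/[1 - \Phi(c)] + (1 - \Phi(c))/[1 - \Phi(c)] \\
&= c\phi(c)/[1 - \Phi(c)] + 1.
\end{aligned}$$

It follows after a little algebra that

$$\begin{aligned}
\mathrm{V}[z|z > c] &= \mathrm{E}[z^2|z > c] - (\mathrm{E}[z|z > c])^2 \\
&= 1 + c\phi(c)/[1 - \Phi(c)] - \phi(c)^2/[1 - \Phi(c)]^2.
\end{aligned}$$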
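
These closed forms can be checked numerically against a library implementation of the truncated normal distribution. The sketch below uses `scipy.stats.truncnorm` as the reference; the function name `truncated_normal_moments` and the truncation point `c = 0.5` are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

def truncated_normal_moments(c):
    """Mean, second moment, and variance of z | z > c for z ~ N[0, 1]."""
    lam = stats.norm.pdf(c) / stats.norm.sf(c)   # phi(c) / [1 - Phi(c)]
    mean = lam
    second = 1.0 + c * lam
    var = 1.0 + c * lam - lam ** 2
    return mean, second, var

c = 0.5
mean, second, var = truncated_normal_moments(c)
reference = stats.truncnorm(a=c, b=np.inf)       # standard normal truncated to (c, inf)
```

The ratio `lam` is the inverse Mills ratio that appears throughout the chapter as the selection correction term.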


16.10.2. Asymptotic Theory for Heckman's Two-Step Estimator in the Tobit Model

The asymptotic variance matrix of the two-step Heckman estimator is complicated by its dependence on first-step parameter estimates. There are several ways to obtain the asymptotic variance, such as that in Amemiya (1985, pp. 369-370). Here we instead apply the general result for sequential two-step m-estimators given in Section 6.6. We consider the simplest case of the Tobit model (see Section 16.3.6). The methods can be adapted to two-step estimators for the bivariate sample selection model (Section 16.5.4) and simultaneous equations Tobit model (Section 16.8.2). A much simpler, quite different approach is to use the bootstrap pairs procedure (see Section 11.2).
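
A bootstrap pairs procedure is straightforward to sketch. The version below is generic and is illustrated with OLS only to keep the example short; for a two-step estimator the function passed in must recompute both steps on every resample. The names `bootstrap_pairs_se` and `ols`, the number of replications, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def bootstrap_pairs_se(estimator, y, X, n_boot=400, seed=0):
    """Bootstrap-pairs standard errors: resample (y_i, x_i) rows jointly with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # row indices drawn with replacement
        draws.append(estimator(y[idx], X[idx]))
    return np.std(np.asarray(draws), axis=0, ddof=1)

def ols(y, X):
    """Placeholder estimator; a two-step estimator would redo both steps here."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical data
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

se_boot = bootstrap_pairs_se(ols, y, X)

# Classical OLS standard errors for comparison
resid = y - X @ ols(y, X)
se_ols = np.sqrt(np.diag(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)))
```

Because each replication re-estimates everything from the resampled rows, the first-step estimation error is reflected in the resulting standard errors without any analytical derivation.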

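From (16.26) we wish to estimate the parameters $\boldsymbol{\gamma} = [\boldsymbol{\beta}'\ \sigma]'$ in the equation for positive $y_i$:

$$\begin{aligned}
y_i &= \mathbf{x}_i'\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}_i'\boldsymbol{\alpha}) + \eta_i \\
&= \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma} + \eta_i,
\end{aligned}$$

where $\mathbf{w}_i(\boldsymbol{\alpha}) = [\mathbf{x}_i'\ \lambda(\mathbf{x}_i'\boldsymbol{\alpha})]'$ and $\eta_i = y_i - \mathbf{x}_i'\boldsymbol{\beta} - \sigma\lambda(\mathbf{x}_i'\boldsymbol{\alpha})$ is heteroskedastic with variance $\sigma_{\eta_i}^2$ defined in (16.24). The first step of the two-step procedure is to obtain an estimate $\hat{\boldsymbol{\alpha}}$ of the unknown parameter $\boldsymbol{\alpha}$ by probit MLE. It follows that the normal equations for the two parts of the Heckman two-step estimator are

$$\begin{aligned}
\sum_{i=1}^{N} \frac{(d_i - \Phi(\mathbf{x}_i'\boldsymbol{\alpha}))\,\phi(\mathbf{x}_i'\boldsymbol{\alpha})}{\Phi(\mathbf{x}_i'\boldsymbol{\alpha})(1 - \Phi(\mathbf{x}_i'\boldsymbol{\alpha}))}\,\mathbf{x}_i &= \mathbf{0}, \qquad (16.60) \\
\sum_{i=1}^{N} d_i\mathbf{w}_i(\boldsymbol{\alpha})(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma}) &= \mathbf{0},
\end{aligned}$$

where the first equation gives the probit first-order conditions for $\boldsymbol{\alpha}$, and the second equation gives first-order conditions for $\boldsymbol{\gamma}$ for OLS on positive $y_i$ ($d_i = 1$).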

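These equations can be combined as $\sum_{i=1}^{N} \mathbf{h}(\mathbf{x}_i, \hat{\boldsymbol{\theta}}) = \mathbf{0}$ where $\boldsymbol{\theta} = (\boldsymbol{\alpha}', \boldsymbol{\gamma}')'$. By the usual first-order Taylor series expansion $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{G}_0^{-1}\mathbf{S}_0(\mathbf{G}_0^{-1})']$, where $\mathbf{G}_0 = \lim N^{-1}\mathrm{E}[\sum_{i=1}^{N} \partial\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}']$ and $\mathbf{S}_0 = \lim N^{-1}\mathrm{E}[\sum_{i=1}^{N} \mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})']$. We are interested in the subcomponent corresponding to $\boldsymbol{\gamma}$. Simplification occurs because $\partial\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}'$ is block triangular because $\boldsymbol{\gamma}$ does not appear in the first set of equations. Partitioning yields the general result

$$\mathrm{V}[\hat{\boldsymbol{\theta}}_2] = \mathbf{G}_{22}^{-1}\big\{\mathbf{S}_{22} + \mathbf{G}_{21}[\mathbf{G}_{11}^{-1}\mathbf{S}_{11}\mathbf{G}_{11}^{-1}]\mathbf{G}_{21}' - \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{S}_{12} - \mathbf{S}_{21}\mathbf{G}_{11}^{-1}\mathbf{G}_{21}'\big\}\mathbf{G}_{22}^{-1\prime},$$

where the matrices are defined in Section 6.6.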
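Specializing to the problem here, we first consider the terms in $\mathbf{G}_0$. Then

$$\begin{aligned}
\mathbf{G}_{11} &= \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}\left[\frac{\phi^2(\mathbf{x}_i'\boldsymbol{\alpha})}{\Phi(\mathbf{x}_i'\boldsymbol{\alpha})(1 - \Phi(\mathbf{x}_i'\boldsymbol{\alpha}))}\,\mathbf{x}_i\mathbf{x}_i'\right], \\
\mathbf{G}_{21} &= \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}\left[\sigma d_i\mathbf{w}_i\frac{\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})}{\partial\boldsymbol{\alpha}'}\right], \\
\mathbf{G}_{22} &= \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[d_i\mathbf{w}_i\mathbf{w}_i'].
\end{aligned}$$

Signs are suppressed in these expressions because, with $\mathbf{S}_{21} = \mathbf{0}$, each $\mathbf{G}$ matrix enters the variance formula only through quadratic forms. The expression for $\mathbf{G}_{11}$ uses knowledge that $\mathbf{G}_{11}^{-1}$ is just the variance of the probit MLE. The expression for $\mathbf{G}_{21}$ uses

$$\mathrm{E}\left[\frac{\partial\mathbf{h}_{2i}}{\partial\boldsymbol{\theta}_1'}\right] = \mathrm{E}\left[\frac{\partial\, d_i\mathbf{w}_i(\boldsymbol{\alpha})(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})}{\partial\boldsymbol{\alpha}'}\right] = -\mathrm{E}\left[\sigma d_i\mathbf{w}_i(\boldsymbol{\alpha})\frac{\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})}{\partial\boldsymbol{\alpha}'}\right],$$

where the term involving $\partial\mathbf{w}_i(\boldsymbol{\alpha})/\partial\boldsymbol{\alpha}'$ multiplied by $\eta_i$ has expectation zero. The expression for $\mathbf{G}_{22}$ uses

$$\frac{\partial\mathbf{h}_{2i}}{\partial\boldsymbol{\theta}_2'} = \frac{\partial\, d_i\mathbf{w}_i(\boldsymbol{\alpha})(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})}{\partial\boldsymbol{\gamma}'} = -d_i\mathbf{w}_i\mathbf{w}_i'.$$

Turning to $\mathbf{S}_0$ we have

$$\begin{aligned}
\mathbf{S}_{11} &= \mathbf{G}_{11}, \\
\mathbf{S}_{21} &= \mathbf{0}, \\
\mathbf{S}_{22} &= \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[d_i(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})^2\mathbf{w}_i\mathbf{w}_i'].
\end{aligned}$$

The expression for $\mathbf{S}_{11}$ follows by applying the information matrix equality. Taking expectations and some manipulation leads to $\mathbf{S}_{21} = \mathbf{0}$, and $\mathbf{S}_{22}$ follows directly from the variance of $\eta_i$.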

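Combining these results gives the Heckman two-step estimator $\hat{\boldsymbol{\gamma}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\gamma}, \mathbf{V}_{\hat{\boldsymbol{\gamma}}}]$, where

$$\hat{\mathbf{V}}_{\hat{\boldsymbol{\gamma}}} = (\hat{\mathbf{W}}'\hat{\mathbf{W}})^{-1}\big(\hat{\mathbf{W}}'\hat{\boldsymbol{\Sigma}}_{\hat{\eta}}\hat{\mathbf{W}} + \hat{\sigma}^2\hat{\mathbf{W}}'\hat{\mathbf{D}}\hat{\mathbf{V}}_{\hat{\boldsymbol{\alpha}}}\hat{\mathbf{D}}'\hat{\mathbf{W}}\big)(\hat{\mathbf{W}}'\hat{\mathbf{W}})^{-1} \qquad (16.61)$$

and where $\hat{\mathbf{W}}'\hat{\mathbf{W}} = \sum_{i=1}^{N} d_i\hat{\mathbf{w}}_i\hat{\mathbf{w}}_i'$, $\hat{\mathbf{W}}'\hat{\mathbf{D}} = \sum_{i=1}^{N} d_i\hat{\mathbf{w}}_i\,\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})/\partial\boldsymbol{\alpha}'|_{\hat{\boldsymbol{\alpha}}}$, $\hat{\mathbf{V}}_{\hat{\boldsymbol{\alpha}}}$ is the variance matrix for the first-stage probit MLE, and $\hat{\boldsymbol{\Sigma}}_{\hat{\eta}}$ is a diagonal matrix with $i$th entry $\hat{\sigma}_{\eta_i}^2$. The factor $\hat{\sigma}^2$ arises because $\boldsymbol{\alpha}$ enters the second-step equation through $\sigma\lambda(\mathbf{x}_i'\boldsymbol{\alpha})$. This estimate is straightforward to obtain if matrix commands are available. The hardest part can be analytically obtaining $\sigma_{\eta_i}^2 = \mathrm{V}[\eta_i]$ given in (16.24). If this is difficult we can instead use $\hat{\sigma}_{\eta_i}^2 = (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}} - \hat{\sigma}\lambda(\mathbf{x}_i'\hat{\boldsymbol{\alpha}}))^2$ following the approach of White (1980).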


16.11.  Practical  Considerations 

Most major packages include ML estimation of the Tobit model under normality. The two-part model is easy to estimate as one can separately estimate each part. In principle the bivariate sample selection model can be estimated by Heckman's two-step procedure using only a probit and OLS routine. However, the standard errors are difficult to compute owing to the two-step nature of the estimator, and it is much easier to obtain standard errors using a package with Heckman's two-step procedure built in. Implementing semiparametric estimators generally requires specialized code in a programming language such as GAUSS. Some packages also permit ML estimation of censored and truncated variants of other models, such as the Poisson and negative binomial for count data.

Censoring and truncation are easily handled if the specified distribution is viewed
as reasonable. For example, top-coded income data are easily handled if the log-normal
distribution fits the data well. Censored LAD, which relies on much weaker distributional
assumptions, can also be used in this situation.

Much more problematic is handling models with sample selection. The more parametric
versions of these models can rely on distributional assumptions that are felt to
be strong. Semiparametric versions still have to struggle with the identification requirement
that a variable that determines participation does not also determine the outcome
of interest. A more promising route, one often taken in the treatment effects literature,
is to limit attention to cases where it may be reasonable to assume that selection is only
on observables.


16.12.  Bibliographic  Notes 

The literature on models from selected samples is vast. Book-length treatments are provided
by Maddala (1983) and Gourieroux (2000), and shorter summaries are provided by Amemiya
(1984, 1985) and Greene (2003).

16.3  Tobin (1958) proposed and applied the Tobit model to expenditure data. Amemiya
(1973) formally established its consistency and asymptotic normality. Heckman
(1974) provides an excellent female labor supply application with detailed analysis
of results.

16.4  The many studies of the Rand Health Insurance Experiment, such as that by Duan
et al. (1983), are leading applications of the two-part model.

16.5  Heckman (1976, 1979) presented the two-step estimator of the bivariate sample selection
model that is also the basis for many more recent semiparametric estimation
procedures. Mroz (1987) provides an excellent application to female labor supply
that places emphasis on the role of assumptions on wage exogeneity.

16.7  There  are  many  variants  on  the  ideas  of  Roy  (1951),  just  as  there  are  many  variants 
of  the  Tobit  model.  L-F.  Lee  (1978)  provides  a  good  early  application  to  the  union- 
nonunion  wage  differential. 

16.8  The work by Dubin and McFadden (1984) is a leading example of structural microeconometric
analysis based on complete specification of the utility function and the distribution
of unobservables.

16.9  Semiparametric estimation of binary choice models is presented in detail in the books
by M-J. Lee (1996), Horowitz (1997), and Pagan and Ullah (1999) and in surveys by
Vella (1998) and L-F. Lee (2001). Chay and Honore (1998) and Chay and Powell
(2001) provide applications for censored models, and Melenberg and Van Soest
(1996) additionally estimate bivariate sample selection models.


- Exercises - 

16-1  This question considers the impact of different degrees of truncation in the Tobit
model.

(a)  Generate 200 draws of a latent variable $y^* = k + 3x + u$, where $u \sim \mathcal{N}[0, 3]$
and the regressor $x \sim \text{uniform}[0, 1]$. Choose $k$ such that approximately
30% of the $y^*$ are negative.

(b)  Generate  a  censored  or  truncated  subsample  by  excluding  observations 
that  correspond  to  y*  <  0. 

(c)  Estimate  the  model  using  all  2,000  observations,  as  if  the  latent  variable 
were  observable,  by  OLS.  Evaluate  your  results  in  the  light  of  the  theoretical 
properties  of  OLS,  keeping  in  mind  that  you  have  only  one  replication. 

(d)  Using  the  truncated  subsample  of  y  >  0  only,  estimate  the  model  by  OLS. 

(e)  Use  the  truncated  maximum  likelihood  option  to  estimate  the  parameters 
using  all  observations.  Evaluate  your  results  in  light  of  the  properties  of  the 
truncated  MLE.  Compare  with  the  least-squares  results  from  the  previous 
two  parts. 

(f)  Repeat all previous steps using a value of $k$ so as to generate 20, 40, and
50% censored observations. Compare your results with those based on
30% censored observations. Hence suggest how higher levels of censoring
affect the parameter estimates. Reinforce your arguments
using theory where possible.

16-2  Consider a latent variable modeled by $y_i^* = x_i'\beta + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}[0, \sigma^2]$. Suppose
$y_i^*$ is censored from above so that we observe $y_i = y_i^*$ if $y_i^* < U_i$ and
$y_i = U_i$ if $y_i^* \ge U_i$, where the upper limit $U_i$ is a known constant for each individual
(i.e., data) and may differ over individuals.

(a)  Give the log-likelihood function for this model. [Hint: Note that this differs
from the standard case both owing to the presence of $U_i$ and because the inequalities
are reversed with $y_i = y_i^*$ if $y_i^* < U_i$.]

(b)  Obtain the expression for the truncated mean $\mathrm{E}[y_i \mid x_i, y_i < U_i]$. [Hint: For $z \sim \mathcal{N}[0, 1]$,
we have $\mathrm{E}[z \mid z > c] = \phi(c)/[1 - \Phi(c)]$. Also, $\mathrm{E}[z \mid z < c] = -\mathrm{E}[-z \mid -z > -c]$
and $-z \sim \mathcal{N}[0, 1]$.]

(c)  Hence  give  Heckman’s  two-step  estimator  for  this  model. 

(d)  Obtain the expression for the censored mean $\mathrm{E}[y_i \mid x_i]$. [Hint: An essential
part is the answer in part (b).]

16-3  This question considers the consequences of misspecification in the Tobit
model. The starting point is the model of Exercise 16-1.

(a)  Generate $y^*$ with heteroskedasticity by letting $u \sim \mathcal{N}[0, \sigma^2 z]$, where $z > 0$
is chosen to be a suitable positive-valued variable that is correlated with
$x$, though not perfectly so. Again set $k$ to obtain about 30% of censored
observations. Use the MLE for censored normal to estimate this model and
compare your results with the corresponding homoskedastic case.

(b)  Now consider the impact of nonnormality in the sample. Use the simulation
macro available in some packages to carry out a Monte Carlo evaluation
based on a sample of 1,000 observations and 500 replications. In each replication
generate a sample with censored observations such that the errors
are drawn from a mixture of two normals: $\mathcal{N}[1, 9]$ or $\mathcal{N}[0.4, 1]$ with probabilities
0.4 and 0.6, respectively. Estimate the model using the censored
Tobit MLE and compare your results with the normal case. Carry out an
analysis of the Monte Carlo output for the two estimators. Draw appropriate
conclusions about the impact of nonnormality on the distribution of the Tobit
estimator.

16-4  Consider a Poisson regression model where $y_i^*$ has density $f^*(y_i^*) =
e^{-\mu_i} \mu_i^{y_i^*} / y_i^*!$, $y_i^* = 0, 1, 2, \ldots$, and we have independence over $i$. Because of coding
error we only fully observe $y_i^*$ when $y_i^* \ge 2$. When $y_i^* = 0$ or 1 we only observe
that $y_i^* \le 1$. Suppose this is coded as $y_i^* = 1$. Define the observed data
$y_i = y_i^*$ for $y_i^* \ge 2$ and $y_i = 1$ for $y_i^* = 0$ or 1.

(a)  Obtain  the  density  f(y)  of  the  observed  y. 

(b)  Obtain  E[y].  [There  is  some  algebra  here.] 

Now introduce regressors with $\mathrm{E}[y_i^* \mid x_i] = \exp(x_i'\beta)$ and define the indicator
variable $d_i = 1$ for $y_i^* \ge 2$ and $d_i = 0$ for $y_i^* = 0$ or 1.

(c)  Give the exact formula for this example of the objective function of an estimator
that provides a consistent estimator of $\beta$ using data on $y_i$, $d_i$, and
$x_i$.

(d)  Give  the  exact  formula  for  this  example  of  the  objective  function  of  an  es¬ 
timator  that  provides  a  consistent  estimator  of  /3  using  data  on  only  of  and 

x, 

(e)  Is  it  possible  to  consistently  estimate  f3  using  data  on  only  oj  and  x,?  Explain 
your  answer. 

16-5  Using a 50% random subsample of the RAND data on medical expenditure over
a 12-month period used in this chapter, and using a similar model specification,
we wish to consider the following broad question: Which model is appropriate
for modeling the expenditure data?

(a)  Using the data summary of the expenditure variable, analyze the implications
of the high proportion of zero expenditures observed. Is this a violation
of the normality assumption? Is there a transformation of expenditure that
would make the assumption of normality more appropriate?

(b)  Three candidate models are considered, each with the same set of covariates.
These covariates are the same as in the count data Exercise 20.6. The
models are (i) the Tobit model, (ii) the two-part ("hurdle") model (TPM), and
(iii) the selection model. Explain how each one of these will be set up, the
relationship and connections among them, and how one might compare and
choose among them. If you are likely to encounter any specific specification
or estimation problems, state them and suggest how you might handle
them. Pay attention to the choice of exclusion restrictions.

(c)  Estimate in turn the Tobit model, the TPM, and the selection models. For the
TPM you have two equations, and the second is for those who have positive
expenditures only. In the case of the selection model, use both the MLE
and the two-step (Heckman) estimators. Discuss your reasons underlying
the exclusion restriction required in the estimation of the selection model. Is
there evidence that the selection problem is a serious issue?

(d)  How  can  we  compare  the  statistical  fit  of  the  three  models?  Which  model 
appears  to  provide  the  best  fit  to  the  data?  By  what  criterion? 

(e)  Suppose our main interest is in the impact of two variables on expenditure:
log income and log of (1 + coinsurance rate). Use the results of your estimated
Tobit model and TPM to make a comparison between the marginal
impact of a change in these variables on expenditure. Given that there is
considerable heterogeneity in the sample, suggest how to present the results
of your analysis in the most informative manner.

(f)  Briefly explain how quantile regression (see Section 4.6) provides an alternative
method of analyzing the same data. What are the main advantages
and disadvantages of this approach in the present data situation?
Transition Data: Survival Analysis


17.1.  Introduction 

Econometric  models  of  durations  are  models  of  the  length  of  time  spent  in  a  given 
state  before  transition  to  another  state,  such  as  duration  unemployed  or  alive  or  without 
health  insurance.  In  biostatistics  a  duration  in  a  state  is  also  known  as  lifetime  and  the 
time  of  transition  is  referred  to  as  death;  in  operations  research  where  one  often  studies 
lifetimes  of  physical  objects  such  as  light  bulbs  and  machines,  the  end  of  useful  life, 
that  is,  transition  to  useless  life,  is  called  failure  time.  In  econometrics  a  state  is  a 
classification  of  an  individual  entity  at  a  point  in  time,  transition  is  movement  from 
one  state  to  another,  and  a  spell  length  or  duration  is  the  time  spent  in  a  given  state.  A 
typical  regression  example  is  determining  the  effect  of  higher  unemployment  benefit 
levels  on  the  average  length  of  an  unemployment  spell  or  the  probability  of  transition 
out  of  unemployment. 

The literature on this subject can be quite daunting, for a number of reasons. First,
several related distributional functions are of interest and either the duration or probability
of transition may be modeled. Second, many different sampling schemes are
possible and statistical inference depends on both the duration model and the sampling
scheme. For example, sampling methods for data on unemployment duration include
flow sampling of those entering unemployment in a given month, stock sampling of
people unemployed in a given month, and population sampling of all people regardless
of employment status. Third, the data on spell duration are often censored. This is a
major reason for modeling transitions rather than the mean duration, the usual object
of regression analysis, as weaker distributional assumptions are needed to consistently
estimate models of the transitions. Fourth, transition data can be very rich with several
states, such as unemployment, part-time employment, full-time employment, and
out of the labor force, and data for a given individual may be available on multiple
transitions among these states. Fifth, the literature appears in several different applied
areas of statistics with different emphases. Duration analysis or transition analysis
is also called survival analysis (length of time survived) in biostatistics, failure time
analysis (length of time to failure of an item such as a light bulb or a machine part)
in operations research, life table analysis in demography and actuarial studies (where
leaving a state corresponds to death), and hazard analysis in insurance and accident
theory. In the social sciences applications include recidivism, length of marriages, and
interelection duration.

In this chapter we present results for single-spell duration data obtained by flow
sampling. The classic example is modeling survival time, with transition being from
alive to dead, and many of the results come from survival analysis and life table analysis.
This is the most studied example of transition analysis in statistics, and the survival
analysis methods presented in this chapter are implemented in many statistical and
microeconometric packages. The chapter begins with a regression example to outline the
issues raised with survival data.

Sections  17.3-17.5  present  results  without  regressors,  as  many  new  concepts  arise 
even  in  this  case.  Section  17.3  introduces  basic  duration  data  concepts  such  as  the 
hazard,  cumulative  hazard,  and  survivor  functions.  Section  17.4  defines  various  types 
of  censoring,  a  common  complication  in  duration  analysis  because  the  completed  spell 
is  not  always  observed.  For  example,  a  clinical  trial  will  usually  end  before  the  last 
subject  dies.  Section  17.5  presents  nonparametric  estimators  of  the  hazard,  cumulative 
hazard  (Nelson-Aalen  estimator),  and  survivor  functions  (Kaplan-Meier  estimator) 
that  are  consistent  under  independent  censoring. 

The remainder of the chapter extends analysis to regression models, again under
independent censoring. Estimation of fully parametric models, notably the
Weibull model, is presented in Section 17.6. The treatment of censoring is similar
to that given for fully parametric Tobit models. Some important duration models
are given in Section 17.7. An alternative semiparametric approach is to instead
model the hazard function, the probability of death conditional on survival
to date. In his seminal paper, Cox (1972) proposed a method to consistently estimate
a proportional hazards function with independent censoring under relatively
weak distributional assumptions. The Cox model, the standard model for survival
data, is presented in Section 17.8. Unlike most cross-section models, in survival
models regressors such as unemployment benefits in an unemployment duration
model may vary for a given person over the period that the subject is observed.
Models with time-varying regressors are detailed in Section 17.9. Discrete hazards
models are presented in Section 17.10. Section 17.11 presents an empirical
example.

Two  subsequent  chapters  consider  more  complicated  aspects  of  transition  modelling 
that  are  rarely  given  a  textbook  treatment.  These  include  unobserved  heterogeneity, 
multiple  spells,  and  multiple  destinations. 


17.2.  Example:  Duration  of  Strikes 

Consider a data set on the duration of strikes that has been used by Kennan (1985),
Jaggia (1991c), and others. The variable of interest is the duration of strikes in U.S.
manufacturing, measured in number of days from the start of the strike. The sample
has 566 complete (uncensored) observations on strike duration. The average duration
of strike (dur) is 43.6 days, and the median is about 28 days. However, 90 days after
the start of the strike 88 strikes are still in progress.

Figure 17.1: Strike duration: Kaplan-Meier estimate of survival function (horizontal axis:
strike duration in days). Data on completed spells for 566 strikes in the U.S. during 1968-76.

We can show the strike duration information graphically as an empirical survival
function. Figure 17.1 shows on the vertical axis the proportion of strikes started that
are still in progress after a stated number of days. Calendar time is ignored in this
figure, meaning that the different start dates of different strikes play no role in the
construction of the figure. As expected, the function starts at one and monotonically
declines to zero, indicating that all strikes must eventually end.
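
As an illustration of how such a curve is built when no spell is censored, the following minimal Python sketch computes the empirical survival function, the share of spells lasting longer than t. With complete data the Kaplan-Meier estimate reduces to this quantity. The function name and the durations are hypothetical, chosen only for illustration; they are not the strike data.

```python
from bisect import bisect_right


def empirical_survivor(durations):
    """Return S(t) = share of spells with duration > t (complete data only)."""
    ordered = sorted(durations)
    n = len(ordered)

    def survivor(t):
        # bisect_right counts the spells with duration <= t
        return 1.0 - bisect_right(ordered, t) / n

    return survivor


# Hypothetical durations in days (illustrative values, not the strike data)
example_durations = [3, 7, 7, 12, 28, 45, 90, 120]
S = empirical_survivor(example_durations)

for day in (0, 7, 28, 120):
    print(day, S(day))
```

The function starts at one, falls by 1/N at each observed duration (more at ties), and reaches zero at the longest spell.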

Now introduce a regressor variable (z) that measures the deviation of output from its
trend level, an indicator of the business cycle position of the economy. Positive values
of z indicate an above-trend growth period and negative values indicate the converse.
Suppose that our main interest lies in testing whether average strike duration is procyclical
(i.e., $\partial(dur)/\partial z > 0$) or anticyclical (i.e., $\partial(dur)/\partial z < 0$). A simple way to
proceed might be to model the conditional expectation of ln(dur) by a linear regression
of ln(dur) on z. This may serve the purpose if one is testing for the presence of a
positive or negative association between dur and z.

Possibly we might instead be interested in modeling the conditional probability of
a strike. Such a goal could be achieved by a binomial regression with a 0/1 outcome
variable. However, suppose that our interest is in modeling the probability that a strike
that has been in progress for t days will end on day t + 1, or in modeling the conditional
probability of the strike in progress ending, as a function of the length of the
strike, controlling for z; then the previously mentioned regression approaches will be
less direct and less efficient than survival analysis, which also has the additional advantage
that it can handle censored durations. In the next section we will consider
statistical concepts that are used in survival analysis.
17.3.  Basic Concepts

Duration  in  a  state  is  a  nonnegative  random  variable,  denoted  T,  which  in  economic 
data  is  often  a  discrete  random  variable.  For  explaining  the  basic  concepts  we  focus  on 
the  continuous  case,  followed  by  the  discrete  case  later  in  the  chapter. 


17.3.1.  Survivor,  Hazard,  and  Cumulative  Hazard  Functions 

The cumulative distribution function of $T$ is denoted $F(t)$ and the density function
is $f(t) = dF(t)/dt$. Then the probability that the duration or spell length is at most $t$
is

$$F(t) = \Pr[T \le t]. \qquad (17.1)$$

A complementary concept to the cdf is the probability that duration exceeds $t$,
called the survivor function, which is defined by

$$S(t) = \Pr[T > t] = 1 - F(t). \qquad (17.2)$$


The definition of the cdf in (17.1) equals the usual definition, following Kalbfleisch
and Prentice (2002). In the duration analysis literature other authors, such as Lancaster
(1990), instead define $F(t) = \Pr[T < t]$ and hence $S(t) = \Pr[T \ge t]$ because
hazard functions, defined below, condition on $T \ge t$ rather than $T > t$. The particular
definition used will make a difference in the discrete case, considered in Section
17.3.2, at the exact time that a transition occurs.

The survivor function is monotonically declining from one to zero since the cdf
is monotonically increasing from zero. If all individuals at risk of leaving the state
eventually do so then $S(\infty) = 0$. Otherwise, $S(\infty) > 0$ and the duration distribution
is called defective. The mean of a completed spell length is the integral
$\int_0^\infty S(u)\,du$. To obtain this result, use

$$\int_0^\infty u f(u)\,du = \int_0^\infty u\,dF(u) = uF(u)\Big|_0^\infty - \int_0^\infty F(u)\,du.$$

Since $F(\infty) = 1$ and $F(0) = 0$, it follows that

$$\mathrm{E}[T] = \int_0^\infty (1 - F(u))\,du = \int_0^\infty S(u)\,du. \qquad (17.3)$$


The  mean  duration  equals  the  area  under  the  survival  curve. 
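
A quick numerical check of this result can be sketched as follows. The example assumes an exponential duration distribution with a hypothetical exit rate of 0.25, so that the true mean is 1/0.25 = 4, and approximates the area under the survivor function by the trapezoid rule on a truncated range. The function names and the truncation point are illustrative choices only.

```python
import math


def survivor_exponential(t, rate):
    """Survivor function S(t) = exp(-rate * t) of an exponential duration."""
    return math.exp(-rate * t)


def area_under_survivor(survivor, upper, steps=100_000):
    """Trapezoid-rule approximation to the integral of survivor(u) over [0, upper]."""
    width = upper / steps
    total = 0.5 * (survivor(0.0) + survivor(upper))
    for k in range(1, steps):
        total += survivor(k * width)
    return total * width


rate = 0.25  # hypothetical exit rate, so the mean duration is 1 / rate = 4
approx_mean = area_under_survivor(lambda u: survivor_exponential(u, rate), upper=200.0)
print(approx_mean, 1.0 / rate)
```

Truncating the integral at 200 is harmless here because the survivor function is negligible by then; for heavier-tailed distributions the upper limit would need more care.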

Another  key  concept  is  the  hazard  function,  which  is  the  instantaneous  probability 
of  leaving  a  state  conditional  on  survival  to  time  t.  This  is  defined  as 


$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T < t + \Delta t \mid T \ge t]}{\Delta t} = \frac{f(t)}{S(t)}. \qquad (17.4)$$
Table 17.1.  Survival Analysis: Definitions of Key Concepts

Function            Symbol         Definition                                                    Relationships
Density             $f(t)$                                                                       $f(t) = dF(t)/dt$
Distribution        $F(t)$         $\Pr[T \le t]$                                                $F(t) = \int_0^t f(s)\,ds$
Survivor            $S(t)$         $\Pr[T > t]$                                                  $S(t) = 1 - F(t)$
Hazard              $\lambda(t)$   $\lim_{h \to 0} \Pr[t \le T < t + h \mid T \ge t]/h$          $\lambda(t) = f(t)/S(t)$
Cumulative hazard   $\Lambda(t)$   $\int_0^t \lambda(s)\,ds$                                     $\Lambda(t) = -\ln S(t)$

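It is easily verified that the hazard equals the negative of the change in the log-survivor function,

$$\lambda(t) = -\frac{d \ln S(t)}{dt}.$$

The hazard $\lambda(t)$ specifies the distribution of $T$. In particular, integrating $\lambda(t)$ and using
$S(0) = 1$ we can show that

$$S(t) = \exp\left( -\int_0^t \lambda(u)\,du \right). \qquad (17.5)$$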


In regression analysis of transitions the conditional hazard rate, $\lambda(t \mid x)$, is of central
interest. This contrasts with more standard regression approaches in which the conditional
mean function, $\mathrm{E}[T \mid x]$, is of chief interest. The latter approach has the disadvantage
that in practice the durations are often censored.

A final related function is the cumulative hazard function or integrated hazard
function

$$\Lambda(t) = \int_0^t \lambda(s)\,ds = -\ln S(t), \qquad (17.6)$$

where the last equality uses (17.5). If $S(\infty) = 0$ then $\Lambda(\infty) = \infty$. The cumulative
hazard is of interest as it can be more precisely estimated than the hazard
function.

For any choice of distribution of $T$, it can be shown that the transformation $\Lambda(T)$ is
unit exponentially distributed and $\ln \Lambda(T)$ is extreme value distributed, providing the
basis for model specification tests; see Section 18.7.2.
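
The unit exponential property can be illustrated by simulation. The sketch below assumes Weibull durations with survivor function $S(t) = \exp(-(t/a)^b)$, so that $\Lambda(t) = (t/a)^b$; the scale and shape values, the seed, and the sample size are hypothetical choices. If the property holds, the transformed draws should have mean and variance close to one.

```python
import random
import statistics


def weibull_cumulative_hazard(t, scale, shape):
    """Lambda(t) = -ln S(t) = (t / scale) ** shape for S(t) = exp(-(t / scale) ** shape)."""
    return (t / scale) ** shape


rng = random.Random(12345)  # fixed seed so the sketch is reproducible
scale, shape = 30.0, 1.5    # hypothetical Weibull parameters

# random.weibullvariate(alpha, beta) takes the scale first and the shape second
draws = [rng.weibullvariate(scale, shape) for _ in range(20_000)]
transformed = [weibull_cumulative_hazard(t, scale, shape) for t in draws]

mean_lambda = statistics.fmean(transformed)
var_lambda = statistics.pvariance(transformed)
print(mean_lambda, var_lambda)  # both should be close to 1 for a unit exponential
```

In applications the true $\Lambda$ is unknown, so such checks are applied to the estimated cumulative hazard evaluated at the observed durations.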

Various  related  functions  for  the  nonnegative  continuous  random  variable  T  are 
summarized  in  Table  17.1. 

Other functions are also used at times, most notably the Laplace transform $L(s) =
\mathrm{E}[\exp(-sT)]$, $s > 0$, which is a variant of the moment-generating function for random
variable $T$ restricted to be positive.


17.3.2.  Discrete  Data 

It is very common for a duration to be measured as an interval. For example, data may
indicate that a transition occurred in a particular week, but the exact time in the week
is not given. In such cases the transition times are said to be grouped and it is assumed
that the hazard within the interval is constant. Discrete-time hazard models deal with
such data.

The starting point is to define the discrete-time hazard function as the probability
of transition at discrete time $t_j$, $j = 1, 2, \ldots$, given survival to time $t_j$:

$$\lambda_j = \Pr[T = t_j \mid T \ge t_j] = f^d(t_j)/S^d(t_{j-}), \qquad (17.7)$$

where the superscript $d$ denotes discrete, and where $S^d(t_{j-}) = \lim_{t \uparrow t_j} S^d(t)$, an adjustment
made because formally $S^d(t)$ equals $\Pr[T > t]$ rather than $\Pr[T \ge t]$.

The discrete-time survivor function is obtained recursively from the hazard function
as

$$S^d(t) = \Pr[T > t] = \prod_{j \mid t_j \le t} (1 - \lambda_j). \qquad (17.8)$$

For example, $\Pr[T > t_2]$ equals the probability of no transition at time $t_1$ times the
probability of no transition at time $t_2$ conditional on surviving to just before $t_2$, so that
$\Pr[T > t_2] = (1 - \lambda_1) \times (1 - \lambda_2)$. The function $S^d(t)$ is a decreasing step function
with steps at $t_j$, $j = 1, 2, \ldots$.
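
The recursion in (17.8) can be sketched in a few lines of Python. The hazard values below are hypothetical, and the function names are illustrative only. The second function applies (17.7) to recover the probability of a transition at each $t_j$ as the hazard times the survivor function just before $t_j$.

```python
def discrete_survivor(hazards):
    """Given hazards lambda_1, ..., lambda_J at t_1 < ... < t_J, return S^d(t_j) for each j."""
    survivor = []
    surviving = 1.0
    for lam in hazards:
        surviving *= (1.0 - lam)  # multiply in the probability of no transition at t_j
        survivor.append(surviving)
    return survivor


def discrete_pmf(hazards):
    """Pr[T = t_j] = lambda_j * S^d(t_j-), using the survivor function just before t_j."""
    pmf = []
    surviving_before = 1.0
    for lam in hazards:
        pmf.append(lam * surviving_before)
        surviving_before *= (1.0 - lam)
    return pmf


hazards = [0.10, 0.20, 0.50]  # hypothetical discrete-time hazards
survivor_values = discrete_survivor(hazards)
pmf_values = discrete_pmf(hazards)
print(survivor_values)
print(pmf_values)
```

The transition probabilities and the final survivor value sum to one, which is a convenient consistency check on any set of estimated discrete hazards.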

The discrete-time cumulative hazard function is

$$\Lambda^d(t) = \sum_{j \mid t_j \le t} \lambda_j. \qquad (17.9)$$

Using (17.7), we have that the discrete probability that the spell ends at $t_j$ is
$\lambda_j S^d(t_{j-})$.

The continuous and discrete cases can be combined. The survivor function is then
defined using the product integral, which reduces to the regular product (17.8)
in the discrete case and to the exponential of the regular integral (17.5) in the
continuous case. See Kalbfleisch and Prentice (2002, p. 10) or Lancaster (1990,
pp. 10-12).

Discrete duration data may arise because the process generating transitions is intrinsically
discrete. More often, however, the underlying process is continuous but the
data are observed discretely. For example, one may know the week or month in which a
spell ends, but not the day or hour. Such data are sometimes known as grouped data.
The discrete data formulas can be used as follows. Let time be divided into $k + 1$
intervals $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k), [a_k, \infty)$. The discrete time duration $T = t_j$
indicates a transition in the interval $[a_{j-1}, a_j)$, that is, transition at time $a_{j-1}$ or later.
It is customary to treat discrete data as resulting from grouping, so that transitions are
modeled in continuous time and then necessary adjustments are made for grouping.
Further discussion is given in Section 17.10.
17.4.  Censoring

Survival  data  are  usually  censored,  as  some  spells  are  incompletely  observed.  That 
is,  the  lifetimes  are  only  known  to  lie  in  certain  intervals.  As  an  example,  instead 
of  observing  the  length  of  completed  spell  of  unemployment,  data  may  come  from  a 
survey  of  the  currently  unemployed,  so  that  only  the  length  of  an  incomplete  spell  of 
unemployment  is  observed. 


17.4.1.  Censoring  Mechanisms 

In  practice  data  may  be  right-censored,  left-censored,  or  interval-censored.  For  right- 
censoring  or  censoring  from  above,  we  observe  spells  from  time  0  until  a  censoring 
time  c.  Some  spells  will  have  ended  by  this  time  anyway  (completed  spells),  but  others 
will  be  incomplete  and  all  we  know  is  that  they  will  end  some  time  in  the  interval 
(c,  oo).  Left-censoring  or  censoring  from  below  occurs  when  spells  are  known  to  end 
at  some  time  in  the  interval  (0,  c)  but  the  exact  time  is  unknown.  The  classical  Tobit 
model  is  an  example,  where  data  on  some  spells  are  lost  and  the  censoring  time  is 
unknown.  Interval-censoring  occurs  when  the  completed  spell  length  is  observed  but 
only  in  interval  form  such  as  in  [r*,  t*). 

The survival analysis literature has focused on right-censoring. Even with this restriction
there are a variety of possible reasons for censoring, including random censoring,
type I censoring, and type II censoring.

Random censoring or exogenous censoring means that each individual in the
sample has a completed duration $T^*$ and censoring time $C^*$ that are independent of
each other. We observe the completed duration $T^*$ if the spell ends before the censoring
time and the censoring time $C^*$ if the spell ends after the censoring time.
In addition it is known whether or not censoring has occurred. The observed data
$(t_1, \delta_1), (t_2, \delta_2), \ldots, (t_N, \delta_N)$ are realizations of the random variables

$$T_i = \min(T_i^*, C_i^*), \qquad (17.10)$$
$$\delta_i = \mathbf{1}[T_i^* < C_i^*],$$

where the indicator function $\mathbf{1}[A]$ equals one if event $A$ occurs and equals zero otherwise.
Note that $\delta_i$ equals one if a completed spell is observed and equals zero
otherwise. Random censoring may result from causes such as random failure to follow
up a case, individuals randomly dropping out of the study, or termination of the
study.
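
The observation rule in (17.10) can be mimicked by simulation. The following sketch assumes, purely for illustration, that both the latent duration and the censoring time are exponentially distributed and independent; the rates, seed, and function name are hypothetical. Under these assumptions the probability of observing a completed spell is the duration rate divided by the sum of the two rates.

```python
import random


def observe_with_random_censoring(n, duration_rate, censoring_rate, seed=0):
    """Simulate (t_i, delta_i) with t_i = min(T*, C*) and delta_i = 1[T* < C*]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t_star = rng.expovariate(duration_rate)   # latent completed duration T*
        c_star = rng.expovariate(censoring_rate)  # independent censoring time C*
        t_obs = min(t_star, c_star)
        delta = 1 if t_star < c_star else 0       # 1 = completed spell observed
        data.append((t_obs, delta))
    return data


sample = observe_with_random_censoring(5000, duration_rate=0.05, censoring_rate=0.02, seed=42)
share_completed = sum(delta for _, delta in sample) / len(sample)
print(share_completed, 0.05 / (0.05 + 0.02))
```

Only the pairs in `sample` would be available to the analyst; the latent values $T^*$ and $C^*$ are never both observed.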

Type I censoring occurs when durations are censored above a certain fixed known
censoring time, say $t_c$. For example, a sample of light bulbs may be tested for no more
than 5,000 hours, with a common starting time for all items. Thus at the termination of
the study the failure times or durations of some items will be known but other objects
will still not have "failed." Their lifetimes are said to be right-censored. This is a
special case of random censoring, with $C^* = t_c$. The classic Tobit model is an example
of type I censoring from below for a random variable continuous on $(-\infty, \infty)$.
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17.4.2.  Independent  (Noninformative)  Censoring 

For standard survival analysis methods to be valid in the presence of censoring the
censoring mechanism needs to be one with independent (noninformative) censoring.
This means that parameters of the distribution of $C^*$ are not informative about the
parameters of the distribution of the duration $T^*$. Then one may treat the censoring
indicator $\delta$ as exogenous, and it is then not necessary to model the censoring mechanism
if interest lies in the duration model parameters.

For censored data $(t, \delta)$ the uncensored observations are observed with probability

$$\Pr[T = t, \delta = 1] = \Pr[T = t \mid \delta = 1] \times \Pr[\delta = 1].$$

If the censoring mechanism is independent then $\Pr[T = t \mid \delta = 1] = \Pr[T = t]$. If the
censoring is noninformative then the term $\Pr[\delta = 1]$ can be dropped from the likelihood
function as it does not involve parameters of the distribution for $T$. Similarly, for
censored observations,

$$\Pr[T = t, \delta = 0] = \Pr[T > t \mid \delta = 0] \times \Pr[\delta = 0]$$

with $\Pr[T > t \mid \delta = 0] = \Pr[T > t]$ under independent censoring and $\Pr[\delta = 0]$ being
ignored under noninformative censoring. Combining, the density of interest reduces to
$\Pr[T = t]$ when $\delta = 1$ and $\Pr[T > t]$ when $\delta = 0$.

When  regressors  x  are  introduced  it  is  possible  for  T*  and  C*  to  vary  with  the  same 
regressors.  Again  what  matters  is  that  C*  parameters  are  not  informative  about  the  T* 
parameters.  Even  more  simply,  at  any  given  point  in  time,  censoring  must  not  occur 
because  a  subject  has  unusually  high  or  low  risk  of  failure  given  x. 

Type II censoring occurs when observation on $N$ subjects ceases after the $p$th
failure. Then only the durations for the $p$ shortest spells are completely observed,
and the remaining $N - p$ are censored at $C^* = t_{(p)}$, the duration of the $p$th shortest
complete spell. For example, a clinical trial may end after $p$ patients have died.

Random,  type  I,  and  type  II  censoring  are  all  examples  of  independent  censoring. 
A  more  formal  treatment  is  given  in  Kalbfleisch  and  Prentice  (2002,  pp.  194-196). 


17.5.  Nonparametric  Models 

This  section  deals  with  nonparametric  estimation  of  survival  functions.  These  methods 
are  very  useful  for  descriptive  purposes.  It  is  often  insightful  to  know  the  shape  of 
the  raw  (unconditional)  hazard  or  survival  function  before  considering  introducing 
regressors.  The  strike  duration  example  illustrates  the  point. 

We  present  estimators  of  the  survivor,  hazard,  and  cumulative  hazard  functions  in 
the  presence  of  independent  censoring.  Nonparametric  estimation  of  the  density  itself 
is  not  considered  because  of  the  difficulty  introduced  by  censoring;  more  importantly 
the  survivor  and  hazard  functions  are  more  interpretable  than  the  density. 

No regressors are included. If interest lies in just a few key values of regressor(s),
such as different treatment regimes or levels of treatment, then one can obtain separate
nonparametric estimates at each key value and compare them. In economics
applications this is rarely the case and more structural models with regressors,
presented in Sections 17.6-17.10, are needed.

We focus on discrete durations, such as life table data, so that the discrete-time
formulation of Section 17.3.3 is used. Consider, for example, a cohort of $N_0$ individuals
of specific age and gender, which is subsequently tracked for a number of years. At
the end of year 1, there are $N_1$ individuals in the cohort, and $N_0 - N_1$ individuals
from the original cohort have either died or been lost for other reasons (censored).
A year later the size of the cohort is $N_2 \le N_1$, and so forth. Such life table data can
be used to construct a discrete-time survivor function without any prior parametric
assumptions.


17.5.1.  Nonparametric  Estimation 

With no censoring the obvious estimator of the survivor function is one minus the
sample cumulative distribution function. Then $\hat{S}(t)$ equals the number of spells in the
sample of duration greater than $t$, divided by the sample size $N$. This is a step function
with jump at each discrete failure time; see Figure 17.1. An alternative equivalent
representation of this estimator, given momentarily in (17.13), maintains consistency
in the presence of independent censoring.

Let $t_1 < t_2 < \cdots < t_j < \cdots < t_k$ denote the observed discrete failure times of the
spells in a sample of size $N$, $N \ge k$. Define $d_j$ to be the number of spells that end
at time $t_j$. Since the data are discrete $d_j$ may exceed one. Some spells may be
incompletely observed. Define $m_j$ to be the number of spells right-censored in the
interval $[t_j, t_{j+1})$. The censoring mechanism is assumed to be independent censoring,
so the only thing known about a spell censored in $[t_j, t_{j+1})$ is that the failure time is
greater than $t_j$. Spells are at risk of failure if they have not yet failed or been censored.
Define $r_j$ to equal the number of spells at risk at time $t_j-$, that is, just before time
$t_j$. Then $r_j = (d_j + m_j) + \cdots + (d_k + m_k) = \sum_{l \mid l \ge j} (d_l + m_l)$. Note that $r_1 = N$. In
summary,

$$d_j = \#\ \text{spells ending at time } t_j, \tag{17.11}$$

$$m_j = \#\ \text{spells censored in } [t_j, t_{j+1}),$$

$$r_j = \#\ \text{spells at risk at time } t_j- \; = \sum_{l \mid l \ge j} (d_l + m_l).$$

The discrete-time formulation of Section 17.3.2 is used. Since $\lambda_j =
\Pr[T = t_j \mid T \ge t_j]$, an obvious estimator of the hazard function is the number
of spells ending at time $t_j$ divided by the number at risk of failure at time $t_j-$, or

$$\hat{\lambda}_j = \frac{d_j}{r_j}. \tag{17.12}$$

The discrete-time survivor function is defined in (17.8). The Kaplan-Meier estimator
or product limit estimator of the survivor function is the sample analogue

$$\hat{S}(t) = \prod_{j \mid t_j \le t} (1 - \hat{\lambda}_j) = \prod_{j \mid t_j \le t} \frac{r_j - d_j}{r_j}. \tag{17.13}$$

Table 17.2. Hazard Rate and Survivor Function Computation: Example$^a$

$j$   $r_j$   $d_j$   $m_j$   $\hat{\lambda}_j = d_j/r_j$   $\hat{\Lambda}(t_j)$          $\hat{S}(t_j)$
1     80      6       4       6/80                          6/80                          (1 - 6/80)
2     70      5       3       5/70                          6/80 + 5/70                   (1 - 6/80) x (1 - 5/70)
3     62      2       1       2/62                          $\hat{\Lambda}(t_2)$ + 2/62   $\hat{S}(t_2)$ x (1 - 2/62)
4     -       -       -       -                             -                             -

$^a$ At time $t_j$, $r_j$ is the number of observations at risk, $d_j$ is the number of deaths (failures), $m_j$ is the number of
missing spells (censored), $\hat{\lambda}_j$ is the estimated hazard rate, $\hat{\Lambda}(t_j)$ is the estimated cumulative hazard, and $\hat{S}(t_j)$
is the estimated survivor function.


This is a decreasing step function with jump at each discrete failure time. The
Kaplan-Meier estimator can be shown to be the nonparametric MLE (see Kalbfleisch and
Prentice, 2002, pp. 14-16).

In the case of no censoring $\hat{S}(t)$ in (17.13) simplifies to $\hat{S}(t) = r/N$, the number
still at risk at time $t$ divided by the sample size, which is one minus the empirical cdf.
To see this note that $r_j - d_j = r_{j+1}$ if $m_j = 0$, since then the number at risk at time
$t_j$ less the number of deaths at time $t_j$ equals the number at risk at time $t_{j+1}$. Then
(17.13) becomes $\hat{S}(t) = \prod_{j \mid t_j \le t} r_{j+1}/r_j$, which telescopes to $r/r_1$, where $r_1 = N$.

The discrete-time cumulative hazard function is defined in (17.9). The Nelson-Aalen
estimator of the cumulative hazard function is the obvious sample analogue

$$\hat{\Lambda}(t) = \sum_{j \mid t_j \le t} \hat{\lambda}_j = \sum_{j \mid t_j \le t} \frac{d_j}{r_j}. \tag{17.14}$$

This estimator can also be used to estimate the survival function by $\hat{S}(t) =
\exp(-\hat{\Lambda}(t))$, using the continuous case equality $S(t) = \exp(-\Lambda(t))$.

As an illustration, suppose that there are initially 80 observations, with 6 failures at
time $t_1$, 4 spells censored in $[t_1, t_2)$, 5 failures at time $t_2$, 3 spells censored in $[t_2, t_3)$,
2 failures at time $t_3$, 1 spell censored in $[t_3, t_4)$, and so on. Then the estimates for the
cumulative hazard and survivor function for $t \le t_3$ are given in Table 17.2.
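
The arithmetic behind Table 17.2 can be reproduced with exact fractions. The sketch below is illustrative only; the function name `life_table` and its argument layout are assumptions made for this example.

```python
from fractions import Fraction

def life_table(n, deaths, censored):
    """Rows (r_j, d_j, m_j, hazard, cumulative hazard, survivor) from d_j and m_j."""
    rows = []
    at_risk = n                      # r_1 = N
    surv = Fraction(1)
    cum_haz = Fraction(0)
    for d, m in zip(deaths, censored):
        hazard = Fraction(d, at_risk)        # (17.12): d_j / r_j
        cum_haz += hazard                    # (17.14): Nelson-Aalen
        surv *= 1 - hazard                   # (17.13): Kaplan-Meier
        rows.append((at_risk, d, m, hazard, cum_haz, surv))
        at_risk -= d + m                     # failures and censored spells leave the risk set
    return rows

rows = life_table(80, deaths=[6, 5, 2], censored=[4, 3, 1])
```

The first element of each row gives the risk sets 80, 70, and 62 shown in the table.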

Tied data arise when multiple failures occur at a particular point in time. It is
common to assume that ties occur because of grouping, rather than because the process
generates true discrete ties. The hazard estimate $\hat{\lambda}_j = d_j/r_j$ assumes that all deaths
occur simultaneously at time $t_j$. In fact deaths may occur progressively over the
interval $[t_j, t_{j+1})$ and censoring may also occur progressively over this interval. Then
$r_j$ overstates the number of subjects at risk on average over the interval $[t_j, t_{j+1})$. A
standard correction in life table analysis is to replace $\hat{\lambda}_j = d_j/r_j$ by $d_j/(r_j - m_j/2)$,
with similar changes in the formulas for $\hat{S}(t)$, $\hat{\Lambda}(t)$, and so on. Other corrections have
also been proposed.
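
As a small worked example of the life table correction, the sketch below applies $d_j/(r_j - m_j/2)$ to the first interval of the illustration above (6 failures, 80 at risk, 4 censored). The function name `adjusted_hazard` is an assumption made for this example.

```python
from fractions import Fraction

def adjusted_hazard(d, r, m):
    """Life table hazard d_j / (r_j - m_j / 2)."""
    return Fraction(d) / (Fraction(r) - Fraction(m, 2))

unadjusted = Fraction(6, 80)            # d_j / r_j
adjusted = adjusted_hazard(6, 80, 4)    # 6 / (80 - 2) = 6/78
```

The adjusted hazard is slightly larger because the effective risk set is smaller.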

Most survival analysis programs do a good job of producing basic Kaplan-Meier
plots and tables. Table 17.3 provides an abstract of such output for the strike data and
complements Figure 17.1 given earlier.

Table 17.3. Strike Duration: Kaplan-Meier Survivor Function Estimates

Day   Beginning Total   Failures   Survivor Function   Standard Error
  1        566             10           0.9823             0.0055
  2        556             21           0.9452             0.0096
  3        535             16           0.9170             0.0116
  4        519             17           0.8869             0.0133
  5        502             18           0.8551             0.0148
  6        484              9           0.8392             0.0154
  7        475             12           0.8180             0.0162
  8        463             12           0.7968             0.0169
 13        411             11           0.7067             0.0191
 14        400             11           0.6873             0.0195


17.5.2.  Confidence  Bands  for  Nonparametric  Estimates 

The estimate $\hat{\lambda}_j = d_j/r_j$ of the hazard function is very discontinuous, especially for $t$
large as then $r_j$ becomes small. It can be visually useful to first smooth
the hazard estimates, using nonparametric regression methods, see Section 9.5, before
plotting them against time.

The  survivor  and  cumulative  hazard  functions  are  much  smoother,  and  it  is  standard 
to  plot  these  against  time,  along  with  confidence  bands  that  do  reflect  sampling  vari¬ 
ability.  There  are  several  ways  to  estimate  these  confidence  bands.  The  formulas  we 
give  are  those  used  in  STATA. 

For the Kaplan-Meier estimate of the survivor function it is common to use the
Greenwood estimate of the variance

$$\hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j \mid t_j \le t} \frac{d_j}{r_j (r_j - d_j)}.$$

Reported confidence intervals for $S(t)$ are often based on $\ln(-\ln \hat{S}(t))$ rather than
on $\hat{S}(t)$, as this transformation ensures the confidence interval lies in the range of
the survivor function, which is between zero and one. The transformation yields the
$100(1 - \alpha)\%$ confidence interval

$$S(t) \in \left( \hat{S}(t)^{\exp(-z_{\alpha/2} \hat{\sigma}_s(t))},\ \hat{S}(t)^{\exp(z_{\alpha/2} \hat{\sigma}_s(t))} \right), \tag{17.15}$$

where $\sigma_s(t)$ denotes the standard deviation of $\ln(-\ln \hat{S}(t))$, which is estimated using

$$\hat{\sigma}_s^2(t) = \frac{\sum_{j \mid t_j \le t} d_j / (r_j (r_j - d_j))}{\left[ \sum_{j \mid t_j \le t} \ln((r_j - d_j)/r_j) \right]^2}.$$

Table 17.4. Exponential and Weibull Distributions: pdf, cdf, Survivor
Function, Hazard, Cumulative Hazard, Mean, and Variance

Function           Exponential              Weibull
$f(t)$             $\gamma \exp(-\gamma t)$        $\gamma \alpha t^{\alpha-1} \exp(-\gamma t^{\alpha})$
$F(t)$             $1 - \exp(-\gamma t)$           $1 - \exp(-\gamma t^{\alpha})$
$S(t)$             $\exp(-\gamma t)$               $\exp(-\gamma t^{\alpha})$
$\lambda(t)$       $\gamma$                        $\gamma \alpha t^{\alpha-1}$
$\Lambda(t)$       $\gamma t$                      $\gamma t^{\alpha}$
$E[T]$             $\gamma^{-1}$                   $\gamma^{-1/\alpha} \Gamma(\alpha^{-1} + 1)$
$V[T]$             $\gamma^{-2}$                   $\gamma^{-2/\alpha} [\Gamma(2\alpha^{-1} + 1) - [\Gamma(\alpha^{-1} + 1)]^2]$
$\gamma, \alpha$   $\gamma > 0$                    $\gamma > 0$, $\alpha > 0$


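For the Nelson-Aalen estimator of the cumulative hazard function one variance
estimate is

$$\hat{V}[\hat{\Lambda}(t)] = \sum_{j \mid t_j \le t} \frac{d_j}{r_j^2}.$$

The transformation $\ln \hat{\Lambda}(t)$ yields the $100(1 - \alpha)\%$ confidence interval for the
cumulative hazard

$$\Lambda(t) \in \left[ \hat{\Lambda}(t) \exp(-z_{\alpha/2} \hat{\sigma}_{\Lambda}(t)),\ \hat{\Lambda}(t) \exp(z_{\alpha/2} \hat{\sigma}_{\Lambda}(t)) \right], \tag{17.16}$$

where $\sigma_{\Lambda}(t)$ denotes the standard deviation of $\ln \hat{\Lambda}(t)$, which is estimated using

$$\hat{\sigma}_{\Lambda}^2(t) = \hat{V}[\hat{\Lambda}(t)] / [\hat{\Lambda}(t)^2].$$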
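
The variance formulas and the intervals (17.15) and (17.16) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, $z = 1.96$ is assumed for a 95% interval, and each failure time is assumed to have $0 < d_j < r_j$. The Kaplan-Meier call uses the first two rows of Table 17.3; the Nelson-Aalen call uses the first two rows of Table 17.2.

```python
import math

def km_with_bands(at_risk, deaths, z=1.96):
    """Kaplan-Meier estimate with Greenwood s.e. and ln(-ln S) interval.

    Returns one tuple (S, se, lower, upper) per failure time.
    """
    out, surv, gw_sum = [], 1.0, 0.0
    for r, d in zip(at_risk, deaths):
        surv *= (r - d) / r
        gw_sum += d / (r * (r - d))          # running Greenwood sum
        se = surv * math.sqrt(gw_sum)
        sigma = math.sqrt(gw_sum) / abs(math.log(surv))  # s.d. of ln(-ln S)
        # Since 0 < S < 1, the larger exponent gives the lower endpoint.
        lower = surv ** math.exp(z * sigma)
        upper = surv ** math.exp(-z * sigma)
        out.append((surv, se, lower, upper))
    return out

def na_with_bands(at_risk, deaths, z=1.96):
    """Nelson-Aalen estimate with variance and ln(Lambda) interval.

    Returns one tuple (Lambda, variance, lower, upper) per failure time.
    """
    out, cum_haz, var = [], 0.0, 0.0
    for r, d in zip(at_risk, deaths):
        cum_haz += d / r
        var += d / r ** 2
        sigma = math.sqrt(var) / cum_haz     # s.d. of ln(Lambda)
        out.append((cum_haz, var,
                    cum_haz * math.exp(-z * sigma),
                    cum_haz * math.exp(z * sigma)))
    return out

km = km_with_bands([566, 556], [10, 21])
na = na_with_bands([80, 70], [6, 5])
```

Rounded to four decimals, the survivor estimates and standard errors in `km` match the first two rows of Table 17.3.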

17.6.  Parametric  Regression  Models 

We  begin  by  outlining  the  properties  of  two  distributions  that  perform  a  benchmark 
role.  Then  some  standard  regression  models  for  duration  data  are  considered. 


17.6.1.  Exponential  and  Weibull  Distributions 

The natural parametric starting point is the exponential, because a pure Poisson point
process has durations that are exponentially distributed, see Lancaster (1990, p. 86).
The exponential duration distribution has a constant hazard rate $\gamma$ that does not
vary with $t$, the memoryless property of the exponential. It follows from (17.5) that
$S(t) = \exp(-\int_0^t \gamma\, du) = \exp(-\gamma t)$. The density is $f(t) = -S'(t) = \gamma \exp(-\gamma t)$, and
the cumulative hazard $\Lambda(t) = -\ln S(t) = \gamma t$ is linear in $t$.

The exponential is a one-parameter distribution that is too restrictive in practice. A
generalization commonly used in econometrics is the Weibull distribution. Table 17.4
presents the density and other distributional functions and moments for the Weibull and
the exponential, which is the special case $\alpha = 1$. The function $\Gamma(\cdot)$ given in Table
17.5 is the gamma function.

Table 17.5. Standard Parametric Models and Their Hazard and Survivor Functions$^a$

Parametric Model      Hazard Function                                                    Survivor Function                          Type
Exponential           $\gamma$                                                           $\exp(-\gamma t)$                          PH, AFT
Weibull               $\gamma \alpha t^{\alpha-1}$                                       $\exp(-\gamma t^{\alpha})$                 PH, AFT
Generalized Weibull   $\gamma \alpha t^{\alpha-1} S(t)^{-\mu}$                           $[1 - \mu \gamma t^{\alpha}]^{1/\mu}$      PH
Gompertz              $\gamma \exp(\alpha t)$                                            $\exp(-(\gamma/\alpha)(e^{\alpha t} - 1))$ PH
Log-normal            $\frac{\exp(-(\ln t - \mu)^2 / 2\sigma^2)}{t \sigma \sqrt{2\pi}\,[1 - \Phi((\ln t - \mu)/\sigma)]}$   $1 - \Phi((\ln t - \mu)/\sigma)$   AFT
Log-logistic          $\alpha \gamma^{\alpha} t^{\alpha-1} / [1 + (\gamma t)^{\alpha}]$  $1 / [1 + (\gamma t)^{\alpha}]$            AFT
Gamma                 $\frac{\gamma (\gamma t)^{\alpha-1} \exp[-(\gamma t)]}{\Gamma(\alpha)[1 - I(\alpha, \gamma t)]}$      $1 - I(\alpha, \gamma t)$          AFT

$^a$ All the parameters are restricted to be positive, except that $-\infty < \alpha < \infty$ for the Gompertz model.


The Weibull has hazard $\lambda(t) = \gamma \alpha t^{\alpha-1}$, which is monotonically increasing if $\alpha > 1$
and monotonically decreasing if $\alpha < 1$. This is a special case of the proportional
hazards (PH) family, see Section 17.7.1, in which $\lambda(t)$ factors into a baseline
component that depends only on $t$, $\lambda_0(t)$, and a second term (e.g., $\gamma$) that can be
parameterized as a function of covariates only. Figure 17.2 presents properties of the
Weibull distribution with $\gamma = 0.01$ and $\alpha = 1.5$. The density is right-skewed, as is
usually the case with duration data. The shape of the survivor curve is common
to many different distributions, making visual comparison of different estimated
survivor curves difficult. The hazard is increasing for this Weibull example, since
$\alpha > 1$. Other parametric models can have quite different shaped hazard functions,
including monotonically increasing, monotonically decreasing, U-shaped and inverse
U-shaped.

Figure 17.2: Weibull distribution: density, survivor, hazard and cumulative hazard functions
plotted against time for $\gamma = 0.01$ and $\alpha = 1.5$.

The hazard function is often imprecisely estimated in practice, especially in the
right tail. The cumulative hazard $\Lambda(t)$ is more precisely estimated and permits some
discrimination across models. Even better is $\ln \Lambda(t)$ plotted against $\ln t$, since for the
Weibull model $\ln \Lambda(t) = \ln \gamma + \alpha \ln t$ is linear in $\ln t$ with slope $\alpha$.
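
The Weibull functions of Table 17.4 and the log-log linearity of the cumulative hazard can be checked numerically. The sketch below is illustrative only: it uses the Figure 17.2 values $\gamma = 0.01$ and $\alpha = 1.5$, and the function names and the two evaluation points are assumptions made for this example.

```python
import math

GAMMA, ALPHA = 0.01, 1.5    # parameter values used in Figure 17.2

def hazard(t):
    return GAMMA * ALPHA * t ** (ALPHA - 1)

def cum_hazard(t):
    return GAMMA * t ** ALPHA

def survivor(t):
    return math.exp(-cum_hazard(t))      # S(t) = exp(-Lambda(t))

def density(t):
    return hazard(t) * survivor(t)       # f(t) = lambda(t) S(t)

def loglog_slope(t1, t2):
    """Slope of ln Lambda(t) against ln t between two time points."""
    rise = math.log(cum_hazard(t2)) - math.log(cum_hazard(t1))
    return rise / (math.log(t2) - math.log(t1))

slope = loglog_slope(2.0, 50.0)          # equals ALPHA up to rounding error
```

With estimated rather than known cumulative hazards, the same plot would be only approximately linear under a Weibull model.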


17.6.2.  Some  Parametric  Models 

Popular  choices  for  parametric  models  include  the  exponential,  Weibull,  Gompertz, 
log-normal,  log-logistic,  and  the  gamma.  The  hazard  and  survivor  functions  for  these 
models  are  in  Table  17.5. 

For the gamma, $\Gamma(\alpha) = \int_0^{\infty} e^{-t} t^{\alpha-1} dt$ is the gamma function and $I(\alpha, \gamma t)$ is the
incomplete gamma function, where $I(\alpha, x) = \int_0^{x} e^{-t} t^{\alpha-1} dt / \Gamma(\alpha)$, $0 < I(\alpha, x) < 1$.

The generalized Weibull model was suggested by Mudholkar, Srivastava, and Kollia
(1996). Through the introduction of an additional shape parameter $\mu$ in the Weibull, it
overcomes an important restriction of that model and allows the hazard function to
have a more flexible shape. The Weibull model is obtained in the limit as $\mu \to 0$.
From Table 17.5 note that

$$\ln \lambda(t) = \ln(\gamma \alpha) + (\alpha - 1) \ln t - \mu \ln S(t).$$

Because $\partial \ln S(t)/\partial t < 0$, the right-hand side of this equation is increasing in $t$ if
$\mu > 0$ and $\alpha > 1$. If $\alpha < 1$ and $\mu < 0$, then the hazard function is monotonically
decreasing. If $\alpha > 1$ and $\mu < 0$, then the hazard function has two components, one of
which is a decreasing function and the other an increasing function in $t$. Hence the
two together can generate a unimodal or U-shaped hazard function. Therefore, the
generalized Weibull is a potentially flexible and useful functional form.

The Gompertz is similar to the Weibull as it has a hazard function that can be
monotonically increasing (if $\alpha > 0$) or monotonically decreasing (if $\alpha < 0$), with the
exponential as a special case ($\alpha = 0$). The Gompertz is a good model for mortality data and
is used more in biostatistics than econometrics.

The log-normal distribution has an inverted bathtub hazard that first increases with
$t$ and then decreases with $t$. So too does the log-logistic, for $\alpha > 1$. These models are
clearly more appropriate than exponential, Weibull, and Gompertz for duration data
with this property.

Other parametric models include models based on the Rayleigh and Makeham
distributions, inverse-Gaussian piecewise continuous hazards model, and the generalized
gamma model (Lawless, 1982), which nests the gamma and Weibull models as special
cases. Many parametric models are presented in detail in Kalbfleisch and Prentice
(1980, chapter 3) and Lancaster (1990, chapter 3).

The distributions are generally two-parameter distributions. Regressors are introduced
by letting $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$ with $\alpha$ left as a constant, but for the log-normal $\mu = \mathbf{x}'\boldsymbol{\beta}$
and $\sigma^2$ is left as a constant.

The main issues in parametric modeling are the dependence on correct model
specification for consistent parameter estimates and the wide range of parametric models
that are available. Most models can be classified as either a PH model (the first four
in Table 17.5) or an accelerated failure time model (the first two and the last three
models in Table 17.5). The Weibull model, a member of both classes, is widely used
in economics applications. Another widely used model, particularly for economics
applications in which many observations are available, is the piecewise constant hazard
model, which is a special case of the PH model.


17.6.3.  Maximum  Likelihood  Estimation 

We  now  consider  fully  parametric  analysis  with  independent  or  noninformative  cen¬ 
soring,  with  estimation  by  ML  and  by  least  squares.  The  continuous  duration  formu¬ 
lation  is  used  since  parametric  models  are  based  on  continuous  distributions.  The  re¬ 
gressors  are  assumed  to  be  time-invariant,  with  time-varying  regressors  deferred  to 
Section  17.9. 

Let $T^*$ denote durations without censoring, with conditional density $f(t \mid \mathbf{x}, \boldsymbol{\theta})$,
where $\boldsymbol{\theta}$ is a $q \times 1$ parameter vector and $\mathbf{x}$ are regressors that can vary across
subjects but do not vary over a spell for a given subject. Estimation is complicated by
the presence of censoring. Then the observed duration $t$ is the length of a possibly
incomplete spell, and the data are augmented by a variable indicating the presence of
censoring, which is assumed to be noninformative.

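From Section 17.4.2, the treatment is similar to that for the Tobit model. For
uncensored observations the contribution to the likelihood is $f(t \mid \mathbf{x}, \boldsymbol{\theta})$. For right-censored
observations we know only that the duration exceeded $t$, so the contribution is

$$\Pr[T > t] = \int_t^{\infty} f(u \mid \mathbf{x}, \boldsymbol{\theta})\, du = 1 - F(t \mid \mathbf{x}, \boldsymbol{\theta}) = S(t \mid \mathbf{x}, \boldsymbol{\theta}),$$

where $S(\cdot)$ is the survivor function. The density for the $i$th observation can be written
as

$$f(t_i \mid \mathbf{x}_i, \boldsymbol{\theta})^{\delta_i}\, S(t_i \mid \mathbf{x}_i, \boldsymbol{\theta})^{1-\delta_i},$$

where $\delta_i$ is a right-censoring indicator with

$$\delta_i = \begin{cases} 1 & \text{(no censoring)}, \\ 0 & \text{(right-censoring)}. \end{cases}$$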

Taking logs and summing, we have that the MLE $\hat{\boldsymbol{\theta}}$ maximizes the log-likelihood

$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ \delta_i \ln f(t_i \mid \mathbf{x}_i, \boldsymbol{\theta}) + (1 - \delta_i) \ln S(t_i \mid \mathbf{x}_i, \boldsymbol{\theta}) \right], \tag{17.17}$$

where independence over $i$ has been assumed. The first term in the sum corresponds
to completed spells and the second term to right-censored spells. Since $\ln S(t) = -\Lambda(t)$
and $\ln f(t) = \ln(\lambda(t) S(t)) = \ln \lambda(t) + \ln S(t)$, this log-likelihood can alternatively be
written in terms of the hazard and integrated hazard functions:

$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ \delta_i \ln \lambda(t_i \mid \mathbf{x}_i, \boldsymbol{\theta}) - \Lambda(t_i \mid \mathbf{x}_i, \boldsymbol{\theta}) \right]. \tag{17.18}$$

This result is useful if the parametric model is defined by specifying the hazard rate
rather than the pdf.
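
The equivalence of the density form and the hazard form of the log-likelihood can be checked numerically. The sketch below uses the exponential model without regressors, for which $\lambda(t) = \gamma$ and $\Lambda(t) = \gamma t$. The data are made up for illustration and the function names are assumptions; the closed form $\hat{\gamma} = \sum_i \delta_i / \sum_i t_i$ follows from setting the derivative of $\sum_i [\delta_i \ln \gamma - \gamma t_i]$ to zero.

```python
import math

def loglik_density_form(gamma, data):
    """Sum of delta*ln f(t) + (1 - delta)*ln S(t) for the exponential model."""
    total = 0.0
    for t, delta in data:
        log_f = math.log(gamma) - gamma * t
        log_s = -gamma * t
        total += delta * log_f + (1 - delta) * log_s
    return total

def loglik_hazard_form(gamma, data):
    """Sum of delta*ln lambda(t) - Lambda(t) for the exponential model."""
    return sum(delta * math.log(gamma) - gamma * t for t, delta in data)

# Hypothetical (t, delta) pairs: delta = 0 marks a right-censored spell.
data = [(2.0, 1), (5.0, 0), (1.5, 1), (4.0, 1), (5.0, 0)]

# Closed-form MLE: number of completed spells over total observed time.
gamma_hat = sum(d for _, d in data) / sum(t for t, _ in data)
```

Censored spells add to the denominator of `gamma_hat` but not to the numerator, so ignoring censoring would overstate the hazard.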

The usual estimation theory applies. The MLE will be distributed as $\hat{\boldsymbol{\theta}} \stackrel{a}{\sim}
\mathcal{N}[\boldsymbol{\theta}, (-\mathrm{E}[\partial^2 \ln L / \partial \boldsymbol{\theta} \partial \boldsymbol{\theta}'])^{-1}]$ if the density is correctly specified, see Section 5.7.3.
If the density is incorrectly specified, however, the MLE is inconsistent. The one
notable exception is the exponential duration model in the absence of censoring, for
which consistency requires only that the conditional mean function be correctly
specified; see Section 5.7.3. However, inconsistency under misspecification arises even for
the exponential model if censoring is introduced, and it arises for other parametric
duration models even without censoring. This lack of robustness is the major weakness
of the parametric approach, just as in the Tobit model case.

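The ML approach can be adapted to permit other types of censoring. With
left-censoring, the spell is known to be of length at most $t$, and the likelihood contribution
is $\Pr[T^* \le t] = \int_0^t f(s \mid \mathbf{x}, \boldsymbol{\theta})\, ds = F(t \mid \mathbf{x}, \boldsymbol{\theta})$.

With interval-censoring the data are known to lie in $[t_a, t_b)$ and the likelihood
contribution is $\Pr[t_a \le T^* < t_b] = \int_{t_a}^{t_b} f(s \mid \mathbf{x}, \boldsymbol{\theta})\, ds = S(t_a \mid \mathbf{x}, \boldsymbol{\theta}) - S(t_b \mid \mathbf{x}, \boldsymbol{\theta})$.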

Duration  data  in  economics  applications  are  often  interval-censored.  For  example, 
unemployment  durations  may  be  grouped  into  weeks  and  months,  yet  the  parametric 
model  is  a  continuous  distribution  such  as  the  Weibull.  It  is  usually  assumed  that  the 
effect  of  interval-censoring  is  sufficiently  minor  so  that  the  interval-censoring  can  be 
ignored.  For  example,  a  person  who  is  unemployed  after  two  months  but  no  longer 
unemployed  after  three  months  may  be  treated  as  having  an  unemployment  spell  of 
exactly  three  months,  rather  than  a  spell  in  the  range  of  two  to  three  months. 


17.6.4.  Components  of  Likelihood 

Given  a  mix  of  data,  with  durations  that  may  be  complete,  truncated,  or  censored  in 
one  of  the  aforementioned  ways,  maximum  likelihood  of  a  parametrically  specified 
model  requires  one  to  set  up  the  likelihood  function.  (Lancaster  (1979)  displays  dif¬ 
ferent  likelihood  expressions  appropriate  for  three  different  data  setups  for  unemploy¬ 
ment  durations.)  Each  type  of  observation  contributes  a  term  to  the  likelihood  function, 
and  the  full  likelihood  is  formed  by  taking  appropriate  products  of  terms  such  as  the 
following  (see  Klein  and  Moeschberger,  1997,  p.  66): 

complete durations: $f(t)$,

left-truncated at $t_L$ ($t \ge t_L$): $f(t)/S(t_L)$,

left-censored at $t_{cL}$: $1 - S(t_{cL})$,

right-censored at $t_{cR}$: $S(t_{cR})$,

right-truncated at $t_R$ ($t < t_R$): $f(t)/[1 - S(t_R)]$,

interval-censored at $t_{cL}$, $t_{cR}$: $S(t_{cL}) - S(t_{cR})$.


17.6.5. Weibull MLE Example

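The Weibull distribution is presented in detail in Section 17.6.1. The hazard function
is $\lambda(t) = \gamma \alpha t^{\alpha-1}$, where $\alpha > 0$ and $\gamma > 0$.

Regressors can be introduced in many possible ways, but the usual specification
is to let $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$, which ensures $\gamma > 0$, while $\alpha$ does not vary with regressors.
(Some programs instead specify $\gamma = \exp(-\mathbf{x}'\boldsymbol{\beta})$, which leads to a reversal in the signs
of the estimates of $\boldsymbol{\beta}$.) Then

$$\ln f(t \mid \mathbf{x}, \boldsymbol{\beta}, \alpha) = \ln\left[\exp(\mathbf{x}'\boldsymbol{\beta})\, \alpha t^{\alpha-1} \exp(-\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha})\right]
= \mathbf{x}'\boldsymbol{\beta} + \ln \alpha + (\alpha - 1) \ln t - \exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}$$

and

$$\ln S(t \mid \mathbf{x}, \boldsymbol{\beta}, \alpha) = \ln\left[\exp(-\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha})\right] = -\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}.$$

The likelihood function (17.17) becomes

$$\ln L = \sum_i \left[ \delta_i \left\{ \mathbf{x}_i'\boldsymbol{\beta} + \ln \alpha + (\alpha - 1) \ln t_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right\} - (1 - \delta_i) \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right]. \tag{17.19}$$

The first-order conditions for $\boldsymbol{\beta}$ and $\alpha$ are

$$\frac{\partial \ln L}{\partial \boldsymbol{\beta}} = \sum_i \left( \delta_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right) \mathbf{x}_i = \mathbf{0},$$

$$\frac{\partial \ln L}{\partial \alpha} = \sum_i \left[ \delta_i (1/\alpha + \ln t_i) - \ln t_i \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right] = 0.$$

Consistency clearly requires strong assumptions. For example, even with no censoring
$\mathrm{E}[\partial \ln L / \partial \boldsymbol{\beta}] = \mathbf{0}$ requires $\mathrm{E}[T^{\alpha} \mid \mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta})$.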

17.6.6.  Use  of  Model  Estimates 

The usual way to interpret estimates of nonlinear regression models is to consider the
effect of regressors on the conditional mean. If $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$ then from Table 17.4
the completed Weibull durations have mean $\mathrm{E}[T^* \mid \mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta}/\alpha)\, \Gamma(\alpha^{-1} + 1) =
\exp(-\mathbf{x}'\boldsymbol{\beta}/\alpha)\, \Gamma(\alpha^{-1})/\alpha$. One can calculate the expected length of completed spells at
various values of $\mathbf{x}$. For example, the length of completed unemployment for a person
of given age, gender, and education level, say, can be predicted postestimation.
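
The conditional mean formula is simple to evaluate once estimates are available. The sketch below is illustrative; the function name `expected_duration` and the index and shape values passed to it are hypothetical stand-ins for fitted values.

```python
import math

def expected_duration(xb, alpha):
    """E[T*|x] = exp(-x'beta/alpha) * Gamma(1/alpha + 1), with gamma = exp(x'beta)."""
    return math.exp(-xb / alpha) * math.gamma(1 / alpha + 1)

mean_exponential = expected_duration(-2.0, 1.0)   # alpha = 1 reduces to exp(-x'beta)
mean_weibull = expected_duration(-2.0, 1.5)
mean_higher_index = expected_duration(-1.0, 1.5)  # larger x'beta, higher hazard
```

A larger index $\mathbf{x}'\boldsymbol{\beta}$ raises the hazard and so lowers the predicted mean duration.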

Parametric regression models also permit prediction of aspects of durations other
than just the sample mean. For example, interest may lie in what fraction of population
total time in completed unemployment spells is due to spells in excess of a given length
or is experienced by individuals in a given socioeconomic group. The econometrics of
duration models focuses on the role of covariates but it is especially concerned with the
shape of the hazard function, notably because some economic theories make explicit
predictions about the shape of the hazard function.

Despite these possibilities, interpretation of estimates of parametric duration models
often focuses on the Weibull hazard rate $\lambda(t) = \gamma \alpha t^{\alpha-1}$ and how it changes over
time and with changes in regressors. As noted in Section 17.3.2, this hazard rate is
increasing if $\alpha > 1$ and is decreasing if $\alpha < 1$ so that one-sided tests of $\alpha = 1$ are
obviously of interest. For changes in regressors

$$\partial \lambda(t)/\partial \mathbf{x} = \exp(\mathbf{x}'\boldsymbol{\beta})\, \alpha t^{\alpha-1} \boldsymbol{\beta} = \lambda(t) \boldsymbol{\beta},$$

so that changes in regressors have the effect of a multiplicative change in the hazard
function. A positive coefficient $\beta_j$ therefore implies that an increase in $x_j$ raises the
hazard of failure and hence lowers the expected duration.


17.6.7.  Least-Squares  Estimation 

Estimation  of  fully  parametric  models  can  be  by  least  squares  rather  than  MLE,  simi¬ 
lar  to  the  censored  Tobit  model.  We  present  results,  although  least-squares  regression 
sees  little  use  in  practice  because  the  methods  still  rely  on  correct  specification  of  the 
density  and  yet  are  less  efficient  than  the  MLE. 

We begin with the exponential duration regression model. Then $\mathrm{E}[T \mid \mathbf{x}] = 1/\gamma =
\exp(-\mathbf{x}'\boldsymbol{\beta})$, so that NLS regression of $t_i$ on $\exp(-\mathbf{x}_i'\boldsymbol{\beta})$ gives a consistent though
inefficient estimator for $\boldsymbol{\beta}$. Alternatively, the exponential duration model can be written
as $-\ln t = \mathbf{x}'\boldsymbol{\beta} + u$, where $u$ is extreme value distributed (see Section 17.7.2). Then
$\mathrm{E}[-\ln T \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\beta} + c$, where $c \simeq 0.5772$ is Euler's constant. So the slope coefficients
in $\boldsymbol{\beta}$ can be consistently estimated by linear regression of $-\ln t_i$ on $\mathbf{x}_i$. With
right-censoring we need to obtain analytical censored moments, which is possible for
the exponential.

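Extensions can be made using the more general results of Kiefer (1988, p. 665). He
considers the PH model (17.21) with $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Then

$$\lambda(t \mid \mathbf{x}) = \lambda_0(t, \alpha) \exp(\mathbf{x}'\boldsymbol{\beta}).$$

Then an expression for the baseline integrated hazard can be derived as follows:

$$\int_0^t \lambda(s \mid \mathbf{x})\, ds = \int_0^t \lambda_0(s, \alpha) \exp(\mathbf{x}'\boldsymbol{\beta})\, ds, \tag{17.20}$$

$$\Lambda(t \mid \mathbf{x}) = \Lambda_0(t, \alpha) \exp(\mathbf{x}'\boldsymbol{\beta}),$$

$$\ln \Lambda(t \mid \mathbf{x}) = \ln \Lambda_0(t, \alpha) + \mathbf{x}'\boldsymbol{\beta},$$

$$-\ln \Lambda_0(t, \alpha) = \mathbf{x}'\boldsymbol{\beta} - \ln \Lambda(t \mid \mathbf{x}) = \mathbf{x}'\boldsymbol{\beta} + u,$$

where the error term $u = -\ln \Lambda(t \mid \mathbf{x})$ is type I extreme value distributed.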
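
The regression representation $-\ln \Lambda_0(t, \alpha) = \mathbf{x}'\boldsymbol{\beta} + u$ can be illustrated by simulation for the Weibull, where $\Lambda_0(t, \alpha) = t^{\alpha}$. The sketch below is an illustration under stated assumptions: the index value 0.7, the shape 1.5, the sample size, and the seed are arbitrary, and the function name is hypothetical. It relies on the fact that a standard type I extreme value variable has mean equal to Euler's constant, about 0.5772.

```python
import math
import random

def mean_transformed_duration(xb, alpha, n=100_000, seed=0):
    """Average of -ln Lambda_0(T, alpha) = -alpha * ln T over simulated Weibull draws."""
    rng = random.Random(seed)
    gamma = math.exp(xb)
    scale = gamma ** (-1 / alpha)            # so that S(t) = exp(-gamma * t**alpha)
    total = 0.0
    for _ in range(n):
        t = rng.weibullvariate(scale, alpha) # arguments are (scale, shape)
        total += -alpha * math.log(t)
    return total / n

EULER = 0.5772156649
sim_mean = mean_transformed_duration(xb=0.7, alpha=1.5)
# sim_mean should be close to xb + EULER, the mean of x'beta + u.
```

The simulated mean differs from $\mathbf{x}'\boldsymbol{\beta}$ by roughly Euler's constant, which is why the intercept of the transformed regression is shifted while the slopes are not.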

This result holds regardless of the choice of baseline hazard. We interpret this result
in the following way. For a particular choice of baseline hazard $\lambda_0(t, \alpha)$, a convenient
transformation of the dependent variable $t$ is $-\ln \Lambda_0(t, \alpha)$, since it can be expressed
as a linear regression model with error term that is type I extreme value distributed.
For the exponential, already discussed, $\ln \Lambda_0(t, \alpha) = \ln t$ whereas for the Weibull
$\ln \Lambda_0(t, \alpha) = \alpha \ln t$. In censored samples we obtain $\mathrm{E}[\ln \Lambda_0(T, \alpha) \mid T > t^*]$ using
results for the censored type I extreme value, and then follow a Heckman two-step
procedure. These results can also be used as the basis for simple diagnostics; this topic
is discussed in the next chapter.


17.7.  Some  Important  Duration  Models 

Perhaps the most widely used formulation in regression analysis of durations is
the proportional hazard model. However, familiarity with some of its variants and with
the accelerated failure time (AFT) models, discussed in Section 17.7.2, is also helpful.


17.7.1.  Proportional  Hazards  Model 

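In a proportional hazard model, as previously mentioned, the conditional hazard rate $\lambda(t|\mathbf{x})$ can be factored into separate functions of $t$ and $\mathbf{x}$,

$$\lambda(t|\mathbf{x}) = \lambda_0(t, \alpha)\,\phi(\mathbf{x}, \boldsymbol{\beta}), \tag{17.21}$$

where $\lambda_0(t, \alpha)$ is called the baseline hazard and is a function of $t$ alone, and $\phi(\mathbf{x}, \boldsymbol{\beta})$ is a function of $\mathbf{x}$ alone. Usually $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Polynomial baseline hazards are popular in the literature.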

All hazard functions $\lambda(t|\mathbf{x})$ of form (17.21) are proportional to the baseline hazard, with scale factor $\phi(\mathbf{x}, \boldsymbol{\beta})$ that is not an explicit function of $t$. The PH model is widely used as the parameters $\boldsymbol{\beta}$ can be consistently estimated without specification of the functional form for $\lambda_0(\cdot)$ (see Section 17.8).

The exponential, Weibull, and Gompertz regression models are all PH models, since their hazards are, respectively, $\exp(\mathbf{x}'\boldsymbol{\beta})$, $\exp(\mathbf{x}'\boldsymbol{\beta})\,\alpha t^{\alpha-1}$, and $\exp(\mathbf{x}'\boldsymbol{\beta})\exp(\alpha t)$.

Another example of the PH model, used especially in applications to unemployment durations, is the piecewise constant hazard model, which lets $\lambda_0(t, \alpha)$ be a step function with $k$ segments so that

$$\lambda_0(t, \alpha) = e^{\alpha_j}, \quad c_{j-1} \le t < c_j, \quad j = 1, \ldots, k, \tag{17.22}$$

where $c_0 = 0$, $c_k = \infty$, the other breakpoints $c_1, \ldots, c_{k-1}$ are specified, and the parameters $\alpha_1, \ldots, \alpha_k$ are to be estimated. These parameters are exponentiated to ensure $\lambda_0(t, \alpha) > 0$. This model has more baseline parameters to estimate than models such as the Weibull, which has only one baseline hazard parameter, but can still be practical with a sufficiently large data set.
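As an illustration of (17.22), the following sketch evaluates a piecewise constant baseline hazard and its integrated hazard. The breakpoints and parameter values are hypothetical, and the function names are ours; the integrated hazard simply sums the hazard in each segment times the time spent in that segment.

```python
import math

def baseline_hazard(t, breaks, alphas):
    """lambda_0(t) = exp(alpha_j) for c_{j-1} <= t < c_j, with c_0 = 0, c_k = inf."""
    edges = [0.0] + list(breaks) + [math.inf]
    for j in range(len(alphas)):
        if edges[j] <= t < edges[j + 1]:
            return math.exp(alphas[j])
    raise ValueError("t must be nonnegative")

def integrated_baseline_hazard(t, breaks, alphas):
    """Lambda_0(t): hazard in each segment times the time spent in that segment."""
    edges = [0.0] + list(breaks) + [math.inf]
    total = 0.0
    for j in range(len(alphas)):
        lo, hi = edges[j], min(edges[j + 1], t)
        if hi > lo:
            total += math.exp(alphas[j]) * (hi - lo)
    return total

breaks = [4.0, 13.0, 26.0]          # hypothetical breakpoints c_1, c_2, c_3 (weeks)
alphas = [-3.0, -2.5, -2.0, -3.5]   # hypothetical alpha_1, ..., alpha_4
# Baseline survivor probability at 30 weeks, S_0(30) = exp(-Lambda_0(30)).
surv_30 = math.exp(-integrated_baseline_hazard(30.0, breaks, alphas))
```

With $k$ segments there are $k$ baseline parameters, which is why the text notes that a sufficiently large data set is needed.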

The  identifiability  of  the  PH  model  in  the  presence  of  unobserved  heterogeneity  is 
discussed  in  Section  18.3. 


17.7.2.  Accelerated  Failure  Time  Model 

An AFT model arises by first modeling $\ln t$ rather than $t$. A regression model is specified for

$$\ln t = \mathbf{x}'\boldsymbol{\beta} + u, \tag{17.23}$$

and different distributions for $u$ lead to different AFT models. Since $\ln t$ can take values on $(-\infty, \infty)$ the distribution for $u$ can be any continuous distribution on $(-\infty, \infty)$.

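The term accelerated failure time arises because $t = \exp(\mathbf{x}'\boldsymbol{\beta})v$, where $v = e^u$, has hazard rate $\lambda(t|\mathbf{x}) = \lambda_0(v)\exp(-\mathbf{x}'\boldsymbol{\beta})$, where the baseline hazard $\lambda_0(v)$ is a function of $v$ alone. This follows because $S(t|\mathbf{x}) = S_0(t\exp(-\mathbf{x}'\boldsymbol{\beta}))$, and differentiating $-\ln S(t|\mathbf{x})$ with respect to $t$ brings out the factor $\exp(-\mathbf{x}'\boldsymbol{\beta})$. Substituting $v = t\exp(-\mathbf{x}'\boldsymbol{\beta})$ yields the hazard

$$\lambda(t|\mathbf{x}) = \lambda_0(t\exp(-\mathbf{x}'\boldsymbol{\beta}))\exp(-\mathbf{x}'\boldsymbol{\beta}). \tag{17.24}$$

This is an acceleration of the baseline hazard $\lambda_0(t)$ if $\exp(-\mathbf{x}'\boldsymbol{\beta}) > 1$ and a deceleration if $\exp(-\mathbf{x}'\boldsymbol{\beta}) < 1$.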
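The form of the AFT hazard can be verified numerically. The sketch below assumes, purely for illustration, that $u$ is standard logistic, so that $v = e^u$ has baseline survivor function $S_0(v) = 1/(1+v)$ and baseline hazard $\lambda_0(v) = 1/(1+v)$. It compares the hazard computed as $-d\ln S(t|\mathbf{x})/dt$ by finite differences with $\lambda_0(t\exp(-\mathbf{x}'\boldsymbol{\beta}))\exp(-\mathbf{x}'\boldsymbol{\beta})$.

```python
import math

def surv(t, xb):
    """S(t|x) = S_0(t * exp(-x'beta)) with log-logistic baseline S_0(v) = 1 / (1 + v)."""
    return 1.0 / (1.0 + t * math.exp(-xb))

def hazard_numeric(t, xb, h=1e-6):
    """Hazard as -d ln S(t|x) / dt, using a central finite difference."""
    return -(math.log(surv(t + h, xb)) - math.log(surv(t - h, xb))) / (2.0 * h)

def hazard_aft(t, xb):
    """lambda_0(t * exp(-x'beta)) * exp(-x'beta), with lambda_0(v) = 1 / (1 + v)."""
    v = t * math.exp(-xb)
    return math.exp(-xb) / (1.0 + v)

# Hypothetical evaluation point: t = 2 and x'beta = 0.7.
gap = abs(hazard_numeric(2.0, 0.7) - hazard_aft(2.0, 0.7))
```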

The log-normal model for $t$ results if $u \sim \mathcal{N}[0, \sigma^2]$; the log-logistic model is obtained by specifying $u$ to be logistic distributed. The gamma model can also be obtained as an AFT model, by letting $u$ have density $f(u) = \exp(\alpha u - e^u)/\Gamma(\alpha)$.

The Weibull and exponential models are unique in being of both PH form and AFT form. The latter form is obtained by letting $u$ be $\alpha w$, where $w$ is extreme value distributed with density $f(w) = e^w \exp(-e^w)$.

Additional duration models can be obtained by considering $g(t) = \mathbf{x}'\boldsymbol{\beta} + u$, for transformations other than $g(t) = \ln t$. This is a member of the class of transformation models, which includes, for example, the Box-Cox regression model.


17.7.3.  Flexible  Hazard  Models

Some models begin with specification of the hazard rate, rather than the pdf. For example, the hazard may be specified to be quadratic in $t$, such as $\lambda(t) = \mathbf{x}'\boldsymbol{\beta} + \alpha_1 t + \alpha_2 t^2$. This permits a U-shaped hazard function. The corresponding integrated hazard is $\Lambda(t) = (\mathbf{x}'\boldsymbol{\beta})t + (\alpha_1/2)t^2 + (\alpha_2/3)t^3$. Given $\lambda(t)$ and $\Lambda(t)$ we can directly form the log-likelihood, using the earlier result.

The weaknesses of this approach are that negative values of $\lambda$ and $\Lambda$ may occur and that the hazard rate may be defective as the corresponding pdf may not necessarily integrate to unity.
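A minimal sketch of the log-likelihood contribution for the quadratic hazard follows. It assumes the standard right-censored form $\delta \ln \lambda(t) - \Lambda(t)$, with $\delta = 1$ for a completed spell and $\delta = 0$ for a censored spell, which we take to be the earlier result referred to above. The parameter values are hypothetical, and the explicit check for a nonpositive hazard reflects the first weakness just noted.

```python
import math

def hazard(t, xb, a1, a2):
    """Quadratic hazard lambda(t) = x'beta + a1*t + a2*t**2."""
    return xb + a1 * t + a2 * t ** 2

def integrated_hazard(t, xb, a1, a2):
    """Lambda(t) = (x'beta)*t + (a1/2)*t**2 + (a2/3)*t**3."""
    return xb * t + (a1 / 2.0) * t ** 2 + (a2 / 3.0) * t ** 3

def loglik_contribution(t, delta, xb, a1, a2):
    """delta * ln lambda(t) - Lambda(t); delta = 1 if completed, 0 if right-censored."""
    lam = hazard(t, xb, a1, a2)
    if lam <= 0.0:
        raise ValueError("hazard is not positive at t; parameters are inadmissible")
    return delta * math.log(lam) - integrated_hazard(t, xb, a1, a2)

# Hypothetical values giving a U-shaped hazard with minimum at t = 2.
xb, a1, a2 = 0.5, -0.4, 0.1
ll_complete = loglik_contribution(3.0, 1, xb, a1, a2)
ll_censored = loglik_contribution(3.0, 0, xb, a1, a2)
```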


17.8.  Cox  PH  Model 

Fully  parametric  models  for  single-spell  duration  data  are  relatively  simple  to  estimate 
in  the  presence  of  censoring  but  produce  inconsistent  parameter  estimates  if  any  part  of 
the  parametric  model  is  misspecified.  One  way  of  resolving  this  impasse  is  to  choose 
parametric  functional  forms  that  are  flexible  and  hence  provide  some  protection  against 
misspecification. Although this is a valid approach in principle, identification and estimation of such flexible functional forms is not always straightforward. An example
is  the  generalized  gamma  model,  which  many  users  find  difficult  to  estimate. 

Fortunately,  there  is  a  semiparametric  method  that  requires  less  than  complete 
distributional  specification.  The  method  differs  considerably  from  semiparametric 
methods  proposed  for  Tobit  models,  where  similar  issues  of  model  robustness  under 
censoring  arise,  as  it  is  based  on  a  model  for  the  hazard  rate  that  has  no  meaningful 
physical  interpretation  in  the  Tobit  case.  In  addition,  unlike  the  Tobit  case,  the  method 
is viewed as empirically so successful that it has become the standard method for
survival  data. 


17.8.1.  Proportional  Hazards  Model 

The starting point is to propose a particular functional form for the hazard rate, the proportional hazard model, introduced in Section 17.7.1, with conditional hazard rate $\lambda(t|\mathbf{x})$ factored into separate functions of $t$ and $\mathbf{x}$,

$$\lambda(t|\mathbf{x}, \boldsymbol{\beta}) = \lambda_0(t)\,\phi(\mathbf{x}, \boldsymbol{\beta}). \tag{17.25}$$

As before, the function $\lambda_0(t)$ is called the baseline hazard and is a function of $t$ alone. The function $\phi(\mathbf{x}, \boldsymbol{\beta})$ is a function of $\mathbf{x}$ alone, where initially we consider time-invariant regressors $\mathbf{x}$ but later relax this assumption. A semiparametric model is considered, with the functional form for $\lambda_0(t)$ unspecified and the functional form for $\phi(\mathbf{x}, \boldsymbol{\beta})$ fully specified.

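The most common choice of $\phi(\mathbf{x}, \boldsymbol{\beta})$ is the exponential form

$$\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta}). \tag{17.26}$$

This permits coefficients to be easily interpretable, in addition to ensuring $\phi(\mathbf{x}, \boldsymbol{\beta}) > 0$. Suppose the $j$th regressor $x_j$ increases by one unit and other regressors are unchanged; then

$$
\begin{aligned}
\lambda(t|\mathbf{x}_{\text{new}}, \boldsymbol{\beta}) &= \lambda_0(t)\exp(\mathbf{x}'\boldsymbol{\beta} + \beta_j) \\
&= \exp(\beta_j)\,\lambda(t|\mathbf{x}, \boldsymbol{\beta}).
\end{aligned}
\tag{17.27}
$$

Thus the new hazard is $\exp(\beta_j)$ times the original hazard, and the change in the hazard is $\exp(\beta_j) - 1$ times the original hazard. If one instead uses calculus methods, the change in the hazard is $\beta_j$ times the original hazard, since

$$\partial\lambda(t|\mathbf{x}, \boldsymbol{\beta})/\partial x_j = \lambda_0(t)\exp(\mathbf{x}'\boldsymbol{\beta})\beta_j = \beta_j\,\lambda(t|\mathbf{x}, \boldsymbol{\beta}). \tag{17.28}$$

This is consistent with the noncalculus result as $\exp(\beta_j) \simeq 1 + \beta_j$ for small $\beta_j$. Statistical packages often report estimates and associated confidence intervals for both $\beta_j$ and $\exp(\beta_j)$.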
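The difference between the exact and calculus-based interpretations can be seen in a short numerical sketch. The coefficient values below are hypothetical; the function name is ours.

```python
import math

def hazard_effects(beta_j):
    """Return (hazard ratio, exact proportional change, calculus approximation)."""
    ratio = math.exp(beta_j)          # new hazard / original hazard
    return ratio, ratio - 1.0, beta_j

for b in (0.05, 0.5, -0.5):           # hypothetical coefficient values
    ratio, exact, approx = hazard_effects(b)
    print(f"beta_j={b:+.2f}  ratio={ratio:.4f}  exact={exact:+.4f}  calculus={approx:+.4f}")
```

The two measures of the proportional change are close when $\beta_j$ is near zero and diverge as $|\beta_j|$ grows, which is one reason packages report $\exp(\beta_j)$ as well as $\beta_j$.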

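For more general forms of $\phi(\mathbf{x}, \boldsymbol{\beta})$, changes in regressors can again be interpreted as having a multiplicative effect on the original hazard, since

$$
\begin{aligned}
\partial\lambda(t|\mathbf{x}, \boldsymbol{\beta})/\partial x_j &= \lambda_0(t)\,\partial\phi(\mathbf{x}, \boldsymbol{\beta})/\partial x_j \\
&= \lambda(t|\mathbf{x}, \boldsymbol{\beta}) \times \left[\partial\phi(\mathbf{x}, \boldsymbol{\beta})/\partial x_j\right]/\phi(\mathbf{x}, \boldsymbol{\beta}).
\end{aligned}
\tag{17.29}
$$

This requires knowledge of $\boldsymbol{\beta}$ but not of the baseline hazard $\lambda_0(t)$.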

An important issue is the identification of the PH model. This is discussed in the next chapter in a more general setting that allows for the presence of unobserved heterogeneity in the model.


17.8.2.  Partial  Likelihood  Estimation

Cox (1972, 1975) proposed a method to estimate $\boldsymbol{\beta}$ in the PH model that does not require simultaneous estimation of the baseline hazard function $\lambda_0(t)$. If desired an estimate of the baseline hazard can be recovered after estimation of $\boldsymbol{\beta}$. The results presented here accommodate independent censoring and tied data.

The setup resembles that in Section 17.5, with failure times ordered and categorization of observations into those that die or are at risk at each failure time. Let $t_1 < t_2 < \cdots < t_j < \cdots < t_k$ denote the observed discrete failure times of the spells in a sample of size $N$, $N \ge k$. The risk set $R(t_j)$ is defined to be the set of individuals who are at risk of failing just before the $j$th ordered failure, $D(t_j)$ is the set of subjects that die at time $t_j$, and $d_j$ denotes the number that die at time $t_j$. To summarize, we have

$$
\begin{aligned}
R(t_j) &= \{l : t_l \ge t_j\} = \text{set of spells at risk at } t_j, \\
D(t_j) &= \{l : t_l = t_j\} = \text{set of spells completed at } t_j, \\
d_j &= \textstyle\sum_l \mathbf{1}(t_l = t_j) = \text{number of spells completed at } t_j.
\end{aligned}
\tag{17.30}
$$

The risk set at time $t_j$ includes all spells that have not yet been completed or censored. Tied data are possible, in which case $d_j > 1$.

Now consider the probability of a particular at-risk spell ending at time $t_j$. The probability that spell $j$ is the actual spell that ends equals the conditional probability of failure for spell $j$ divided by the conditional probability that a spell of any individual in the risk set $R(t_j)$ fails. This latter probability is the sum of the conditional probability of failure for each individual in $R(t_j)$. Then

$$
\begin{aligned}
\Pr[T_j = t_j | R(t_j)] &= \frac{\Pr[T_j = t_j | T_j \ge t_j]}{\sum_{l \in R(t_j)} \Pr[T_l = t_j | T_l \ge t_j]} \\
&= \frac{\lambda_j(t_j|\mathbf{x}_j, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \lambda_l(t_j|\mathbf{x}_l, \boldsymbol{\beta})} \\
&= \frac{\phi(\mathbf{x}_j, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})},
\end{aligned}
$$

where in the last line the baseline hazard factor $\lambda_0(t_j)$ has dropped out, as a consequence of the PH assumption. (As a result the intercept in this model is not identified.) The preceding result that the baseline hazard can be eliminated provides a basis for estimating $\boldsymbol{\beta}$. However, we must control for tied durations that are likely to occur when durations are grouped.

If the data include ties (i.e., there is more than one failure at a given time), an adjustment is needed. For example, suppose there are two tied values at time $t_j$, for individuals $j_1$ and $j_2$ with regressors $\mathbf{x}_{j_1}$ and $\mathbf{x}_{j_2}$. If $j_1$ fails before $j_2$ then the probability is

$$\frac{\phi(\mathbf{x}_{j_1}, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})} \times \frac{\phi(\mathbf{x}_{j_2}, \boldsymbol{\beta})}{\sum_{l \in R_1(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})},$$

where $R_1(t_j)$ equals $R(t_j)$ with subject $j_1$ excluded. A similar term arises if $j_2$ fails before $j_1$, and the likelihood contribution is the sum of these two possibilities. The exact likelihood becomes quite complicated with many tied values. A standard approximation, due to Breslow and Peto, see Cox and Oakes (1984), is to let

$$\Pr[T_j = t_j | j \in R(t_j)] \simeq \frac{\prod_{m \in D(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta})}{\left[\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})\right]^{d_j}}, \tag{17.31}$$

where $D(t_j)$ denotes the set of subjects that die at time $t_j$ and $d_j$ denotes the number that die at time $t_j$. This approximation works well if the number of failures at time $t_j$ is small relative to the number at risk.

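Cox defined the partial likelihood function to be the joint product of $\Pr[T_j = t_j | j \in R(t_j)]$ over the $k$ ordered failure times. Then

$$L_p(\boldsymbol{\beta}) = \prod_{j=1}^{k} \frac{\prod_{m \in D(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta})}{\left[\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})\right]^{d_j}}. \tag{17.32}$$

Cox proposed estimation of $\boldsymbol{\beta}$ by maximizing the log partial likelihood function

$$\ln L_p = \sum_{j=1}^{k} \left[\sum_{m \in D(t_j)} \ln \phi(\mathbf{x}_m, \boldsymbol{\beta}) - d_j \ln\left(\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})\right)\right]. \tag{17.33}$$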


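Censored spells appear only in the second term of $\ln L_p$ because they do not contribute to observed deaths but, until they are censored, affect the size of the risk set. Equation (17.33) can be rewritten as

$$\ln L_p(\boldsymbol{\beta}) = \sum_{i=1}^{N} \delta_i \left[\ln \phi(\mathbf{x}_i, \boldsymbol{\beta}) - \ln\left(\sum_{l \in R(t_i)} \phi(\mathbf{x}_l, \boldsymbol{\beta})\right)\right], \tag{17.34}$$

where the indicator variable $\delta_i$ equals one for an uncensored observation and equals zero otherwise.

For the usual specification of $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, so that $\ln \phi(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$, the resulting first-order conditions become

$$\frac{\partial \ln L_p(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{N} \delta_i \left[\mathbf{x}_i - \mathbf{x}_i^*(\boldsymbol{\beta})\right] = \mathbf{0},$$

where $\mathbf{x}_i^*(\boldsymbol{\beta}) = \sum_{l \in R(t_i)} \mathbf{x}_l \exp(\mathbf{x}_l'\boldsymbol{\beta}) / \sum_{l \in R(t_i)} \exp(\mathbf{x}_l'\boldsymbol{\beta})$ is a weighted average of the regressors $\mathbf{x}_l$ for subjects at risk at failure time $t_i$.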
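The log partial likelihood (17.34) and its gradient are straightforward to compute directly. The following is a minimal sketch with $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, written for clarity rather than speed; ties are handled as in the approximation (17.31), since each tied failure uses the full risk set. The data and function names are hypothetical.

```python
import math

def log_partial_likelihood(beta, times, events, X):
    """Log partial likelihood (17.34) with phi = exp(x'beta)."""
    n = len(times)
    xb = [sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    total = 0.0
    for i in range(n):
        if events[i] == 1:  # censored spells enter only through risk sets
            risk = [l for l in range(n) if times[l] >= times[i]]
            total += xb[i] - math.log(sum(math.exp(xb[l]) for l in risk))
    return total

def score(beta, times, events, X):
    """Gradient: sum over uncensored i of x_i minus the risk-set weighted mean x_i*."""
    n, k = len(times), len(beta)
    xb = [sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    g = [0.0] * k
    for i in range(n):
        if events[i] == 1:
            risk = [l for l in range(n) if times[l] >= times[i]]
            denom = sum(math.exp(xb[l]) for l in risk)
            for c in range(k):
                xstar = sum(X[l][c] * math.exp(xb[l]) for l in risk) / denom
                g[c] += X[i][c] - xstar
    return g

# Hypothetical data: six spells, one regressor, events = 1 if the spell is completed.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 1, 0]
X = [[0.5], [1.2], [-0.3], [0.8], [-1.0], [0.1]]
ll_at_zero = log_partial_likelihood([0.0], times, events, X)
```

At $\boldsymbol{\beta} = \mathbf{0}$ every at-risk spell is equally likely to fail, so the log partial likelihood is minus the sum of the logs of the risk-set sizes at the observed failures. Setting the score to zero, for example by Newton-Raphson, gives the partial likelihood estimate.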

The partial likelihood is a limited information likelihood, as the baseline hazard $\lambda_0(t)$ has dropped out, but is neither a conditional likelihood nor a marginal likelihood. Whether $L_p(\boldsymbol{\beta})$ is a valid likelihood function has given rise to much discussion in the statistics literature. It can be shown (Andersen et al., 1993) that even though $\ln L_p$ is not the full likelihood function, the estimator of $\boldsymbol{\beta}$ that maximizes $\ln L_p$ is consistent. See also Kalbfleisch and Prentice (2002, pp. 99-101) and Lancaster (1990, chapter 9).

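The Chapter 5 results on extremum estimation apply, with the simplification that $\mathbf{A}(\boldsymbol{\beta}) = -\mathbf{B}(\boldsymbol{\beta})$ similar to the ML case, so that

$$\hat{\boldsymbol{\beta}} \overset{a}{\sim} \mathcal{N}\left[\boldsymbol{\beta},\; \left(-\left.\frac{\partial^2 \ln L_p(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\hat{\boldsymbol{\beta}}}\right)^{-1}\right]. \tag{17.35}$$

The estimator is inefficient, though comparisons of the partial likelihood estimator with the MLE for fully parametric PH models such as the Weibull reveal relatively small efficiency loss.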


17.8.3.  Survivor  Function  for  the  Cox  PH  Model 


Many studies stop at estimation of $\boldsymbol{\beta}$, being content to measure the impact of changes in regressors on the baseline hazard using (17.28) or (17.29). Other studies are additionally interested in the shape of the baseline hazard function. For the PH model it is possible to obtain a nonparametric estimate of the baseline hazard or survivor function, once $\hat{\boldsymbol{\beta}}$ is obtained by maximizing the partial likelihood. The estimates are analogous to the Kaplan-Meier estimator of Section 17.5.1.

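We obtain the PH hazard function's associated survivor function

$$S(t|\mathbf{x}, \boldsymbol{\beta}) = S_0(t)^{\phi(\mathbf{x}, \boldsymbol{\beta})},$$

using $S(t|\mathbf{x}, \boldsymbol{\beta}) = \exp\left[-\int_0^t \lambda_0(s)\phi(\mathbf{x}, \boldsymbol{\beta})\,ds\right]$ and defining $S_0(t) = \exp\left[-\int_0^t \lambda_0(s)\,ds\right]$.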

Now assume a discrete time formulation with baseline hazard rate $1 - \alpha_j$ at discrete failure time $t_j$, $j = 1, \ldots, k$. Some considerable algebra given in the next section yields estimate $\hat{\alpha}_j$ that is the solution to

$$\sum_{l \in D(t_j)} \frac{\phi(\mathbf{x}_l, \hat{\boldsymbol{\beta}})}{1 - \hat{\alpha}_j^{\phi(\mathbf{x}_l, \hat{\boldsymbol{\beta}})}} = \sum_{m \in R(t_j)} \phi(\mathbf{x}_m, \hat{\boldsymbol{\beta}}), \quad j = 1, \ldots, k, \tag{17.36}$$

where $\hat{\boldsymbol{\beta}}$ is the partial likelihood estimator of $\boldsymbol{\beta}$, $D(t_j)$ denotes the subjects that die at time $t_j$, and $R(t_j)$ denotes the subjects at risk at time $t_j$. From the discussion of discrete time hazard in Section 17.3.3, the baseline survivor function $S_0(t) = \prod_{j | t_j < t} \alpha_j$, the cumulative product of the instantaneous conditional survival probabilities. The estimated baseline survival function is then

$$\hat{S}_0(t) = \prod_{j | t_j < t} \hat{\alpha}_j. \tag{17.37}$$

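If there are no regressors then $\hat{S}_0(t)$ reduces to the Kaplan-Meier estimator: normalize $\phi(\mathbf{x}_l, \boldsymbol{\beta}) = 1$ and the expression yields hazard rate $1 - \hat{\alpha}_j = d_j/r_j$, where $r_j$ is the number at risk at $t_j$. If there are regressors but no ties then (17.36) has a single term on the left-hand side and yields $1 - \hat{\alpha}_j^{\phi(\mathbf{x}_j, \hat{\boldsymbol{\beta}})} = \phi(\mathbf{x}_j, \hat{\boldsymbol{\beta}}) / \sum_{m \in R(t_j)} \phi(\mathbf{x}_m, \hat{\boldsymbol{\beta}})$, which can be solved explicitly for $\hat{\alpha}_j$.

The survivor function for individuals with regressors $\mathbf{x} = \mathbf{x}^*$ can be estimated using

$$\hat{S}(t|\mathbf{x}^*, \hat{\boldsymbol{\beta}}) = \hat{S}_0(t)^{\phi(\mathbf{x}^*, \hat{\boldsymbol{\beta}})}.$$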
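Equation (17.36) generally has no closed-form solution when there are ties, but its left-hand side is increasing in $\alpha_j$, so a one-dimensional search suffices. The sketch below uses bisection; the function names and data are hypothetical, and `phi` holds the values $\phi(\mathbf{x}_l, \hat{\boldsymbol{\beta}})$, taken as given.

```python
def solve_alpha(phi_dead, phi_risk_sum, iterations=200):
    """Solve sum_{l in D} phi_l / (1 - alpha**phi_l) = sum_{m in R} phi_m by bisection."""
    def gap(alpha):
        return sum(p / (1.0 - alpha ** p) for p in phi_dead) - phi_risk_sum
    lo, hi = 0.0, 1.0 - 1e-12
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:   # left-hand side too large: alpha must be smaller
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def baseline_alphas(times, events, phi):
    """Return {t_j: alpha_j} over the distinct observed failure times."""
    out = {}
    for tj in sorted({t for t, e in zip(times, events) if e == 1}):
        dead = [p for t, e, p in zip(times, events, phi) if e == 1 and t == tj]
        risk_sum = sum(p for t, p in zip(times, phi) if t >= tj)
        out[tj] = solve_alpha(dead, risk_sum)
    return out

# Hypothetical data with a tie at t = 3; phi = 1 for all spells means no regressors.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 1, 0, 1, 0]
km = baseline_alphas(times, events, [1.0] * 6)
```

With $\phi = 1$ throughout, the solution is $\hat{\alpha}_j = 1 - d_j/r_j$, the Kaplan-Meier conditional survival probability, which provides a simple check on the routine.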


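Additive shifts of the regressors, such as subtracting the sample mean, do not change the estimates of $\boldsymbol{\beta}$, but they do change the baseline hazard function. For example,

$$
\begin{aligned}
\lambda(t|\mathbf{x}, \boldsymbol{\beta}) &= \lambda_0(t)\exp(\mathbf{x}'\boldsymbol{\beta}) \\
&= \lambda_0(t)\exp(\bar{\mathbf{x}}'\boldsymbol{\beta})\exp((\mathbf{x} - \bar{\mathbf{x}})'\boldsymbol{\beta}) \\
&= \lambda_0^*(t)\exp((\mathbf{x} - \bar{\mathbf{x}})'\boldsymbol{\beta}),
\end{aligned}
$$

where the new baseline hazard is $\lambda_0^*(t) = \lambda_0(t)\exp(\bar{\mathbf{x}}'\boldsymbol{\beta})$. Hence subtracting the sample mean from each regressor will change the baseline hazard, and care is needed in interpretation of the baseline hazard or survivor function.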

Also,  although  the  estimated  baseline  hazard  is  useful  for  computing  and  comparing 
hazard  rates  for  specific  groups  of  individuals,  it  may  have  a  very  choppy  appearance, 
so  some  smoothing  may  be  applied  for  ease  of  interpretation. 


17.8.4.  Derivation  for  the  Survivor  Function 

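We obtain the estimating equations for $\alpha_j$ given in (17.36), following Kalbfleisch and Prentice (2002, pp. 114-118).

A subject with duration time $t_j$ has likelihood contribution equal to the probability of survival to time $t_j$ less the probability of survival to time $t_{j+1}$. This is

$$
\begin{aligned}
S(t_j|\mathbf{x}, \boldsymbol{\beta}) - S(t_{j+1}|\mathbf{x}, \boldsymbol{\beta}) &= S_0(t_j)^{\phi(\mathbf{x}, \boldsymbol{\beta})} - S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} \\
&= \left(\alpha_j^{-1} S_0(t_{j+1})\right)^{\phi(\mathbf{x}, \boldsymbol{\beta})} - S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} \\
&= \left(\alpha_j^{-\phi(\mathbf{x}, \boldsymbol{\beta})} - 1\right) S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})},
\end{aligned}
$$

using $S_0(t_{j+1}) = \prod_{l=1}^{j} \alpha_l = \alpha_j S_0(t_j)$.

For those subjects that are censored at time $t_j$ the likelihood contribution is the probability of survival $t > t_j$, or $S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})}$. So subjects that either die or are censored in $[t_j, t_{j+1})$ contribute probability $S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} = \prod_{l=1}^{j} \alpha_l^{\phi(\mathbf{x}, \boldsymbol{\beta})}$, with an additional multiplier $\left(\alpha_j^{-\phi(\mathbf{x}, \boldsymbol{\beta})} - 1\right)$ for subjects that die. Then over all failure times the likelihood is

$$L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{j=1}^{k} \left[\prod_{l \in D(t_j)} \left(\alpha_j^{-\phi(\mathbf{x}_l, \boldsymbol{\beta})} - 1\right) \prod_{m \in R(t_j)} \alpha_j^{\phi(\mathbf{x}_m, \boldsymbol{\beta})}\right].$$

The log-likelihood is

$$\ln L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{j=1}^{k} \left[\sum_{l \in D(t_j)} \ln\left(\alpha_j^{-\phi(\mathbf{x}_l, \boldsymbol{\beta})} - 1\right) + \sum_{m \in R(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta}) \ln \alpha_j\right].$$

Then $\partial \ln L(\boldsymbol{\alpha}, \boldsymbol{\beta})/\partial \alpha_j = 0$ can be re-expressed as (17.36).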


17.9.  Time-Varying  Regressors 

The preceding results have been restricted to models where regressors are variables such as gender that vary across individuals but for a given individual do not vary over time. This is usual in other standard cross-section models such as logit and Tobit models. For survival data, however, individuals may be observed at several stages during a spell and relevant regressors may take different values over the spell. For example, in a medical survival study dosage levels of a medication may vary over time for a given individual. During an unemployment spell the rate of unemployment benefits may change, perhaps in a discrete manner. During a job search the marital status of a person may change.

Time-varying covariates pose two kinds of problems. First, it is clearly a misspecification to treat a time-varying covariate as a fixed variable. The entire history of the covariate over the spell may be relevant, a consideration that may require us to incorporate lagged values of some regressors as determinants of the hazard rate. Second, a time-varying covariate may exhibit feedback and hence may not be strictly exogenous as is often assumed in a duration model. For example, the duration of an unemployment spell may depend on the job search strategy of an individual, but the latter may change as the duration of unemployment lengthens. A second example is that the dosage level of the treatment may be varied in response to the deteriorating or improving condition of the patient. Deterministic time variation is easier to handle and hence standard analysis considers only the first of these issues, requiring the assumption that the covariates are weakly exogenous; that is, whatever the process, stochastic or deterministic, that underlies the time variation, we do not need to take account of the parameters of that process in estimating the hazard model under consideration. Some authors (e.g., Kalbfleisch and Prentice, 2002, pp. 196-200) refer to such time variation as external. Endogenous time-varying covariates are then called internal.

One  rather  simple  solution,  especially  if  the  software  cannot  handle  time-varying 
covariates,  is  to  replace  the  time-varying  covariate  by  its  average  value  during  the 
spell.  Good  software,  however,  allows  greater  flexibility. 

Consider an individual spell of (say) unemployment that lasts from the origin to time $T$, at which time a transition to employment is observed. Let $0 < t_1 < t_2 < t_3 < T$, where $t_1$, $t_2$, and $t_3$ are intermediate points within the spell. Suppose that there are two covariates $x_1$ and $x_2(t)$ that are, respectively, time-invariant and time-varying. For simplicity assume that $x_1$ is binary and $x_2$ takes the values $x_2(t_1)$, $x_2(t_2)$, and $x_2(t_3)$ in a step fashion in the intervals $[0, t_1)$, $[t_1, t_2)$, and $[t_2, T)$, respectively. Also assume that the time-varying regressor is exogenous and/or that the pattern of time variation is deterministic. Then for this particular spell the data can be written as a three-line record, rather than a one-line record, as follows:


Observation   Duration   $x_1$   $x_2(t)$     Censoring Indicator
1             $t_1$      1       $x_2(t_1)$   0
1             $t_2$      1       $x_2(t_2)$   0
1             $T$        1       $x_2(T)$     1

The interpretation of this information is that we can split the total observed duration into three segments. During the first and the second segment the covariate values are $(1, x_2(t_1))$ and $(1, x_2(t_2))$, respectively, and no transition is observed (hence the censoring indicator is 0), and then in the third segment the covariate values are $(1, x_2(T))$ and a transition is observed. This is akin to having three observations, in two of which the duration is censored and in the third of which the duration is complete.
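The construction of such multiline records is mechanical. The sketch below splits one spell at the times at which a step covariate changes. The field names and numerical values are hypothetical, and the `start` field is our addition: it is not in the layout shown above, although software that accepts multiple records per spell commonly needs the start of each segment as well as its end.

```python
def split_spell(spell_id, duration, failed, x1, change_times, x2_values):
    """Split one spell into records at the times where the step covariate x2 changes.

    change_times: increasing times strictly inside (0, duration)
    x2_values:    value of x2 on each resulting segment (one more than change_times)
    """
    starts = [0.0] + list(change_times)
    ends = list(change_times) + [duration]
    records = []
    for start, end, x2 in zip(starts, ends, x2_values):
        is_last = end == duration
        records.append({
            "id": spell_id,
            "start": start,       # hypothetical extra field, see lead-in
            "duration": end,      # elapsed duration at the end of the segment
            "x1": x1,
            "x2": x2,
            # 1 only on the final segment of a completed spell, 0 otherwise
            "indicator": int(failed and is_last),
        })
    return records

# Hypothetical spell: completed at T = 20, with x2 changing at t = 5 and t = 12.
rows = split_spell(1, 20.0, True, 1, [5.0, 12.0], [0.3, 0.5, 0.4])
```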




Suppose now that both the current and one lagged value of $x_2(t)$ are thought to be appropriate covariates. That is, the hazard rate at a point in time may depend on changes in a covariate earlier in the spell. Then the data array can be written as follows:


Observation   Duration   $x_1$   $x_2(t)$     $x_2(t-1)$   Censoring Indicator
1             $t_1$      1       $x_2(t_1)$   0            0
1             $t_2$      1       $x_2(t_2)$   $x_2(t_1)$   0
1             $T$        1       $x_2(T)$     $x_2(t_2)$   1

Here we have assumed that the value of $x_2(t)$ prior to the commencement of the spell was zero. Notice that in both of these examples, the covariate $x_2(t)$ varies at discrete points in time.

Although  one  could  have  multiline  entries  in  a  data  set,  in  a  large  data  set  this 
is  potentially  tedious  and  confusing  if  the  software  ends  up  treating  the  entries  as 
different  observations.  Fortunately,  computer  software  can  usually  allow  the  user  to 
identify  a  time-varying  covariate  as  a  part  of  the  definition  of  the  regression  model. 
One  can  accommodate  step  functions  or  continuous  functions  in  terms  of  the  elapsed 
duration  of  the  spell. 


17.9.1.  Extended  Cox  Model 


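The fixed regressor analysis of the Cox model in Section 17.8 is readily extended to time-varying regressors.

In general the hazard function depends on the complete time path of regressors $\mathbf{x}(t)$, so that

$$\lambda(t|\mathbf{x}(t)) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T < t + \Delta t \,|\, \mathbf{x}(t), T \ge t]}{\Delta t}.$$

We consider the PH form

$$\lambda(t|\mathbf{x}(t)) = \lambda_0(t, \alpha)\,\phi(\mathbf{x}(t), \boldsymbol{\beta}),$$

where the restriction is made that only the current value $\mathbf{x}(t)$ of the covariate matters, rather than the entire history of $\mathbf{x}(t)$.

It is clear from Section 17.8.2 on the Cox partial likelihood approach that what matters at each failure time $t_j$ is the value of regressors $\mathbf{x}(t_j)$ for those observations in the risk set $R(t_j)$. Thus for the $l$th subject $\mathbf{x}_l$ is replaced by $\mathbf{x}_l(t_j)$. The partial likelihood has similar changes, and

$$\ln L_p = \sum_{j=1}^{k} \left[\sum_{m \in D(t_j)} \ln \phi(\mathbf{x}_m(t_j), \boldsymbol{\beta}) - d_j \ln\left(\sum_{l \in R(t_j)} \phi(\mathbf{x}_l(t_j), \boldsymbol{\beta})\right)\right].$$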

Note that the form of the data is more complicated now, as there may be multiple observations for each subject. For example, suppose time is in discrete integer values, there is only one regressor, and observation one has completed duration 25 and regressor $x_1$, which takes value 50 in $[0, 5]$, 100 in $[6, 15]$, and then 200 in $[16, 25]$. Further, suppose the first five ordered failure times are 3, 8, 13, 18, and 25. Then $x_1(t_1) = 50$, $x_1(t_2) = 100$, $x_1(t_3) = 100$, $x_1(t_4) = 200$, and $x_1(t_5) = 200$.
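The lookup in this example can be reproduced in a few lines. The sketch below encodes the step path of $x_1$ for observation one and evaluates it at each ordered failure time; the function name is ours.

```python
def x1_at(t):
    """Step path from the example: 50 on [0, 5], 100 on [6, 15], 200 on [16, 25]."""
    path = [(0, 5, 50), (6, 15, 100), (16, 25, 200)]  # integer-valued time
    for lo, hi, value in path:
        if lo <= t <= hi:
            return value
    raise ValueError("t is outside the observed spell")

failure_times = [3, 8, 13, 18, 25]
values = [x1_at(t) for t in failure_times]
print(values)  # [50, 100, 100, 200, 200]
```

In the partial likelihood, observation one therefore enters the risk set at each of the five failure times with a different regressor value, which is why one record per subject no longer suffices.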


17.10.  Discrete-Time  Proportional  Hazards 

Grouped  duration  models  are  more  appropriate  when  failure  times  are  observed  or 
recorded  at  aggregated  time  intervals  like  a  week  or  a  month. 

A  simple  method  is  to  form  a  panel  and  estimate  a  stacked  logit  or  probit  model  of 
the  probability  of  individual  failure  in  each  period,  with  separate  intercept  for  period. 
This is presented in Section 17.10.3. However, first we present the discrete-time variant
of  a  continuous-time  PH  model,  considered  by  several  authors  including  Kalbfleisch 
and  Prentice  (1980),  Fahrmeir  and  Tutz  (1994),  Kiefer  (1988),  and  Meyer  (1990).  Our 
exposition  follows  Blake,  Lunde,  and  Timmermann  (1999). 


17.10.1.  Discrete-Time  Proportional  Hazards 

For grouped data, with grouping points $t_a$, $a = 1, \ldots, A$, the discrete-time hazard function is defined by

$$\lambda^d(t_a|\mathbf{x}) = \Pr[t_{a-1} \le T < t_a \,|\, T \ge t_{a-1}, \mathbf{x}(t_{a-1})], \quad a = 1, \ldots, A.$$

Time-varying regressors are permitted. The associated discrete-time survivor function is

$$S^d(t_a|\mathbf{x}) = \Pr[T \ge t_{a-1}|\mathbf{x}] = \prod_{s=1}^{a-1} \left(1 - \lambda^d(t_s|\mathbf{x}(t_s))\right).$$


We first obtain the general relationship between the discrete- and continuous-time hazards. The discrete-time hazard is the probability of failure in $[t_{a-1}, t_a)$ divided by the probability of surviving to at least time $t_{a-1}$, so can be rewritten as

$$\lambda^d(t_a|\mathbf{x}) = \frac{S(t_{a-1}|\mathbf{x}) - S(t_a|\mathbf{x})}{S(t_{a-1}|\mathbf{x})}, \tag{17.38}$$

where $S(t|\mathbf{x})$ is the survivor function. In the continuous case $S(t|\mathbf{x}) = \exp\left(-\int_0^t \lambda(s)\,ds\right)$, and after some algebra (17.38) becomes

$$\lambda^d(t_a|\mathbf{x}) = 1 - \exp\left(-\int_{t_{a-1}}^{t_a} \lambda(s)\,ds\right). \tag{17.39}$$

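Now specialize to the discrete-time hazard associated with the continuous PH model

$$\lambda(t) = \lambda_0(t)\exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right),$$

for $t$ in $[t_{a-1}, t_a)$. Note that the regressors are constant within the interval but can vary across intervals, and $\lambda_0(t)$ can vary within the interval. Then (17.39) becomes

$$
\begin{aligned}
\lambda^d(t_a|\mathbf{x}) &= 1 - \exp\left(-\exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right) \times \int_{t_{a-1}}^{t_a} \lambda_0(s)\,ds\right) \\
&= 1 - \exp\left(-\Lambda_{0a}\exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right)\right) \\
&= 1 - \exp\left(-\exp\left(\ln \Lambda_{0a} + \mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right)\right),
\end{aligned}
\tag{17.40}
$$

where $\Lambda_{0a} = \int_{t_{a-1}}^{t_a} \lambda_0(s)\,ds$. The associated discrete-time survivor function is

$$S^d(t_a|\mathbf{x}) = \prod_{s=1}^{a-1} \exp\left(-\exp\left(\ln \Lambda_{0s} + \mathbf{x}(t_{s-1})'\boldsymbol{\beta}\right)\right). \tag{17.41}$$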
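The equivalence of (17.38) and (17.40) can be confirmed numerically for a specific baseline. The sketch below assumes a Weibull baseline with $\Lambda_0(t) = t^{\alpha}$, so that $\Lambda_{0a} = t_a^{\alpha} - t_{a-1}^{\alpha}$, and regressors that are constant over the spell; the numerical values and function names are hypothetical.

```python
import math

def discrete_hazard_cloglog(t_prev, t_now, alpha, xb):
    """Last line of (17.40), with Lambda_0a = t_now**alpha - t_prev**alpha."""
    log_Lambda_0a = math.log(t_now ** alpha - t_prev ** alpha)
    return 1.0 - math.exp(-math.exp(log_Lambda_0a + xb))

def discrete_hazard_from_survivor(t_prev, t_now, alpha, xb):
    """(17.38), using the continuous-time survivor S(t|x) = exp(-t**alpha * exp(xb))."""
    def S(t):
        return math.exp(-(t ** alpha) * math.exp(xb))
    return (S(t_prev) - S(t_now)) / S(t_prev)

# Hypothetical interval [2, 3), Weibull shape 1.3, and x'beta = -1.
h1 = discrete_hazard_cloglog(2.0, 3.0, 1.3, -1.0)
h2 = discrete_hazard_from_survivor(2.0, 3.0, 1.3, -1.0)
```

The last line of (17.40) has the complementary log-log form, so that $\ln(-\ln(1 - \lambda^d))$ is linear in the regressors with interval-specific intercept $\ln \Lambda_{0a}$.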

The density for the $i$th subject is the product of the survivor function in each period that the subject survives times the hazard at the time of failure. It follows from (17.40) and (17.41) that the likelihood is

$$
\begin{aligned}
L(\boldsymbol{\beta}, \Lambda_{01}, \ldots, \Lambda_{0A}) = \prod_{i=1}^{N} \Bigg[ &\prod_{s=1}^{a_i - 1} \exp\left(-\exp\left(\ln \Lambda_{0s} + \mathbf{x}_i(t_{s-1})'\boldsymbol{\beta}\right)\right) \\
&\times \left(1 - \exp\left(-\exp\left(\ln \Lambda_{0a_i} + \mathbf{x}_i(t_{a_i - 1})'\boldsymbol{\beta}\right)\right)\right) \Bigg],
\end{aligned}
\tag{17.42}
$$

where censoring is ignored for simplicity and failure is assumed to occur at time $t_{a_i}$ for the $i$th observation. At least one failure is assumed to occur in each interval $[t_{a-1}, t_a)$.

The MLE maximizes (17.42) with respect to $\boldsymbol{\beta}$ and $\Lambda_{01}, \ldots, \Lambda_{0A}$. In a special case partial likelihood is asymptotically equivalent to the MLE, though in general they differ. More parsimonious models place some structure on the $\Lambda_{01}, \ldots, \Lambda_{0A}$, such as a polynomial in time. Even more structure is placed by a fully parametric model such as the Weibull, which sets $\Lambda_{0a} = \int_{t_{a-1}}^{t_a} \alpha s^{\alpha-1}\,ds$.


17.10.2.  Han  and  Hausman  Approach 

Han and Hausman (1990) suggested a flexible approach to recovering the baseline hazard that is relatively easy to implement and that predates the work of Blake et al. (1999) but has similarities with the work of Meyer (1990) and Sueyoshi (1992). It allows for considerable flexibility in the specification of the baseline hazard, $\lambda_0(t)$, while maintaining a parametric form (e.g., $\exp(\mathbf{x}'\boldsymbol{\beta})$) for the function of covariates. It also has the merit of explicitly dealing with discrete duration data and of providing a framework that can more easily accommodate features of discrete data such as tied observations and unobserved heterogeneity. Tied observations can be a major problem with discrete data; for example, with unemployment durations the termination of many unemployment spells is likely to coincide with the end of the period of unemployment benefits (usually 26 weeks in the United States).

The starting point is the hazard rate for observation $i$, $\lambda_i(\tau)$, denoting the conditional probability that a spell terminates in the interval $(\tau, \tau + \Delta)$, written in the PH form:

$$\lambda_i(\tau) = \lambda_0(\tau)\exp(-\mathbf{x}_i'\boldsymbol{\beta}),$$

where $\lambda_0(\tau)$ denotes the baseline hazard. Then (as shown in (17.20)) taking logs after integration and then rearranging yields

$$\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta} = \varepsilon_i, \tag{17.43}$$

where $\Lambda_0(t) = \ln \int_0^t \lambda_0(\tau)\,d\tau$ denotes the log of the integrated baseline hazard, and $\varepsilon_i = \ln \int_0^t \lambda_i(\tau)\,d\tau$. Then the probability is given by

$$\Pr[\text{failure in period } t] = \int_{\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}}^{\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}} f(\varepsilon)\,d\varepsilon.$$

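Let $y_{it} = 1$ if the $i$th person experiences failure in period $t$, and $y_{it} = 0$ otherwise. Then the joint log-likelihood of $N$ observations is given by

$$\ln L(\boldsymbol{\beta}, \Lambda_0(1), \ldots, \Lambda_0(T)) = \sum_{i=1}^{N} \sum_{t=1}^{T} y_{it} \ln\left[\int_{\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}}^{\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}} f(\varepsilon)\,d\varepsilon\right], \tag{17.44}$$

and the baseline hazard parameters $(\Lambda_0(1), \ldots, \Lambda_0(T))$ are estimated along with $\boldsymbol{\beta}$ in a flexible manner (i.e., without imposing a specific functional form).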

The integral in the log-likelihood is of course the difference in the cdf evaluated at the limits $\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}$ and $\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}$. The precise form of this expression depends on the functional form of the cdf. If the random $\varepsilon_i$ are assumed to be standard normal distributed, the log-likelihood takes the ordered probit form; under the assumption of extreme value distribution the log-likelihood takes the ordered logit form. To be specific, under normality the integral in the $i$th term is of the form

$$\Pr[\Lambda_0(t) < \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i < \Lambda_0(t+1)] = \Phi(\Lambda_0(t+1) - \mathbf{x}_i'\boldsymbol{\beta}) - \Phi(\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}).$$


In contrast to the partial likelihood approach, which treats the baseline hazard as a nuisance function and eliminates it, the approach of Han and Hausman (1990) estimates all the unknown parameters simultaneously at a modest computational cost. Their Monte Carlo results show that the method is flexible and can approximate an arbitrary hazard function well, eliminating the need for strong functional form assumptions.
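
As a rough illustration of the ordered probit form of (17.44), the following Python sketch evaluates the log-likelihood for completed spells. The function names, the list-based data layout, the numerical inputs, and the convention that the log integrated baseline hazard at period 0 equals minus infinity are assumptions made for this sketch; it is not Han and Hausman's implementation, and censoring is ignored, as in (17.44).

```python
import math

def norm_cdf(z):
    """Standard normal cdf (math.erf handles +/- infinity)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def period_prob(xb, cutpoints, t):
    """Pr[failure in period t] = Phi(L0(t) - x'b) - Phi(L0(t-1) - x'b).

    cutpoints[t-1] holds Lambda_0(t) for t = 1..T; Lambda_0(0) is taken
    to be -infinity (an assumption of this sketch).
    """
    upper = cutpoints[t - 1] - xb
    lower = cutpoints[t - 2] - xb if t > 1 else float("-inf")
    return norm_cdf(upper) - norm_cdf(lower)

def loglik(beta, cutpoints, X, fail_period):
    """Sum over spells of the log probability of the observed failure period."""
    total = 0.0
    for x_i, t_i in zip(X, fail_period):
        xb = sum(b * x for b, x in zip(beta, x_i))
        total += math.log(period_prob(xb, cutpoints, t_i))
    return total

# Hypothetical inputs: three periods, one regressor, four completed spells.
cutpoints = [-1.0, 0.0, 1.2]        # increasing values of Lambda_0(1), ..., Lambda_0(3)
X = [[0.5], [-0.3], [1.0], [0.0]]
fail_period = [1, 2, 3, 2]
print(loglik([0.4], cutpoints, X, fail_period))
```

In estimation the cutpoints would be treated as free parameters, constrained to be increasing, and maximized jointly with the regression coefficients.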


17.10.3.  Discrete-Time  Binary  Choice 

An  alternative  approach  for  discrete  duration  data  is  to  use  a  binary  choice  model  for 
transitions,  since  in  each  discrete  time  interval  two  outcomes  are  possible  -  the  spell 
either  ends  or  it  does  not. 

A  general  formulation  of  a  discrete-time  transition  model  is 

$$\Pr[t_{a-1} \le T < t_a \mid T \ge t_{a-1}, x] = F(\lambda_a + x'(t_{a-1})\beta), \quad a = 1, \ldots, A. \tag{17.45}$$

This specification restricts the coefficients of regressors to be constant over time, whereas the intercept $\lambda_a$, $a = 1, \ldots, A$, can vary over time. The obvious choices of the function $F$ are the standard normal cdf or the logistic cdf. Then the parameters $\lambda_a$ and $\beta$ can be estimated by a stacked logit or stacked probit model in which a separate intercept is permitted for each duration interval. This method is very appealing because of its simplicity.

The resulting likelihood function is

$$L(\beta, \lambda_1, \ldots, \lambda_A) = \prod_{i=1}^{N}\left[\prod_{s=1}^{a_i - 1}\left(1 - F(\lambda_s + x_i'(t_{s-1})\beta)\right)\right] \times F(\lambda_{a_i} + x_i'(t_{a_i - 1})\beta).$$


This is similar to (17.42), the log-likelihood for the discrete-time PH model, aside from the choice of function $F$. The hazard (17.40) is the extreme value cdf evaluated at $\ln\lambda_{0a} + x(t_{a-1})'\beta$, so (17.40) yields the complementary log-log binary choice model (see Table 14.3) rather than the more commonly used logit or probit model.
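
The stacking idea can be sketched in Python as follows. Each completed spell is expanded into one row per interval at risk, with a binary indicator equal to one only in the exit interval, and the binary choice log-likelihood is summed over the stacked rows. The function names and data are hypothetical, regressors are held fixed within a spell for simplicity, and censored spells are not handled.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def cloglog_cdf(z):
    """Extreme value cdf, giving the complementary log-log model."""
    return 1.0 - math.exp(-math.exp(z))

def person_period_rows(exit_intervals):
    """Expand spell i ending in interval a_i into rows (i, s, y),
    s = 1..a_i, with y = 1 only in the exit interval."""
    rows = []
    for i, a_i in enumerate(exit_intervals):
        for s in range(1, a_i + 1):
            rows.append((i, s, 1 if s == a_i else 0))
    return rows

def stacked_loglik(lam, beta, X, exit_intervals, F=logistic_cdf):
    """Binary choice log-likelihood on the stacked person-period rows.

    lam[s-1] is the intercept for interval s; X[i] is the regressor
    vector of spell i (time-invariant in this sketch).
    """
    total = 0.0
    for i, s, y in person_period_rows(exit_intervals):
        xb = sum(b * x for b, x in zip(beta, X[i]))
        p = F(lam[s - 1] + xb)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Hypothetical data: three spells ending in intervals 2, 1, and 3.
lam = [-1.0, -0.8, -0.5]
X = [[1.0], [0.0], [1.0]]
print(person_period_rows([2, 1, 3]))
print(stacked_loglik(lam, [0.3], X, [2, 1, 3]))
print(stacked_loglik(lam, [0.3], X, [2, 1, 3], F=cloglog_cdf))
```

Passing a different cdf for `F` switches between the stacked logit and the complementary log-log form without changing the data expansion.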


17.11.  Duration  Example:  Unemployment  Duration 

The following empirical application uses the data of McCall (1996), generously provided to us by the author Brian McCall. The data set is derived from the January Current Population Survey’s Displaced Workers Supplements (DWS) for the years 1986, 1988, 1990, and 1992. We refer to the duration measure (spell) in this example as unemployment duration, though more accurately it represents joblessness duration since DWS does not provide information as to whether a person is looking for a job or not.

For this application, information on the part-time or full-time status of the first postdisplacement job is required. To determine whether the first postdisplacement job was part-time or full-time, the following method is adopted. The first postdisplacement job is designated as part-time if a subject was still in that job at the time of the survey and if the subject was working less than 35 hours per week in that job in the previous week.

Table  17.6  defines  the  key  economic  covariates  used  to  explain  joblessness  duration. 
The  number  of  covariates  in  the  models  estimated  is  quite  large,  but  in  the  interest  of 
brevity  only  a  subset  is  listed.  McCall  (1996)  provides  a  fuller  description. 


Table 17.6. Unemployment Duration: Description of Variables

Variable Name   Variable Label                                       Mean
spell           periods jobless: two-week interval                   6.248
CENSOR1         1 if reemployed at full-time job                     0.321
CENSOR2         1 if reemployed at part-time job                     0.102
CENSOR3         1 if reemployed but left job: pt-ft status unknown   0.172
CENSOR4         1 if still jobless                                   0.375
UI              1 if filed UI claim                                  0.553
RR              eligible replacement rate                            0.454
DR              eligible disregard rate                              0.109
TENURE          tenure years in lost job                             4.114
LOGWAGE         log weekly earnings                                  5.693


Figure 17.3: Unemployment duration: Kaplan-Meier estimate of survival function (panel title: Overall Survival Function Estimate; horizontal axis: unemployment duration in 2-week intervals). U.S. data from 1986-92 on 3343 spells, some incomplete.


Unemployment durations have been measured in two-week intervals. Four binary variables (CENSOR1, CENSOR2, CENSOR3, and CENSOR4) have been introduced to indicate the status of the first postdisplacement job. For the analysis in this chapter we use CENSOR1. Thus a spell is complete if the person is reemployed at a full-time job. Another indicator variable UI is used to denote whether the subject filed an unemployment claim or not. Replacement rate, which is the weekly benefit amount divided by the amount of weekly earnings in the lost job, is represented by the variable RR. “Disregard” is defined to be the threshold amount up to which recipients of unemployment insurance who accept part-time work can earn without any reduction in unemployment benefits. Disregard rate is the disregard divided by weekly earnings in the lost job. It is described by the variable DR in this example. The remaining variables are self-explanatory.

We  begin  with  a  descriptive  analysis  of  the  duration  data.  The  simplest  first  step  is  to 
plot  the  Kaplan-Meier  survival  curve,  which  is  shown  in  Figure  17.3  by  the  dark  line. 
The  lighter  lines  around  the  estimated  Kaplan-Meier  survival  curve  represent  95% 
confidence  intervals  developed  in  Section  17.5.2.  As  expected,  the  estimated  survival 
curve  declines  rapidly  at  first  and  then  slowly. 

As we see from Table 17.7, after the first period the survival probability is 0.91, indicating that roughly 9% of the sampled individuals have terminated their spell within the first two weeks of the joblessness spell.

In  Figure  17.4,  we  plot  the  survival  function  by  UI,  that  is,  by  whether  the  subject 
claims  unemployment  insurance  or  not.  Again,  as  one  can  expect,  it  shows  that  those 
who  claim  unemployment  insurance  are  more  likely  to  remain  unemployed  than  those 
who  do  not  claim  unemployment  insurance. 

The Nelson-Aalen cumulative hazard in Figure 17.5 shows little variation in the hazard rate, which translates into an approximately linear cumulative hazard. If the crude hazard rate varied a lot, the cumulative hazard would appear nonlinear.

Table 17.7. Unemployment Duration: Kaplan-Meier Survival and Nelson-Aalen Cumulated Hazard Functions

Time   Survivor Function   Cumulative Hazard
 1     0.9121              0.0879
 2     0.8541              0.1514
 3     0.8103              0.2027
 4     0.7864              0.2322
 5     0.7376              0.2943
12     0.5974              0.5005
13     0.5680              0.5496
14     0.5270              0.6219
26     0.3651              0.9809
27     0.3098              1.1325
28     0.3098              1.1325

The  cumulated  hazard  functions  by  UI  recipiency,  shown  in  Figure  17.6,  exhibit 
the  expected  pattern:  The  hazard  increases  at  a  higher  rate  for  those  who  do  not  claim 
unemployment  insurance  than  it  does  for  those  who  do. 
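
For readers who wish to see the mechanics, the Python sketch below computes the Kaplan-Meier survivor function and the Nelson-Aalen cumulative hazard from right-censored durations. The toy data and function names are hypothetical; the McCall data are not reproduced here. The second function cumulates the period hazards implied by a survivor column, under the assumption that the first five rows of Table 17.7 are consecutive periods.

```python
def km_na(durations, completed):
    """Return (time, survivor, cumulative hazard) at each distinct exit time.

    completed[i] = 1 if spell i ended in an observed exit, 0 if right-censored.
    """
    exit_times = sorted({d for d, c in zip(durations, completed) if c})
    surv, cumhaz, out = 1.0, 0.0, []
    for t in exit_times:
        at_risk = sum(1 for d in durations if d >= t)
        exits = sum(1 for d, c in zip(durations, completed) if d == t and c)
        surv *= 1.0 - exits / at_risk      # Kaplan-Meier product step
        cumhaz += exits / at_risk          # Nelson-Aalen sum step
        out.append((t, surv, cumhaz))
    return out

def cumhaz_from_survivor(survivor):
    """Cumulate the implied period hazards 1 - S(t)/S(t-1), with S(0) = 1."""
    prev, total, out = 1.0, 0.0, []
    for s in survivor:
        total += 1.0 - s / prev
        out.append(total)
        prev = s
    return out

# Hypothetical toy sample of seven spells.
toy_durations = [1, 2, 2, 3, 4, 4, 5]
toy_completed = [1, 1, 0, 1, 0, 1, 0]
for row in km_na(toy_durations, toy_completed):
    print(row)

# Survivor column of Table 17.7 for periods 1 to 5.
table_survivor = [0.9121, 0.8541, 0.8103, 0.7864, 0.7376]
print([round(h, 4) for h in cumhaz_from_survivor(table_survivor)])
```

The values recovered from the survivor column agree with the tabulated cumulative hazard to within about 0.0001, the size of the rounding error in the four-digit inputs.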

Next we consider four parametric regression models using the covariates UI, RR, DR, and LOGWAGE and the interaction terms RRUI and DRUI. The four models are exponential, Weibull, Gompertz, and Cox PH. Writing the hazard function as

$$\lambda(t|x) = \lambda_0(t, \alpha)\phi(x, \beta) = \lambda_0(t, \alpha)\exp(x'\beta),$$


Figure 17.4: Unemployment duration: estimated survival functions by whether or not subjects receive unemployment insurance. Same data as Figure 17.3.

Figure 17.5: Unemployment duration: Nelson-Aalen estimate of cumulative hazard function (panel title: Overall Cumulative Hazard Estimate; horizontal axis: unemployment duration in 2-week intervals). Same data as Figure 17.3.

recall that exponential hazard assumes $\lambda_0(t, \alpha) = \text{constant} = \exp(a)$ for some constant $a$, the Weibull model assumes $\lambda_0(t, \alpha) = \exp(a)\alpha t^{\alpha-1}$ (i.e., monotonic hazards), Gompertz assumes $\lambda_0(t, \alpha) = \exp(a)\exp(\gamma t)$, and the Cox PH model has no intercept and makes no assumption about the shape of the baseline hazard. Recall also that the formulation here is of the proportional hazard type and can also be interpreted either as a parametric regression model or as an AFT model. In this parameterization of the likelihood function, the parameters $(\alpha, \beta)$ are estimated. These are given in Table 17.8 with the associated $t$-statistics. We also list the negative of the log-likelihood, but recall that for the Cox PH model it is the partial log-likelihood. Both exponential and Gompertz models fit equally well. The Weibull model provides the


Figure 17.6: Unemployment duration: estimated cumulative hazard functions by whether or not subjects receive unemployment insurance (panel title: Cumulative Hazard Estimates by UI Status; horizontal axis: unemployment duration in 2-week intervals). Same data as Figure 17.3.

Table 17.8. Unemployment Duration: Estimated Parameters from Four Parametric Models

            Exponential       Weibull           Gompertz          Cox PH
Var         coeff.      t     coeff.      t     coeff.      t     coeff.      t
RR           0.472   0.79      0.448   0.70      0.472   0.78      0.522   0.91
DR          -0.576  -0.75     -0.427  -0.53     -0.563  -0.74     -0.753  -1.04
UI          -1.425  -5.71     -1.496  -5.67     -1.428  -5.69     -1.317  -5.55
RRUI         0.966   0.92      1.105   1.57      0.969   1.58      0.882   1.52
DRUI        -0.199  -0.20     -0.299  -0.28     -0.211  -0.21     -0.095  -0.10
LOGWAGE      0.35    3.03      0.37    2.99      0.35    3.03      0.34    3.03
CONS        -4.079  -4.65     -4.358  -4.74     -4.097  -4.65        -       -
α                              1.129
-ln L       2700.7            2687.6            2700.6               -


best fit. As we see from Table 17.8, the fit of the Weibull model exhibits positive state dependence ($\alpha = 1.129 > 1$); that is, the probability of the spell terminating increases as the spell lengthens.
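
To see what a shape parameter of 1.129 implies, the short Python sketch below evaluates the Weibull baseline hazard at a few durations. It assumes that the CONS entry in the Weibull column of Table 17.8 plays the role of the intercept inside the exponential term of the baseline hazard and that all covariates are set to zero; the function name is hypothetical.

```python
import math

def weibull_baseline_hazard(t, a, alpha):
    """Baseline hazard exp(a) * alpha * t**(alpha - 1)."""
    return math.exp(a) * alpha * t ** (alpha - 1.0)

a, alpha = -4.358, 1.129        # CONS and alpha, Weibull column of Table 17.8
for t in (1, 6, 13, 26):        # durations in two-week periods
    print(t, round(weibull_baseline_hazard(t, a, alpha), 4))
```

Because the exponent on duration is 0.129, the hazard rises with duration but only slowly, which is consistent with the exponential model fitting almost as well.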

For all the models considered, only UI and LOGWAGE are significant whereas other covariates are not. The estimated coefficient of UI is negative for all models, implying that the joblessness spell of those who claim unemployment insurance terminates more slowly. There is little variation of the estimates of UI across different models: This estimate in Weibull and Gompertz models is approximately 5% and 0.2% higher in absolute value than that in the exponential model, whereas it is 8% lower in the Cox PH model. Similarly, the estimate of the coefficient of LOGWAGE is positive for all the models and exhibits very little variation across models.

Whereas in the econometric literature it is common to report the estimates of the $(\alpha, \beta)$ coefficients of the hazard function in the AFT metric, in the biostatistics literature a different parameterization is often used based on the PH metric. Note that the hazard ratio $\lambda(t|x)/\lambda_0(t, \alpha) = \phi(x, \beta) = \exp(x'\beta)$. For a categorical 0/1 scalar variable $x$, the impact of a change from 0 to 1 is given by $\exp(\beta) - 1$, which measures impact relative to the baseline hazard. Numerous packages give the users an option to estimate the model in either or both metrics. The relative merits of the two parameterizations are discussed in Cleves, Gould, and Gutierrez (2002).

Consider the exponential specification in Table 17.9 where the coefficients are exponentials of the corresponding ones in Table 17.8. Here UI has hazard ratio 0.241. This means that belonging to the category of subjects that claims unemployment insurance decreases the hazard by nearly 76% over the baseline hazard. Similarly, for Weibull, Gompertz, and Cox PH models, the hazard decreases by about 78%, 76%, and 73%, respectively.
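
The conversion from coefficients to hazard ratios is simple arithmetic, sketched below in Python using the UI coefficients reported in Table 17.8. The dictionary and function names are introduced only for this illustration.

```python
import math

# UI coefficients from Table 17.8.
ui_coef = {"Exponential": -1.425, "Weibull": -1.496,
           "Gompertz": -1.428, "Cox PH": -1.317}

def hazard_ratio(coef):
    """Hazard ratio for a 0/1 regressor: exp(beta)."""
    return math.exp(coef)

def percent_change(coef):
    """Percentage change in the hazard relative to baseline: 100*(exp(beta) - 1)."""
    return 100.0 * (math.exp(coef) - 1.0)

for model, b in ui_coef.items():
    print(f"{model:12s} ratio = {hazard_ratio(b):.3f}  change = {percent_change(b):.1f}%")
```

The ratios reproduce the UI row of Table 17.9 to three decimals, and the implied decreases are roughly 76%, 78%, 76%, and 73%.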

For this example, we have taken into account right-censoring and have ignored the role of unobserved heterogeneity. Hence the results obtained from the three models are qualitatively similar. However, the relatively few included variables with significant coefficients probably indicate that large unexplained variation (perhaps caused by unobserved heterogeneity) may be a serious problem. This issue is considered further in the next chapter.

Table 17.9. Unemployment Duration: Estimated Hazard Ratios from Four Parametric Models

            Exponential       Weibull           Gompertz          Cox PH
Var           β         t       β         t       β         t       β         t
RR           1.603   0.63      1.565   0.57      1.604   0.62      1.686   0.71
DR           0.562  -1.02      0.653  -0.66      0.570  -0.99      0.471  -1.55
UI           0.241 -12.65      0.224 -13.12      0.240 -12.65      0.268 -11.53
RRUI         2.626   1.01      2.760   0.99      2.635   1.01      2.416   1.01
DRUI         0.819  -0.22      0.742  -0.33      0.810  -0.23      0.909  -0.10
LOGWAGE      1.420   2.56      1.441             1.42    2.55      1.40    2.57
α                              1.129   0.08
-ln L       2700.7            2687.6            2700.6               -


17.12.  Practical  Considerations 

Most computer packages offer a good selection of computer programs for parametric survival analysis. Standard nonparametric Kaplan-Meier survival function estimates, with or without confidence intervals, with both numeric and graphic output are widely available. In some cases survival analysis modules are sufficiently detailed to warrant a special manual. For example, Allison (1995) offers a practical guide to survival analysis in the SAS system; Cleves et al. (2002) provide a tutorial style guide to survival analysis in STATA. Not only do these guides explain the mechanics of implementing particular program commands, but in many cases they provide insightful expositions of the subtleties arising from specific features of data, alternative parameterizations, and interpretation of results. A convenient way to learn about duration data analysis is by using the examples in econometrics or statistical packages such as LIMDEP, STATA, SAS, or S-Plus. The program manuals are themselves excellent sources of information for standard models.


17.13.  Bibliographic  Notes 

17.3-17.7  Kalbfleisch and Prentice (1980, 2002) is the classic statistical reference for survival analysis, with emphasis on the Cox model. Other useful sources include Lawless (1982) and Cox and Oakes (1984) and the considerable number of statistical texts on survival analysis that now exist. For a Bayesian treatment see Ibrahim, Chen, and Sinha (2001). Recent statistical work has increasingly emphasized the counting process approach, detailed in Fleming and Harrington (1991) and Andersen et al. (1993). These references are very challenging, especially the latter. Lancaster (1990) provides a thorough treatment of survival analysis, though the presentation is quite technical and the book is oriented more to the general topic of transitions and material presented in the subsequent two chapters. For social scientists, Allison’s (1984) excellent exposition, like that of Lancaster, covers much more than single-spell survival analysis. For practitioners in microeconomics the survey article by Kiefer (1988) is a good start.

17.8  Lancaster  (1990)  provides  a  thorough  discussion  of  the  partial  likelihood  approach. 

17.10  Meyer (1990), Han and Hausman (1990), and Blake et al. (1999) are helpful references on discrete hazard models. These articles generally allow for unobserved heterogeneity, a topic discussed in the next chapter.

17.11  Economics applications are cited in Kiefer (1988) and in Greene (2003). Good examples of parametric reduced-form type duration analysis are given by Lancaster (1979), Narendranathan, Nickell, and Stern (1985), Jaggia (1991c), and Gritz (1993). More recently the emphasis has shifted to computationally more complex structural duration models. Examples are found in Van den Berg (1990) and Ferrall (1997). Most applications of duration analysis are reduced-form models. Economists have proposed structural duration models; references include Lancaster (1990) and Van den Berg (2001). Van den Berg also provides an interesting discussion of the economic theoretical foundations of the PH model. Duration data can often be analyzed using different concepts of waiting time. Tunali and Pritchett (1997) use three alternatives: calendar-time, age, and duration.


- Exercises - 

17-1  (Adapted from Sapra, 1998) Show that the duration data model with Pareto density of the first kind $f(t) = \alpha k^{\alpha}/t^{\alpha+1}$, $\alpha > 0$, $t \ge k \ge 0$, is an accelerated failure time duration model but is not a proportional hazards model. [Hint: Show that $\ln t$ can be expressed as a linear regression in $k = \exp(x'\beta)$ with an additive homoskedastic error.]

17-2  (Based on Lancaster, 1979). For each of the following situations develop an appropriate expression for the joint likelihood of $N$ observations in terms of the duration density $f(t|x, \theta)$ and survivor function $S(t|x, \theta)$.

(a)  A sample of independent completed durations, $t_i$, $i = 1, \ldots, N$, is available.

(b)  The  sample  is  generated  as  follows.  Initially,  individuals  are  selected  from  a 
pool  of  unemployed  and  interviewed.  Subsequently,  they  are  reinterviewed 
after  h  periods.  Selected  individuals  have  been  unemployed  for  t  weeks 
on  selection.  Between  selection  and  interview  some  find  jobs,  and  others 
do  not.  For  those  who  have  jobs  the  time  of  termination  of  unemployment 
spells  is  known. 

(c)  The  situation  is  the  same  as  in  (b)  except  that  it  is  not  known  when  the 
unemployment  spell  ended. 

17-3  (a)  Using a 50% random sample of the McCall data set estimate the Kaplan-Meier nonparametric survival and integrated hazard function estimates by type of censoring, that is, by whether transition is to full-time or part-time employment. Does the survival function look significantly different for the two groups?

(b)  Ignoring the censoring variable for type of spell termination, estimate the hazard model for unemployment duration under the following parametric distributional assumptions: (i) exponential, (ii) Weibull, (iii) log-logistic, and (iv) Cox PH. Use the same covariates as in this chapter.

(c)  Compare models (i)-(iii) and discuss which one you think provides the best fit to the data. What does each model imply regarding the duration dependence (shape of the hazard function) of a spell of unemployment?


Mixture Models and Unobserved Heterogeneity


18.1.  Introduction 

There is a large statistical and econometric literature concerning the topic of unobserved heterogeneity. Observed heterogeneity refers to interindividual differences that are measured by regressors, and unobserved heterogeneity refers to all other differences. Both factors affect survival times. In the presence of unobserved heterogeneity even individuals with the same values of all covariates may have different hazards out of a given state. When unobserved heterogeneity is ignored, its impact is confounded with that of the baseline hazard.

To motivate further study consider a well-known empirical example. The aggregate hazard rate out of unemployment is known to be a declining function of the length of unemployment spell. If all individuals were identical then this would imply negative duration dependence, that is, a falling probability of escaping unemployment the longer an individual has remained unemployed. However, suppose that there are two types of individuals in the unemployed population, type F (fast), who have a constant hazard rate of 0.4, and type S (slow), whose constant hazard rate is 0.1. The population is a 50/50 mixture of the two types. Then for 100 type F people we observe 40 transitions in the first period, 24 transitions in the second period, and 14.4 in the third. For the type S, we observe 10, 9, and 8.1 transitions in the first, second, and third periods, respectively. Hence the aggregate proportion of transitions will be (40 + 10)/200 = 0.25, (24 + 9)/150 = 0.22, and (14.4 + 8.1)/117 = 0.192. This shows that the declining aggregate hazard is a consequence of aggregation across heterogeneous groups, which themselves have constant but different hazard rates. Accurate statements about duration dependence require that models incorporate unobserved heterogeneity.
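
The arithmetic of this example is easy to reproduce. The Python sketch below tracks the number of each type still at risk and reports the aggregate exit rate period by period; the function and argument names are hypothetical.

```python
def aggregate_hazards(n_by_type, hazard_by_type, periods):
    """Aggregate exit rate per period when each type has a constant hazard."""
    at_risk = dict(n_by_type)
    out = []
    for _ in range(periods):
        exits = {k: at_risk[k] * hazard_by_type[k] for k in at_risk}
        out.append(sum(exits.values()) / sum(at_risk.values()))
        for k in at_risk:
            at_risk[k] -= exits[k]       # survivors carried into the next period
    return out

rates = aggregate_hazards({"F": 100.0, "S": 100.0}, {"F": 0.4, "S": 0.1}, 3)
print([round(r, 3) for r in rates])      # [0.25, 0.22, 0.192]
```

Running the function for more periods shows the aggregate rate drifting down toward 0.1 as the fast type is progressively selected out of the pool at risk.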

In linear regression models there are no complications caused by unobserved heterogeneity if the heterogeneity is independent of regressors. In that case the conditional mean is unchanged, the unobserved heterogeneity is absorbed into the error term, and there is no omitted variables bias. In contrast, unobserved heterogeneity usually causes problems in duration models. In the simplest models, such as the exponential model, it is possible to specify multiplicative unobserved heterogeneity uncorrelated with regressors that leaves the conditional mean duration unchanged. However, even in this simple case the conditional hazard function does change, and it is the hazard that is modeled out of necessity, given the presence of censoring and given also, for example, the interest of policy makers in determining how exit rates from unemployment vary with length of unemployment spell.

The role of unobserved heterogeneity lies at the heart of numerous empirical puzzles and conundrums. Although our focus in this chapter is in the context of duration models, most of the issues are of more general relevance. The material and techniques used here are also relevant to all econometric models, since all econometric models omit some individual-specific unobservable variables from the model. Leading examples in other chapters include random parameters logit (Section 15.7), sample selection (Section 16.4), finite mixture for counts (Section 20.4) and fixed and random effects models for panel data (Chapters 21-23). These factors go under the collective heading of unobserved heterogeneity. In biostatistics the term frailty is also used. In actuarial studies (multiplicative) unobserved heterogeneity measures proportional increase or decrease in the hazard rate (“force of mortality”) operating on a given individual relative to that on an average individual. Individual-specific heterogeneity need not be time-invariant, but in cross section models it is convenient to assume it is.

It is important to consider the consequences of such an unavoidable misspecification. From ordinary linear multiple regression analysis it is known that such an omission in general can lead to an omitted variable bias. In duration models, which are nonlinear, the analysis of unobserved heterogeneity is more complex. Introduction of unobserved heterogeneity leads to an important class of models called mixture models, this being one of the many names for this class. The subject matter of this chapter concerns both the techniques for generating and analyzing mixture models and the substantive consequences of omitted heterogeneity.

Distinguishing between heterogeneity and true state dependence has been a longstanding issue that can be traced back in history to discussions concerning true and apparent contagion. Neyman has been credited with the early insight that longitudinal data may be essential to make this distinction empirically possible. When, however, only cross-section data are available, there is a tendency to rely heavily on strong parametric assumptions. The emphasis in the recent literature has been on freeing empirical analysis from such assumptions and on testing the validity of maintained model assumptions.

The first part of this chapter, Sections 18.2-18.4, deals with mixture models based on continuous distribution of heterogeneity. Section 18.5 presents models based on discrete heterogeneity. Section 18.6 considers relationships among different duration concepts from flow and stock data. Tests of misspecification and neglected heterogeneity are dealt with in Section 18.7. An empirical example in Section 18.8 illustrates several of the ideas developed in the chapter.


18.2.  Unobserved  Heterogeneity  and  Dispersion 

In this section we focus on unobserved heterogeneity in the exponential and Weibull models. We consider a form of multiplicative unobserved heterogeneity that, after being integrated out, leaves the conditional mean unchanged but does inflate the conditional variance and, more importantly, changes the conditional hazard function. The popular Weibull model with gamma distributed heterogeneity is also presented.


18.2.1.  Mixtures 

The simplest model to consider is the exponential duration model. In an exponential regression without heterogeneity the distribution of complete spells, $t_i$, is specified conditional on observable weakly exogenous covariates $x_i$. This is equivalent to specifying the conditional mean function as nonstochastic: $\mathrm{E}[T|x] = \exp(x'\beta)$. In mixture models we instead specify the distribution of $(t_i | x_i, \nu_i)$, where the additional $\nu_i$ denotes an unobserved heterogeneity term for observation $i$. Simply, individuals are assumed to differ randomly in a manner not fully accounted for by the observed covariates. The marginal distribution of $t_i$ is obtained by averaging with respect to $\nu_i$.

The precise functional form linking $t_i$ and $(x_i, \nu_i)$ must be specified. A commonly used functional form is the exponential mean with a multiplicative error. For example, consider the PH model with unobserved heterogeneity. From Section 17.8 we have the proportional hazards model, (17.25) and (17.26), which can be extended to include a multiplicative term $\nu$. That is,

$$\lambda(t|x, \nu) = \lambda_0(t)\exp(x'\beta)\nu, \quad \nu > 0,$$

and hence we can obtain an expression for integrated baseline hazard as follows:

$$\begin{aligned}
\lambda_0(t) &= \lambda(t|x, \nu)\exp(-x'\beta)\nu^{-1}, \\
\int_0^t \lambda_0(u)\,du &= \exp(-x'\beta)\nu^{-1}\int_0^t \lambda(u|x, \nu)\,du, \\
\ln\left[\int_0^t \lambda_0(u)\,du\right] &= -x'\beta - \ln\nu + \varepsilon,
\end{aligned} \tag{18.1}$$

where $\varepsilon = \ln\int_0^t \lambda(u|x, \nu)\,du$, and $\nu$ is assumed to be independent of the regressors and of censoring time. A common normalization restriction is $\mathrm{E}[\nu] = 1$. When $\nu > 1$, the hazard rate is greater than for the average subject; it is less than that for the average subject if $\nu < 1$. The independence assumption is strong and not necessarily realistic. The multiplicative heterogeneity assumption is also rather special, but it is mathematically convenient and more attractive than an additive error, which could violate nonnegativity of $t_i$. A standard approach involves postulating a distribution for $\nu_i$, and then deriving the marginal distribution of $t_i$.

Multiplicative heterogeneity has two important and related consequences. Not surprisingly, the variance of the mixture (conditional on the observable variables) exceeds the variance of the parent distribution (conditional on both the observables and heterogeneity). That is, the variance gets inflated. Consider the exponential mean case. Replace $\mu_i = \exp(x_i'\beta)$ by

$$\begin{aligned}
\mu_i^* &= \mathrm{E}[t_i|x_i, \nu_i] \\
&= \exp(x_i'\beta)\nu_i \\
&= \exp(x_i'\beta)\exp(\varepsilon_i) \\
&= \exp(\beta_0 + \varepsilon_i + x_{1i}'\beta_1),
\end{aligned} \tag{18.2}$$

where the unobserved heterogeneity term $\nu_i$ is redefined as $\exp(\varepsilon_i)$ in the third line, and the term $x_i'\beta$ is broken into the intercept and slope terms in the last line. The last line has an interpretation as a conditional mean with a randomly varying intercept $(\beta_0 + \varepsilon_i)$. It is usually assumed that the $\nu_i$ are iid, possibly with a known parametric distribution, and that they are independent of the $x_i$.

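Assume that $\nu_i$ is iid with $\mathrm{E}[\nu_i] = 1$ and $\mathrm{V}[\nu_i] = \sigma^2$. The assumption that $\mathrm{E}[\nu_i] = 1$ permits identification of the intercept. For the exponential density, the moments of $t_i$ can be derived as $\mathrm{E}[t_i|x_i, \nu_i] = \mu_i\nu_i$, and using the variance decomposition result of Section A.8,

$$\begin{aligned}
\mathrm{V}[t_i|x_i] &= \mathrm{V}_{\nu}\left[\mathrm{E}_{t|\nu,x}(t_i|\nu_i, x_i)\right] + \mathrm{E}_{\nu}\left[\mathrm{V}_{t|\nu,x}(t_i|\nu_i, x_i)\right] \\
&= \mu_i^2\,\mathrm{V}(\nu_i) + \mu_i^2\left(\mathrm{V}(\nu_i) + 1\right) \\
&= \mu_i^2\left[1 + 2\sigma^2\right] \\
&> \mu_i^2.
\end{aligned} \tag{18.3}$$

The unconditional variance is inflated by unobserved heterogeneity.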
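
The inflation factor can be checked exactly in a simple case. The Python sketch below assumes, purely for illustration, a two-point heterogeneity distribution with mean one, and uses the fact that an exponential variable with mean m has second moment 2m². The function name and numerical values are hypothetical.

```python
def mixture_moments(mu, nu_values, nu_probs):
    """Mean and variance of t when t | nu is exponential with mean mu * nu."""
    mean = sum(p * mu * v for v, p in zip(nu_values, nu_probs))
    # For an exponential with mean m, E[t^2] = 2 * m**2.
    second = sum(p * 2.0 * (mu * v) ** 2 for v, p in zip(nu_values, nu_probs))
    return mean, second - mean ** 2

mu = 3.0
nu_values, nu_probs = [0.5, 1.5], [0.5, 0.5]      # E[nu] = 1, V[nu] = 0.25
sigma2 = sum(p * (v - 1.0) ** 2 for v, p in zip(nu_values, nu_probs))
mean, var = mixture_moments(mu, nu_values, nu_probs)
print(mean, var, mu ** 2 * (1.0 + 2.0 * sigma2))
```

The mean stays at 3 while the variance rises from 9 to 13.5, matching the factor of one plus twice the heterogeneity variance in (18.3).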


18.2.2.  Choice  of  Heterogeneity  Distribution 


Consider how the distribution of $t$ is affected by heterogeneity. This requires us to look at the marginal distribution of $t_i$ by integrating out the heterogeneity term, $\nu$, from $S(t|x, \nu)$. A parametric distribution of $\nu$ is usually specified. What considerations apply to choosing this distribution?

To respect the property $\nu_i > 0$, we may specify a distribution with support on the positive line. Examples are gamma, inverse Gaussian, and log-normal.

The gamma density is

$$
g(v; \delta, k) = \frac{\delta^k v^{k-1} \exp(-\delta v)}{\Gamma(k)}, \qquad v > 0, \tag{18.4}
$$

which has $E[v] = k/\delta$ and $V[v] = k/\delta^2$. Normalization sets $k = \delta$, so that $E[v] = 1$ and $V[v] = 1/\delta$. The gamma assumption is mathematically convenient. It is also employed in a number of popular software packages for duration modeling.

The inverse-Gaussian density is

$$
g(v; \delta, \theta) = \delta \pi^{-1/2} \exp\bigl(2\delta\theta^{1/2}\bigr)\, v^{-3/2} \exp\bigl(-\theta v - \delta^2/v\bigr), \qquad v > 0, \tag{18.5}
$$

which has $E[v] = \delta\theta^{-1/2}$ and $V[v] = \delta\theta^{-3/2}/2$. Normalization $\theta = \delta^2$ yields $E[v] = 1$ and $V[v] = 1/(2\theta)$. Relative to the gamma the inverse-Gaussian distribution has more tail probability.

These will not necessarily produce an analytically tractable marginal distribution of $t$. As we will show, some combinations such as exponential and gamma, or Weibull and gamma, lead to closed-form marginals, whereas others do not. However, this consideration is one of mathematical and computational convenience only and hence is not necessarily compelling on its own. Unfortunately, one rarely has guidance from economic theory on this aspect of duration modeling.
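
The normalizations attached to (18.4) and (18.5) can be verified numerically. The following sketch integrates both densities with a simple midpoint rule; the parameter values ($\delta = k = 2$ for the gamma, $\delta = \theta = 1$ for the inverse Gaussian) are arbitrary choices that give both distributions mean 1 and variance 0.5, and the cutoff of 5 used for the tail comparison is likewise an assumption made for illustration.

```python
import math

def gamma_pdf(v, delta, k):
    """Gamma density (18.4)."""
    return delta ** k * v ** (k - 1) * math.exp(-delta * v) / math.gamma(k)

def inverse_gaussian_pdf(v, delta, theta):
    """Inverse-Gaussian density (18.5)."""
    return (delta / math.sqrt(math.pi) * math.exp(2.0 * delta * math.sqrt(theta))
            * v ** -1.5 * math.exp(-theta * v - delta ** 2 / v))

def integrate(fn, lo, hi, n=100_000):
    """Midpoint rule on (lo, hi); adequate for these smooth densities."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(n))

def summary(pdf, upper=60.0, cutoff=5.0):
    """Return (total mass, mean, variance, Pr[v > cutoff]) for a density on (0, upper)."""
    mass = integrate(pdf, 0.0, upper)
    mean = integrate(lambda v: v * pdf(v), 0.0, upper)
    second = integrate(lambda v: v * v * pdf(v), 0.0, upper)
    tail = integrate(pdf, cutoff, upper)
    return mass, mean, second - mean ** 2, tail

# Normalized gamma: k = delta = 2, so E[v] = 1 and V[v] = 1/delta = 0.5.
gam = summary(lambda v: gamma_pdf(v, 2.0, 2.0))
# Normalized inverse Gaussian: theta = delta^2 = 1, so E[v] = 1 and V[v] = 1/(2 theta) = 0.5.
ig = summary(lambda v: inverse_gaussian_pdf(v, 1.0, 1.0))

print("gamma            mass, mean, var, tail:", gam)
print("inverse Gaussian mass, mean, var, tail:", ig)
```

With the two variances matched, comparing the last entries shows the heavier right tail of the inverse Gaussian.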

A  second  consideration  is  generality  and  flexibility.  The  gamma  model  is  quite 
flexible  and  has  many  attractive  properties.  However,  the  inverse-Gaussian  may  better 
handle  heavy-tailed  distributions.  Both  of  these  are  one-parameter  families  (after  nor¬ 
malization).  Hougaard  (1986)  introduced  a  more  flexible  two-parameter  family  that 
has  gamma  and  inverse-Gaussian  as  special  cases.  Later  in  this  chapter  we  consider  a 
discrete  (nonparametric)  representation  that  also  affords  considerable  flexibility. 


18.2.3.  Weibull-Gamma  Mixture 


Next  we  consider  the  popular  Weibull-gamma  mixture,  which  can  be  specialized  to 
the  exponential-gamma  case.  This  model  is  a  leading  special  case  of  a  mixed  propor¬ 
tional  hazards  (MPH)  model.  The  Weibull-gamma  mixture  is,  of  course,  of  indepen¬ 
dent  interest  because  of  its  greater  generality,  and  especially  because  it  will  be  shown 
to  encompass  both  increasing  and  decreasing  hazards. 

The survivor function conditional on multiplicative $v$ for the Weibull model is

$$
S(t \mid v) = \exp(-\mu t^{\alpha} v), \qquad \mu > 0, \ \alpha > 0, \tag{18.6}
$$

where $\mu$ replaces $\gamma$ used in Chapter 17.

The unconditional survivor function is given by the average survivor function. Averaging across the heterogeneous population using the density of $v$, $g(v)$, as the weighting function yields

$$
S(t) = E_v[S(t \mid v)] = \int S(t \mid v)\, g(v)\, dv. \tag{18.7}
$$

Different choices of $g(v)$ lead to different mixtures. With appropriate changes in interpretation both continuous and discrete distributions are valid. The integral in (18.7) may not have an analytical solution. For example, if $g(v)$ is the log-normal density the integral does not have an analytical solution, but if it is a gamma distribution it does. For mathematical convenience we work with the gamma case in what follows.

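Given gamma heterogeneity the unconditional survivor function is

$$
\begin{aligned}
S(t) &= \int_0^\infty \exp(-\mu t^{\alpha} v)\, \frac{\delta^k v^{k-1} \exp(-\delta v)}{\Gamma(k)}\, dv \\
&= \frac{\delta^k}{\Gamma(k)} \int_0^\infty v^{k-1} \exp\bigl(-v(\mu t^{\alpha} + \delta)\bigr)\, dv.
\end{aligned} \tag{18.8}
$$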


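To obtain the mixture survivor function in closed form we solve the integral. Letting $\mu t^{\alpha} + \delta = \beta$ we get

$$
S(t) = \frac{\delta^k}{\Gamma(k)} \int_0^\infty \frac{(v\beta)^{k-1}}{\beta^{k-1}} \exp(-v\beta)\, dv.
$$

Define $y = v\beta$, so that $dv = \beta^{-1} dy$ and

$$
\begin{aligned}
S(t) &= \frac{\delta^k}{\Gamma(k)\,\beta^k} \int_0^\infty y^{k-1} \exp(-y)\, dy \\
&= \frac{\delta^k\, \Gamma(k)}{\Gamma(k)\, (\mu t^{\alpha} + \delta)^k} \\
&= \delta^k (\mu t^{\alpha} + \delta)^{-k} \\
&= \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-k},
\end{aligned} \tag{18.9}
$$

where the second line is obtained using the definition of $\Gamma(k)$ and substituting for $\beta$.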

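The unconditional duration density function is obtained by differentiating with respect to $t$ and multiplying by $-1$, which yields

$$
f(t) = \frac{k}{\delta}\, \mu \alpha t^{\alpha-1} \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-(k+1)}. \tag{18.10}
$$

The unconditional hazard function $\lambda(t) = f(t)/S(t)$ is given by

$$
\lambda(t) = \frac{k}{\delta}\, \mu \alpha t^{\alpha-1} \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-1}. \tag{18.11}
$$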

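These general expressions can be specialized by setting the mean of $v$ at 1; that is, set $k = \delta$, which normalizes $E[v] = 1$, and leads to the following expressions for the Weibull-gamma mixture:

$$
S(t) = \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-\delta}, \tag{18.12}
$$

$$
f(t) = \mu \alpha t^{\alpha-1} \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-(\delta+1)}, \tag{18.13}
$$

$$
\lambda(t) = -\frac{\partial \ln S(t)}{\partial t} = \mu \alpha t^{\alpha-1} \bigl[1 + (\mu t^{\alpha}/\delta)\bigr]^{-1}, \tag{18.14}
$$

which tends to the Weibull hazard as the variance $1/\delta$ goes to zero.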

The  Weibull  model  permits  either  increasing  or  decreasing  hazards  but  somewhat 
restrictively  assumes  conditionally  monotonic  hazards  at  the  individual  level.  Yet  this 
mixture  distribution  has  been  popular  in  the  econometrics  literature,  mainly  because 
of  its  convenient  properties;  see  Lancaster  (1979)  and  Narendranathan,  Nickell,  and 
Stern  (1985). 
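
As a check on the closed forms, the sketch below compares the survivor function (18.12) with a direct numerical evaluation of the integral (18.7) under normalized gamma heterogeneity, and evaluates the hazard (18.14) at a few durations. The parameter values are arbitrary; with $\alpha > 1$ the mixture hazard first rises and then falls, even though each individual Weibull hazard is increasing.

```python
import math

def mixture_survivor(t, mu, alpha, delta):
    """Closed-form Weibull-gamma survivor function (18.12)."""
    return (1.0 + mu * t ** alpha / delta) ** (-delta)

def mixture_hazard(t, mu, alpha, delta):
    """Weibull-gamma hazard (18.14)."""
    return mu * alpha * t ** (alpha - 1) / (1.0 + mu * t ** alpha / delta)

def survivor_by_quadrature(t, mu, alpha, delta, upper=40.0, n=100_000):
    """Midpoint-rule evaluation of (18.7) with the gamma density (18.4), k = delta."""
    h = upper / n
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * h
        g = delta ** delta * v ** (delta - 1) * math.exp(-delta * v) / math.gamma(delta)
        total += math.exp(-mu * t ** alpha * v) * g * h
    return total

mu, alpha, delta = 0.5, 1.5, 2.0   # arbitrary illustrative values
closed = mixture_survivor(2.0, mu, alpha, delta)
numeric = survivor_by_quadrature(2.0, mu, alpha, delta)
hazards = [mixture_hazard(t, mu, alpha, delta) for t in (0.5, 2.0, 50.0)]

print(f"S(2) closed form {closed:.6f}, quadrature {numeric:.6f}")
print("hazard at t = 0.5, 2, 50:", [round(h, 4) for h in hazards])
```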

To specialize the results to the exponential-gamma mixture set $\alpha = 1$. This yields $S(t) = [1 + (\mu t/\delta)]^{-\delta}$, $f(t) = \mu[1 + (\mu t/\delta)]^{-(\delta+1)}$, and $\lambda(t) = \mu[1 + (\mu t/\delta)]^{-1}$. The exponential-gamma mixture, also known as the Pareto distribution of the second kind, has more mass in the tails relative to the exponential. The difference between the two depends on the variance, $1/\delta$. The $r$th moment exists only if $\delta > r$.


18.2.4. Interpreting the Mixture Hazard Function

An important issue in economic applications is whether positive or negative duration dependence is present in duration data. For example, does the probability of exiting from unemployment increase (e.g., owing to the worker's reservation wage falling) or decrease (e.g., owing to the worker being viewed as damaged goods) as the length of the unemployment spell increases? In the iid case this can be easy to establish by nonparametric estimation methods. With non-iid data, however, a decreasing hazard in the raw data may be due to aggregating across different individuals, each of whom has a different constant hazard rate, or to a decreasing hazard for each individual. Distinguishing between the two can be difficult.
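
A two-type population makes the aggregation effect concrete. In the sketch below (illustrative values only), half the population has a constant hazard of 2.0 and half a constant hazard of 0.5. The aggregate hazard is the survivor-weighted average of the two, and it declines with duration because the high-hazard type exits first.

```python
import math

def aggregate_hazard(t, share_fast, hazard_fast, hazard_slow):
    """Hazard of a two-type population in which each type has a constant hazard."""
    surv_fast = share_fast * math.exp(-hazard_fast * t)
    surv_slow = (1.0 - share_fast) * math.exp(-hazard_slow * t)
    # Density of the mixture divided by the survivor function of the mixture.
    return (hazard_fast * surv_fast + hazard_slow * surv_slow) / (surv_fast + surv_slow)

grid = [0.0, 1.0, 2.0, 4.0, 8.0]
hazards = [aggregate_hazard(t, 0.5, 2.0, 0.5) for t in grid]

for t, h in zip(grid, hazards):
    print(f"t = {t:4.1f}   aggregate hazard = {h:.4f}")
```

The aggregate hazard starts at the population average (1.25) and drifts toward the hazard of the slow type (0.5), although no individual hazard changes over time.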

Consider the problem of interpreting the hazard function in the presence of unobserved heterogeneity in the exponential-gamma mixture. Notice that even if the individual hazard (i.e., the hazard conditional on $v$) is constant at $\mu$, the average or aggregate hazard $\lambda(t)$ is declining in $t$. This does not mean that there is negative duration dependence in the individual hazard rate. Rather, it is the effect induced by aggregation across individuals who differ randomly in their hazard rates. A similar erroneous interpretation can occur in the Weibull-gamma case. In that case the actual slope of the hazard function depends on $\alpha$, but the slope of the average or aggregate hazard function is affected by the presence of heterogeneity. Thus the neglect of unobserved heterogeneity may lead to underestimation of the slope of the hazard function. This result seems fairly general (see Lancaster, 1990). Salant (1977) provided an early extensive discussion of this phenomenon.

This result is the basis of the claim (see, for example, Lancaster, 1979; Heckman and Singer, 1984a) that the estimation of the hazard function in the presence of neglected unobserved heterogeneity may lead to serious biases. Our discussion motivates tests of unobserved heterogeneity in hazard models. Let us examine the argument in the context of the Weibull mixture model for which $S(t) = \int \exp(-\mu t^{\alpha} v)\, g(v)\, dv$. The aggregate hazard function is

$$
\begin{aligned}
\lambda(t) &= -\frac{\partial \ln S(t)}{\partial t} \\
&= \alpha \mu t^{\alpha-1} \int \frac{v \exp(-\mu t^{\alpha} v)}{S(t)}\, g(v)\, dv \\
&= \alpha \mu t^{\alpha-1} E[v \mid T > t],
\end{aligned}
$$

where $\exp(-\mu t^{\alpha} v)\, g(v)/S(t)$ is the density of $v$ among those who survive to time $t$.


Because $E[v \mid T > t]$ is the average of $v$ over those surviving at time $t$, it must decrease with time as individuals with higher values of $v$ leave the state sooner than individuals with low values of $v$. This changes the slope of the aggregate hazard function. This phenomenon can also be thought of as a form of selectivity bias (Chapter 16.5). Formally, the average of $v$ among survivors at time $t$ can be written as

$$
E[v \mid T > t] = \int \frac{v \exp(-\mu t^{\alpha} v)}{S(t)}\, g(v)\, dv.
$$

Therefore, for the Weibull mixture model

$$
\begin{aligned}
\frac{\partial E[v \mid T > t]}{\partial t}
&= -\alpha \mu t^{\alpha-1} \int \frac{v^2 \exp(-\mu t^{\alpha} v)}{S(t)}\, g(v)\, dv
 + \alpha \mu t^{\alpha-1} \left[ \int \frac{v \exp(-\mu t^{\alpha} v)}{S(t)}\, g(v)\, dv \right]^2 \\
&= -\alpha \mu t^{\alpha-1} \bigl\{ E[v^2 \mid T > t] - (E[v \mid T > t])^2 \bigr\} \\
&= -\alpha \mu t^{\alpha-1} V[v \mid T > t] \\
&< 0.
\end{aligned} \tag{18.15}
$$


Hence,  neglecting  heterogeneity  results  in  an  estimated  hazard  rate  that  is  falling  faster 
or  rising  more  slowly  than  the  actual  hazard  rate. 
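
The gamma case gives a closed-form illustration of (18.15). The steps below are a worked sketch rather than part of the general argument; they rely on the fact that multiplying the gamma density (18.4) by $\exp(-\mu t^{\alpha} v)$ and renormalizing gives another gamma density, with the same shape $k$ and rate $\delta + \mu t^{\alpha}$.

```latex
\begin{aligned}
E[v \mid T > t] &= \frac{k}{\delta + \mu t^{\alpha}}, \qquad
V[v \mid T > t] = \frac{k}{(\delta + \mu t^{\alpha})^{2}}, \\[4pt]
\lambda(t) &= \alpha \mu t^{\alpha-1} E[v \mid T > t]
            = \frac{k}{\delta}\, \mu \alpha t^{\alpha-1}
              \bigl[ 1 + (\mu t^{\alpha}/\delta) \bigr]^{-1}, \\[4pt]
\frac{\partial E[v \mid T > t]}{\partial t}
  &= -\frac{k\, \alpha \mu t^{\alpha-1}}{(\delta + \mu t^{\alpha})^{2}}
   = -\alpha \mu t^{\alpha-1}\, V[v \mid T > t] < 0 .
\end{aligned}
```

The second line reproduces the hazard (18.11), and the third line agrees with (18.15) term by term.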

Another interesting comparison between models with and without heterogeneity is the proportional impact of a change in a covariate on the hazard rate. In the absence of heterogeneity

$$
\ln \lambda(t \mid \mu) = \ln(\mu t^{\alpha-1}) + \ln \alpha,
$$

and the proportional impact of a change in $x_j$ on $\lambda$ is

$$
\frac{\partial \ln \lambda(t \mid \mu)}{\partial x_j} = \beta_j,
$$

which is a property of the proportional hazard model.

Allowing for unobserved heterogeneity

$$
\begin{aligned}
\ln \lambda(t \mid \mu, v) &= \ln(\mu t^{\alpha-1}) + \ln \alpha + \ln E[v \mid T > t] \\
&= \ln \alpha + \ln \mu + (\alpha - 1) \ln t + \ln E[v \mid T > t],
\end{aligned}
$$

whence, noting that $\ln \mu = \mathbf{x}'\boldsymbol{\beta}$ and $\partial E[v \mid T > t]/\partial x_j = -\mu t^{\alpha} V[v \mid T > t]\, \beta_j$, it follows that for the Weibull mixture model

$$
\frac{\partial \ln \lambda(t \mid \mu, v)}{\partial x_j} = \beta_j \left[ 1 - \frac{\mu t^{\alpha} V[v \mid T > t]}{E[v \mid T > t]} \right]. \tag{18.16}
$$

The result shows that given heterogeneity the proportional impact of a change in $x_j$ is smaller and depends on $t$ and is no longer of the proportional hazard type. Thus, the estimates derived from the model may be misleading even if the unobserved heterogeneity term is uncorrelated with the included covariates.

Similar consequences of unobserved heterogeneity for models more general than the Weibull are discussed in Lancaster and Nickell (1980).
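
A numerical check of (18.16) is straightforward in the Weibull-gamma case. The sketch below assumes a single covariate with $\mu = \exp(\beta_0 + \beta_1 x)$ and uses the gamma-specific ratio $V[v \mid T > t]/E[v \mid T > t] = 1/(\delta + \mu t^{\alpha})$, which follows from the survivors' distribution of $v$ being gamma with rate $\delta + \mu t^{\alpha}$. All parameter values are arbitrary.

```python
import math

def log_hazard(x, t, beta0, beta1, alpha, delta):
    """Log of the Weibull-gamma hazard (18.14) with mu = exp(beta0 + beta1 * x)."""
    mu = math.exp(beta0 + beta1 * x)
    return (math.log(mu) + math.log(alpha) + (alpha - 1.0) * math.log(t)
            - math.log(1.0 + mu * t ** alpha / delta))

def proportional_effect(x, t, beta0, beta1, alpha, delta):
    """Right-hand side of (18.16) with V/E = 1 / (delta + mu t^alpha) under gamma heterogeneity."""
    mu = math.exp(beta0 + beta1 * x)
    return beta1 * (1.0 - mu * t ** alpha / (delta + mu * t ** alpha))

def finite_difference(x, t, beta0, beta1, alpha, delta, step=1e-5):
    """Central difference of the log hazard with respect to x."""
    up = log_hazard(x + step, t, beta0, beta1, alpha, delta)
    down = log_hazard(x - step, t, beta0, beta1, alpha, delta)
    return (up - down) / (2.0 * step)

params = dict(beta0=-1.0, beta1=0.5, alpha=1.2, delta=2.0)   # arbitrary values
analytic = [proportional_effect(1.0, t, **params) for t in (1.0, 3.0, 5.0)]
numeric = [finite_difference(1.0, t, **params) for t in (1.0, 3.0, 5.0)]

for t, a, b in zip((1.0, 3.0, 5.0), analytic, numeric):
    print(f"t = {t}: formula {a:.6f}, finite difference {b:.6f}")
```

The effect is below $\beta_1 = 0.5$ at every duration and shrinks as $t$ grows, which is the departure from proportionality described in the text.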


18.3.  Identification  in  Mixture  Models 

Associated with mixture models is a general identification problem. This issue concerns the logical possibility of decomposing the individual contributions to the average survival probability of the baseline hazard, the unobserved heterogeneity, and the covariates, given the observed data $(t, \mathbf{x})$ pertaining to a single spell. More specifically, if the PH model were not identified, then it would be logically impossible to separate the individual contributions of duration dependence and unobserved heterogeneity. As in most discussions of identification, some restrictions are placed on the formulation. In the econometric literature the case of (mixed) proportional hazards has been investigated in detail. Heckman and Singer (1984b) and Elbers and Ridder (1982) have established the identification of the MPH model under certain conditions. Van den Berg (2001) provides an excellent discussion of these earlier proofs as well as later contributions.

Discussions of identifiability of the MPH model begin with the average or aggregate survivor function

$$
\begin{aligned}
S(t \mid \mathbf{x}) &= E_v[S(t \mid \mathbf{x}, v)] \\
&= \int \exp\bigl(-v \Lambda_0(t) \phi(\mathbf{x})\bigr)\, g(v)\, dv,
\end{aligned} \tag{18.17}
$$

which assumes proportionality of hazards as in (18.1), uses the PH formulation of Section 17.8, but does not make parametric assumptions on $\Lambda_0$, $\phi$, or $g$. Here $\Lambda_0(t) = \int_0^t \lambda_0(s)\, ds$. The model is said to be nonparametrically identified if, given the data, the functions $\lambda_0$, $g$, and $\phi$ are unique. We add the qualifier “nonparametrically” because of the absence of functional form assumptions.

Variations  in  observed  survival  times  are  due  to  variations  in  the  covariates  x,  in 
v,  and  in  the  duration  dependence  function  (baseline  hazard).  Identifiability  means 
a  unique  decomposition  of  the  variation.  A  proof  of  identifiability  must  show  that 
these  separate  contributions  are  in  principle  identifiable.  Most  of  the  available  proofs 
use  advanced  mathematical  tools  to  show  that  the  likelihood  function  can  be  uniquely 
decomposed.  Melino  and  Sueyoshi  (1990)  provide  a  simpler  proof. 

The conditions required for nonparametric identification include the following: (i) The heterogeneity term $v$ is assumed to be time-invariant and distributed independently of $\mathbf{x}$. (ii) $g(v)$ is nondegenerate and has finite mean (i.e., $E[v] < \infty$). (iii) $\phi(\mathbf{x}) > 0$ for all $\mathbf{x}$. (iv) $\lambda_0(t)$ is continuous and positive on $[0, \infty)$. (v) Observed explanatory variables $\mathbf{x}$ are linearly independent and have sufficient variation. Different proofs have some subtle variation on these conditions but we will not delve into these here.

Whereas the issue of nonparametric identification involves considerable mathematical subtleties, the problem is also relevant in the context of parametric models. If one specifies parametric forms such as $\lambda_0(t \mid \alpha)$, $\phi(\mathbf{x} \mid \boldsymbol{\beta})$, and $g(v \mid \gamma)$, then are these functions unique given the data? The answer, unfortunately, may be “no” in many cases.
This  means  that  one  investigator  may  estimate  a  particular  mixture  model  with  no 
computational  problems,  and  apparently  “nice”  results  and  meaningful  coefficients. 
However,  this  representation  may  not  be  unique.  Another  investigator  may  produce 
equally  nice  results  under  different  parametric  assumptions  and  with  different  impli¬ 
cations.  That  is,  the  observed  survivor  function  may  be  consistent  with  other  choices 
of  the  baseline  hazard  and  heterogeneity  distributions  (Lancaster,  1990,  chapter  4).  In 
the  terminology  of  Section  2.2,  different  structural  models,  with  substantively  different 
policy implications, may have the same reduced form. This clearly poses a problem for
parametric applied work. One appealing solution is to choose flexible parametric forms
for  hazard  and  heterogeneity,  or  else  to  take  the  semiparametric  approach  of  partial 
likelihood  analysis.  The  discussion  of  this  issue  continues  in  the  next  section. 
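
A simple case of this non-uniqueness can be computed directly. The sketch below ignores covariates altogether, which is an assumption made for illustration: the identification conditions above rely on variation in $\mathbf{x}$. Without covariates, an exponential-gamma mixture (constant individual hazard $\mu$, gamma heterogeneity with variance $1/\delta$) and a homogeneous model with the declining baseline hazard $\lambda_0(s) = \mu/(1 + \mu s/\delta)$ imply the same survivor function, so single-spell data cannot tell them apart.

```python
import math

def survivor_mixture(t, mu, delta):
    """Exponential-gamma mixture: constant individual hazard plus gamma heterogeneity."""
    return (1.0 + mu * t / delta) ** (-delta)

def survivor_homogeneous(t, mu, delta, n=20_000):
    """No heterogeneity; baseline hazard mu / (1 + mu s / delta), integrated by the midpoint rule."""
    h = t / n
    cumulative_hazard = h * sum(mu / (1.0 + mu * (i + 0.5) * h / delta) for i in range(n))
    return math.exp(-cumulative_hazard)

mu, delta = 0.8, 1.5   # arbitrary illustrative values
gaps = [abs(survivor_mixture(t, mu, delta) - survivor_homogeneous(t, mu, delta))
        for t in (0.5, 2.0, 10.0)]
print("largest gap between the two survivor functions:", max(gaps))
```

The two models have very different behavioral interpretations (pure heterogeneity versus pure negative duration dependence) yet fit the marginal distribution of $t$ equally well.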


18.4.  Specification  of  the  Heterogeneity  Distribution 

The  sensitivity  of  coefficient  estimates  to  alternative  assumptions  about  the  hetero¬ 
geneity  has  been  extensively  discussed  in  the  literature.  Two  apparently  contradictory 
positions  may  be  discerned: 

1.  Parametric  specifications  of  unobserved  heterogeneity  are  often  somewhat  arbitrary. 
They  may  seriously  distort  inferences  about  the  hazard  function.  Hence  a  parametrically 
flexible  or  nonparametric  specification  is  desirable.  See  Heckman  and  Singer  (1984a). 

2.  Parametric  specifications  of  unobserved  heterogeneity  are  relatively  innocuous  if  the 
baseline  hazard  function  is  correctly  specified.  When  the  specification  of  the  hazard 
function  is  in  doubt  and/or  is  incorrect,  the  estimates  produced  using  different  para¬ 
metric  assumptions  for  heterogeneity  may  lead  to  different  estimates  of  the  marginal 
distribution  of  the  data.  See  Manton,  Stallard,  and  Vaupel  (1986). 

The apparent contradiction between the two positions may be resolved as follows. The specification of the hazard function affects the first moment of the distribution $f(t)$, whereas that of heterogeneity affects its second moment, assuming that it is uncorrelated with the observed covariates. If the hazard function is correctly specified, then the main impact of the heterogeneity distribution would be on the relative efficiency of the estimator.


18.4.1.  Discrete-Time  PH  with  Gamma  Heterogeneity 

The  preceding  considerations  suggest  that  a  proportional  hazard  formulation  with  an 
arbitrary  hazard  function  makes  an  attractive  model  with  which  to  combine  a  specific 
heterogeneity  assumption.  Han  and  Hausman  (1990)  and  Meyer  (1990)  combine  the 
gamma  heterogeneity  assumption  with  the  discrete-time  proportional  hazard  model 
developed  in  Section  17.10.  They  report  that  when  the  baseline  hazard  is  not  parame¬ 
terized  estimates  show  little  sensitivity  to  alternative  functional  forms  for  g(v). 

For  specificity  reconsider  (17.43)  after  including  a  heterogeneity  term: 


which can be substituted into the expression for the log-likelihood (17.44). The heterogeneity term needs to be integrated out. Han and Hausman give a closed-form expression under the gamma heterogeneity assumption and report results that indicate relatively minor sensitivity to parametric assumptions given their flexible hazard specification.


18.4.2. Some Other Models for Heterogeneity

The preceding discussion emphasized the computational convenience of the Weibull-gamma model, which has a closed form.

If  the  tail  of  the  observed  marginal  distributions  is  thicker  than  is  consistent  with  the 
gamma  or  log-normal,  one  may  consider  a  member  of  the  Mandelbrot  stable  family  of 
distributions.  Hougaard  (1986)  proposed  a  very  general  family  that  nests,  for  example, 
the  gamma  and  inverse-Gaussian  families  (also  see  Jaggia,  1991b).  A  strictly  stable 
distribution  obeys  the  condition  that  the  sum  of  p  independent  realizations  should 
have  the  same  distribution  as  a  scale  factor  times  the  distribution.  Hougaard  (2000, 
appendix  3.3)  provides  a  summary  of  its  properties. 

Although  a  more  highly  parameterized  heterogeneity  distribution  looks  attractive 
because  of  its  greater  generality,  it  may  lead  to  two  kinds  of  problems.  The  first  prob¬ 
lem  is  that  the  available  data  may  not  be  sufficiently  rich  to  allow  us  to  identify  or 
precisely  estimate  the  parameters.  Often  this  situation  cannot  be  recognized  without 
attempting  estimation  in  the  first  place. 

The  second  problem  is  computational.  If  the  mixture  density  does  not  have  a  closed 
form,  it  is  then  left  in  the  form  of  an  integral.  The  resulting  likelihood  function  has 
terms  that  are  also  integrals.  Estimation  requires  the  use  of  computer-intensive  nu¬ 
merical  methods  such  as  numerical  or  Monte  Carlo  integration  that  were  discussed  in 
Chapter  12.  An  example  of  a  mixture  model  that  requires  such  estimation  techniques  is 
the  Weibull-log-normal  mixture  in  which  unobserved  heterogeneity  has  a  log-normal 
distribution.  Simulation-based  estimation  of  heterogeneity  models  is  discussed  by 
Gourieroux  and  Monfort  (1991,  1996)  and  considered  as  an  example  in  Section  12.2. 
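
A minimal sketch of Monte Carlo integration for the Weibull-log-normal mixture follows. It assumes $\ln v \sim N(-\sigma^2/2, \sigma^2)$ so that $E[v] = 1$, and it approximates the survivor function $E_v[\exp(-\mu t^{\alpha} v)]$ by an average over simulated draws. The function names and parameter values are hypothetical; a full simulated likelihood would apply the same averaging to the density of each observation.

```python
import math
import random

def lognormal_draws(sigma, n, seed=0):
    """Draws of v with ln v ~ N(-sigma^2 / 2, sigma^2), so that E[v] = 1."""
    rng = random.Random(seed)
    return [math.exp(sigma * rng.gauss(0.0, 1.0) - 0.5 * sigma ** 2) for _ in range(n)]

def simulated_survivor(t, mu, alpha, draws):
    """Monte Carlo estimate of S(t) = E_v[exp(-mu t^alpha v)]."""
    scale = mu * t ** alpha
    return sum(math.exp(-scale * v) for v in draws) / len(draws)

draws = lognormal_draws(sigma=0.5, n=50_000)
mean_v = sum(draws) / len(draws)

grid = [0.0, 0.5, 1.0, 2.0, 4.0]
# Reusing the same draws at every t keeps the simulated function smooth in t.
survivor = [simulated_survivor(t, 0.7, 1.3, draws) for t in grid]

print(f"average of the draws: {mean_v:.4f}")
for t, s in zip(grid, survivor):
    print(f"S({t}) is approximately {s:.4f}")
```

Holding the draws fixed across evaluations is the usual device for keeping a simulated objective function well behaved during optimization.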

18.5.  Discrete  Heterogeneity  and  Latent  Class  Analysis 

The  preceding  analysis  assumed  a  continuous  distribution  of  unobserved  heterogeneity 
and  concentrated  on  estimation  of  the  parameters  of  that  distribution. 

An  alternative  approach  assumes  that  the  sample  of  individuals  is  drawn  from  a  pop¬ 
ulation  that  consists  of  a  finite  number  of  latent  classes,  say  q ,  and  that  each  element 
in  the  sample  can  be  regarded  as  a  draw  from  one  of  these  q  latent  sub-populations 
or  strata.  This  model  is  known  variously  as  the  finite  mixture  model,  semiparametric 
heterogeneity model (Heckman and Singer, 1984a), and latent class model (Aitkin
and Rubin, 1985). Its attractive feature is that it leads to a flexible parametric distribution.
In duration modeling the model has been analyzed, advocated, and applied by
Heckman  and  Singer  (1984a). 

Although  these  popular  models  are  presented  in  the  context  of  duration  models,  a 
general  notation  is  used  to  emphasize  the  potential  for  application  elsewhere;  see,  for 
example,  Section  20.4. 


18.5.1.  Finite  Mixture  Model 

Consider the following two-component finite mixture model. If the sample is a probabilistic mixture from two subpopulations with pdf $f_1(t \mid \mu_1(\mathbf{x}))$ and $f_2(t \mid \mu_2(\mathbf{x}))$, then $\pi f_1(\cdot) + (1 - \pi) f_2(\cdot)$, where $0 < \pi < 1$, defines a two-component finite mixture. That is, observations are draws from $f_1$ and $f_2$, with probabilities $\pi$ and $1 - \pi$, respectively. The parameters to be estimated are $(\pi, \mu_1, \mu_2)$. The parameter $\pi$ may be treated as constant or may be further parameterized using, for example, the logit function. Thus $\pi = \exp(\lambda)/[1 + \exp(\lambda)]$ and $\lambda$ in turn may be parameterized in terms of further observable covariates. Thus we think of two types of individuals, those that come from $f_1(\cdot)$ and those that come from $f_2(\cdot)$. Perhaps there may be an a priori case for thinking along these lines, for example if there is some latent characteristic that partitions the sampled population in this way. An alternative interpretation is simply that the linear combination of densities makes a good approximation to the observed distribution of $t$.

Generalization to additive mixtures with three or more components is in principle
straightforward but subject to potential problems of the identifiability of the components.
This is discussed further later in the chapter. Therefore, it is very helpful in
empirical  application  if  the  components  have  a  natural  interpretation.  At  the  simplest 
level  we  think  of  each  subpopulation  as  a  “type,”  but  in  many  situations  a  more  infor¬ 
mative  interpretation  may  be  possible  (Lindsey,  1995). 

Another interpretation of the finite mixture model is in terms of a discrete representation of population heterogeneity. Suppose the population consists of $m$ homogeneous subpopulations, usually called components. A parametric model, such as the Weibull or exponential, is supposed to apply to each component. Assume that the $j$th component is a fraction $\pi_j$ of the total population, with $\sum_{j=1}^m \pi_j = 1$.

Formally, the problem is formulated as follows: In all previous examples the distribution of the unobserved heterogeneity term has infinitely many points of support. If the continuous mixing distribution $g(v_i)$ can be approximated by a discrete distribution, denoted by $\pi_j$ $(j = 1, \ldots, m)$, with a finite number, $m$, of support points, then the marginal (mixture) distribution is

$$
h(t_i \mid \mathbf{x}_i, \pi_j, \boldsymbol{\beta}) = \sum_{j=1}^{m} f(t_i \mid \mathbf{x}_i, v_j, \boldsymbol{\beta})\, \pi_j(v_j), \tag{18.18}
$$

where $v_j$ is an estimated support point and $\pi_j$ is the associated probability. This semiparametric representation of unobserved heterogeneity was examined by Heckman and Singer (1984a) in duration modeling. Closely related work is that of Wedel et al. (1993), where the latent class interpretation is favored. If the mixing distribution $\pi_j$ is not subject to any parametric assumptions, then the mixture model is called a semiparametric mixture model for $t$.

The estimation of the finite mixture model may be carried out under the assumption of either a known or an unknown number of components. If the fractions $\pi_j$ are known, the parameters of the component distributions can be estimated by maximum likelihood. More usually the proportions $\pi_j$, $j = 1, \ldots, m$, are unknown and the estimation involves both the $\pi_j$ and the component parameters. The maximum likelihood estimator for the latter case is called the nonparametric maximum likelihood estimator (NPMLE). Here the nonparametric component is the number of classes, but it is strictly a semiparametric method because it is combined with parametric models for the components. If the number of components is unknown, as is usually the case, then some delicate issues of inference arise. See Section 18.5.4 for details.

An obvious motivation for the finite mixture class is that it is a natural and simple way to treat population heterogeneity. In many situations it is simpler to think of unobserved heterogeneity in terms of a small number of latent classes rather than a continuum of “types” as in Section 18.2.


18.5.2.  Latent  Class  Interpretation 

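The finite mixture model is related to latent class analysis (Aitkin and Rubin, 1985; Wedel et al., 1993). Let $\mathbf{d}_i = (d_{i1}, \ldots, d_{im})$ define an indicator (dummy) variable such that $d_{ij} = 1$ (with $\sum_j d_{ij} = 1$) indicates that $t_i$ was drawn from the $j$th (latent) group or class for $i = 1, \ldots, N$. That is, each observation may be regarded as a sample from one of the $m$ latent subpopulations, classes, or “types.” In the discussion that follows we assume that the model is identified.

The model specifies that $(t_i \mid \mathbf{d}_i, \boldsymbol{\mu}, \boldsymbol{\pi})$ are independently distributed with densities

$$
\sum_{j=1}^{m} d_{ij} f_j(t_i \mid \mu_j) = \prod_{j=1}^{m} f_j(t_i \mid \mu_j)^{d_{ij}}, \tag{18.19}
$$

where $\mu_j = \mu(\mathbf{x}_i, \boldsymbol{\beta}_j)$, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_m)$, and $(\mathbf{d}_i \mid \boldsymbol{\mu}, \boldsymbol{\pi})$ are iid with multinomial distribution

$$
\prod_{j=1}^{m} \pi_j^{d_{ij}}, \qquad 0 < \pi_j < 1, \qquad \sum_{j=1}^{m} \pi_j = 1. \tag{18.20}
$$

The last two relations imply, after summing over the $m$ possible values of $\mathbf{d}_i$, that

$$
(t_i \mid \boldsymbol{\mu}, \boldsymbol{\pi}) \sim \sum_{j=1}^{m} \pi_j f_j(t_i \mid \mu_j),
$$

which leads to the likelihood function

$$
L(\boldsymbol{\beta}, \boldsymbol{\pi} \mid \mathbf{t}) = \prod_{i=1}^{N} \sum_{j=1}^{m} \pi_j f_j(t_i \mid \mu_j). \tag{18.21}
$$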

18.5.3.  EM  Algorithm 

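This likelihood function may be maximized directly or by applying the EM algorithm in which the variables $\mathbf{d} = (\mathbf{d}_1, \ldots, \mathbf{d}_N)$ are treated as missing data; see Section 10.3. If the $\mathbf{d}$ were observable the log-likelihood of the model would be

$$
\ln L(\boldsymbol{\mu}, \boldsymbol{\pi} \mid \mathbf{t}, \mathbf{d}) = \sum_{i=1}^{N} \sum_{j=1}^{m} d_{ij} \ln f_j(t_i \mid \mu_j) + \sum_{i=1}^{N} \sum_{j=1}^{m} d_{ij} \ln \pi_j. \tag{18.22}
$$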


If $\pi_j$, $j = 1, \ldots, m$, are given, the posterior probability that observation $t_i$ belongs to the population $j$, $j = 1, 2, \ldots, m$, denoted $z_{ij}$, is given by

$$
z_{ij} = \Pr[t_i \in \text{population } j] = \frac{\pi_j f_j(t_i \mid \mathbf{x}_i, \boldsymbol{\beta}_j)}{\sum_{l=1}^{m} \pi_l f_l(t_i \mid \mathbf{x}_i, \boldsymbol{\beta}_l)}. \tag{18.23}
$$

The average value of $z_{ij}$ over $i$ is the probability that a randomly chosen individual belongs to subpopulation $j$. This equals $\pi_j$:

$$
E[z_{ij}] = \pi_j.
$$

Suppose we have available an estimate $\hat{z}_{ij}$ of $E[d_{ij}]$. Then, conditional on this estimate we have

$$
EL(\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_m, \boldsymbol{\pi} \mid \mathbf{t}, \hat{\mathbf{z}}, \mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \ln f_j(t_i \mid \mathbf{x}_i, \boldsymbol{\beta}_j) + \sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \ln \pi_j, \tag{18.24}
$$

which constitutes the E-step of the EM algorithm. The M-step of the algorithm maximizes $EL$ by solving the first-order conditions

$$
\hat{\pi}_j - N^{-1} \sum_{i=1}^{N} \hat{z}_{ij} = 0, \qquad j = 1, \ldots, m, \tag{18.25}
$$

$$
\sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \frac{\partial \ln f_j(t_i \mid \boldsymbol{\beta}_j)}{\partial \boldsymbol{\beta}_j} = \mathbf{0}. \tag{18.26}
$$

Next we can use (18.23) to get new values of $\hat{z}_{ij}$ and iterate through the E- and M-steps. Once the process converges the variances can be computed using either the information matrix or the robust formula.
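
The E- and M-steps can be sketched for the simplest case. The code below is a minimal illustration, not a general implementation: it assumes exponential components without covariates, so that each $f_j(t) = \mu_j \exp(-\mu_j t)$ has a single rate parameter and the first-order condition (18.26) has the closed-form solution $\hat{\mu}_j = \sum_i \hat{z}_{ij} / \sum_i \hat{z}_{ij} t_i$. The data are simulated, and the starting values and number of iterations are arbitrary.

```python
import math
import random

def em_exponential_mixture(durations, shares, rates, iterations=200):
    """EM for the mixture sum_j pi_j * rate_j * exp(-rate_j * t)."""
    n, m = len(durations), len(shares)
    loglik_path = []
    for _ in range(iterations):
        # E-step: posterior class probabilities z_ij, as in (18.23).
        z, loglik = [], 0.0
        for t in durations:
            weights = [p * r * math.exp(-r * t) for p, r in zip(shares, rates)]
            total = sum(weights)
            loglik += math.log(total)          # log-likelihood at the current parameters
            z.append([w / total for w in weights])
        loglik_path.append(loglik)
        # M-step: (18.25) for the shares, closed-form solution of (18.26) for the rates.
        z_sum = [sum(row[j] for row in z) for j in range(m)]
        shares = [s / n for s in z_sum]
        rates = [z_sum[j] / sum(row[j] * t for row, t in zip(z, durations))
                 for j in range(m)]
    return shares, rates, loglik_path

# Simulated data: 40% of spells from a fast-exit class (rate 5.0), 60% from a slow class (rate 0.5).
rng = random.Random(1)
data = [rng.expovariate(5.0) if rng.random() < 0.4 else rng.expovariate(0.5)
        for _ in range(2000)]

shares, rates, path = em_exponential_mixture(data, [0.5, 0.5], [2.0, 0.2])
print("estimated shares:", [round(s, 3) for s in shares])
print("estimated rates: ", [round(r, 3) for r in rates])
```

A useful diagnostic is that the log-likelihood recorded in `path` should never decrease from one iteration to the next; a decrease signals a coding error rather than a property of the data.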


18.5.4.  Choosing  the  Number  of  Latent  Classes 

The first important issue concerns the choice of $m$, the number of components. Often there is no guiding prior theory and the choice is usually made on pragmatic grounds. Because the dimension of parameters to be estimated is $m \dim[\boldsymbol{\beta}] + m - 1$, the number of parameters can be quite large. This number can be decreased somewhat if some elements of $\boldsymbol{\beta}$ are restricted to be equal. One popular method involves allowing the intercept to vary but restricting the slope parameters to be the same across groups (as in (18.18)). However, there is clearly an incentive to keep $m$ small if all parameters are allowed to vary across classes. Even when only the intercepts are allowed to vary, many applications use $m = 2$. A sensible strategy is to start with $m = 2$, and then check the fit of the model using diagnostic tests. An additional component is added if the fit is poor. Adding components that cannot be reliably differentiated is problematic. When intergroup differences are small, the finite mixture representation is not needed. The most desirable situation is one in which the components have an interpretation. Choice between models of different dimensions can be made using a penalized likelihood criterion (AIC or BIC); see Section 8.5.1. The likelihood ratio test is not appropriate because the null hypothesis places parameters on the boundary of the parameter space. Baker and Melino (2000) describe a Monte Carlo experiment that dramatically reveals the potential pitfalls of overparameterization in a model in which both duration dependence and heterogeneity are flexibly specified owing to a desire to avoid misspecification. For model selection they recommend comparing a penalized likelihood criterion across candidate latent class models, with a high penalty for more parameters.

When the model is overparameterized the parameters cannot be identified. The problem may manifest itself by the presence of multiple optima or a flat likelihood surface. The computational algorithm may converge to different points depending on the starting values.

A  model  selected  from  competing  models  using  the  penalized  likelihood  criterion 
may  not  necessarily  describe  the  sample  data  well.  This  can  only  be  ascertained  by 
a  suitable  goodness-of-fit  test  and  model  diagnostics.  Essentially  one  compares  the 
actual  and  fitted  distribution  of  durations;  a  significantly  large  deviation  between  the 
two  indicates  that  the  systematic  component  of  the  model  does  not  adequately  explain 
the  observed  sample  variation.  Some  possibilities  are  considered  in  the  next  section. 
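
The comparison by penalized likelihood can be sketched as follows, using the standard definitions $\text{AIC} = -2 \ln L + 2q$ and $\text{BIC} = -2 \ln L + q \ln N$, with $q = m \dim[\boldsymbol{\beta}] + m - 1$ parameters when all coefficients vary across classes. The log-likelihood values below are hypothetical placeholders chosen only to show the mechanics; they are not estimates from any data set.

```python
import math

def n_parameters(m, dim_beta):
    """Parameter count when all coefficients vary by class: m * dim(beta) + (m - 1) shares."""
    return m * dim_beta + m - 1

def aic(loglik, q):
    return -2.0 * loglik + 2.0 * q

def bic(loglik, q, n_obs):
    return -2.0 * loglik + math.log(n_obs) * q

# Hypothetical maximized log-likelihoods for m = 1, 2, 3 classes (placeholders, not real results).
logliks = {1: -1250.0, 2: -1228.0, 3: -1225.5}
dim_beta, n_obs = 3, 1000

table = {m: (n_parameters(m, dim_beta),
             aic(ll, n_parameters(m, dim_beta)),
             bic(ll, n_parameters(m, dim_beta), n_obs))
         for m, ll in logliks.items()}

best_aic = min(table, key=lambda m: table[m][1])
best_bic = min(table, key=lambda m: table[m][2])
for m, (q, a, b) in table.items():
    print(f"m = {m}: q = {q:2d}, AIC = {a:.1f}, BIC = {b:.1f}")
print("preferred m by AIC:", best_aic, " by BIC:", best_bic)
```

Because BIC penalizes each extra parameter by $\ln N$ rather than 2, it is the more conservative of the two criteria in samples of this size.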


Computational  Considerations 

A  second  issue  concerns  the  choice  of  computer  algorithm.  Whereas  the  EM  algorithm 
is  very  helpful  in  understanding  the  computational  structure  of  the  problem,  in  practice 
it  often  tends  to  be  slow.  The  authors  have  found  many  instances  in  which  the  Newton- 
Raphson  algorithm  based  on  numerical  derivatives  has  produced  satisfactory  results. 
See  Haughton  (1997)  for  a  survey  of  alternatives.  No  matter  which  algorithm  is  used,  if 
the  intergroup  differences  are  small,  the  likelihood  surface  will  tend  to  exhibit  several 
local  maxima.  In  any  case,  a  single  unique  maximum  is  not  guaranteed. 
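
The dependence on starting values can be explored with a small experiment. The sketch below is only illustrative: it assumes a two-component exponential mixture, simulated data with made-up parameter values, and hypothetical function names (`em_step`, `run_em`). It runs EM from several starting points and records the log-likelihood path, which EM never decreases but which need not end at the same point for every start.

```python
import math
import random

def mixture_loglik(t, pi1, lam1, lam2):
    """Log-likelihood of a two-component exponential mixture."""
    return sum(
        math.log(pi1 * lam1 * math.exp(-lam1 * ti)
                 + (1.0 - pi1) * lam2 * math.exp(-lam2 * ti))
        for ti in t
    )

def em_step(t, pi1, lam1, lam2):
    """One EM iteration: posterior weights (E-step), then weighted MLEs (M-step)."""
    w = []
    for ti in t:
        a = pi1 * lam1 * math.exp(-lam1 * ti)
        b = (1.0 - pi1) * lam2 * math.exp(-lam2 * ti)
        w.append(a / (a + b))
    sw = sum(w)
    n = len(t)
    pi1_new = sw / n
    lam1_new = sw / sum(wi * ti for wi, ti in zip(w, t))
    lam2_new = (n - sw) / sum((1.0 - wi) * ti for wi, ti in zip(w, t))
    return pi1_new, lam1_new, lam2_new

def run_em(t, start, iters=100):
    """Run EM from a starting value; return final parameters and the log-likelihood path."""
    params = start
    path = [mixture_loglik(t, *params)]
    for _ in range(iters):
        params = em_step(t, *params)
        path.append(mixture_loglik(t, *params))
    return params, path

random.seed(12345)
# Simulated durations (illustrative values): 40% with hazard 2.0, 60% with hazard 0.2.
durations = [random.expovariate(2.0) if random.random() < 0.4 else random.expovariate(0.2)
             for _ in range(500)]
starts = [(0.5, 1.0, 0.5), (0.2, 5.0, 0.1), (0.8, 0.3, 3.0)]
fits = [run_em(durations, s) for s in starts]
```

Comparing the final entries of each path across starting values gives a quick check for multiple optima; a start with the component labels reversed may reach the same likelihood with the labels permuted.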

All finite mixture models are unidentified in the sense that the distribution of the
data is unchanged if the subpopulation labels are permuted. That is, relabeling
“component 1” as “component 2,” or vice versa, makes no difference. This problem can be
dealt with by specifying either the $\pi_j$ or $\mu_j$ to be nondecreasing. It is
desirable that the component labels have some behavioral interpretation.
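
One simple way to impose such an ordering after estimation is to relabel the components. The helper below is a hypothetical sketch (its name and the example values are ours): it sorts the components so that the $\mu_j$ are nondecreasing and carries the mixing probabilities along.

```python
def order_components(pis, mus):
    """Relabel mixture components so that the mus are nondecreasing."""
    order = sorted(range(len(mus)), key=lambda j: mus[j])
    return [pis[j] for j in order], [mus[j] for j in order]

# Illustrative values only: component 2 has the smaller mean, so it becomes component 1.
pis, mus = order_components([0.7, 0.3], [2.5, 0.4])
```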

One  potential  limitation  of  the  finite  mixture  model  is  that  additional  components 
may  simply  reflect  the  presence  of  outliers.  Though  this  is  not  necessarily  a  bad  thing, 
it  is  useful  to  be  able  to  identify  the  outlying  observations  that  are  responsible  for  one 
or  more  components.  Equation  (18.23)  can  be  useful  in  this  regard.  Postestimation  one 
could  calculate  the  posterior  probability.  For  outliers  these  probabilities  will  be  large 
with  respect  to  one  component  and  small  with  respect  to  the  rest. 
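
A minimal sketch of this postestimation check follows. It assumes that (18.23) is the usual Bayes-rule posterior probability of component membership and, purely for illustration, that the components are exponential; the parameter values are made up.

```python
import math

def posterior_probs(t, pis, lams):
    """Posterior probability that duration t came from each exponential component."""
    joint = [p * lam * math.exp(-lam * t) for p, lam in zip(pis, lams)]
    total = sum(joint)
    return [j / total for j in joint]

# Illustrative (made-up) values: a large high-hazard class and a small low-hazard class.
pis, lams = [0.95, 0.05], [1.0, 0.05]
typical = posterior_probs(0.8, pis, lams)   # a short, ordinary spell
extreme = posterior_probs(40.0, pis, lams)  # a very long spell
```

Observations such as the long spell here, whose posterior probability is concentrated almost entirely on a small component, are candidates for the outliers that the component may be capturing.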


18.6.  Stock  and  Flow  Sampling 

In  many  practical  situations  the  following  question  arises:  What  is  the  relationship 
between  two  or  more  different  average  duration  measures  that  are  available?  From  de¬ 
mography  comes  the  well-known  distinction  between  average  age  and  expected  life 
span.  In  real  estate  there  is  the  distinction  between  the  average  period  that  a  property 
offered  for  sale  has  remained  unsold  and  the  expected  period  before  which  a  newly 
added  property  for  sale  will  be  sold.  Often  the  first  concept  is  used  in  popular  discus¬ 
sions  when  the  second  may  be  more  relevant.  In  economics  there  is  a  similar  question 
about  the  relationship  between  different  measures  of  unemployment  duration  that  are 
published by government statistical agencies. The issue of unobserved heterogeneity,
as it pertains to the pool of the unemployed, and to the flow into that pool, is closely
involved in these discussions. One of the earlier influential discussions of these issues
was given in Salant (1977).

For specificity, let us focus on the familiar example of unemployment duration.
One statistic that measures the unemployment experience of an already unemployed
individual, published by statistical agencies in many countries, is the average
interrupted duration (AID), which is the average period for which members of the current
stock of unemployed have been unemployed. It is an estimate of the expected elapsed
duration. A second statistic is the period for which a newly unemployed individual can
expect to remain unemployed, often referred to as the average duration of a complete
spell of unemployment (ACD), a measure that features prominently in the job search
literature and is the one that the current and previous chapters have concentrated on.
This is an estimate of the expected length of a completed duration. We may think of AID
as a stock-based measure and ACD as a flow-based measure; the former is analogous to
average age in a population and the latter to the expected life span. The question of
interest is the relationship between the two.

The appropriate statistical tool for handling issues such as these is renewal theory.
The stationary Poisson process with constant intensity parameter is an example of a
renewal process. The number of renewals in a time interval $dt$ refers to the number
of events. Duration is the time between successive occurrences of events (i.e.,
renewals). For an individual in a given state the backward recurrence time refers to the
elapsed duration since renewal, and forward recurrence time refers to the duration
from current state to a transition. The expected number of events, denoted $E[N(t)]$,
in the time interval $(0, t]$ is called the renewal function and
$\lim_{dt \to 0} dE[N(t)]/dt$ is the renewal intensity, which determines the
relationship between ACD and the average backward recurrence time. In what follows, we
concentrate on some well-known results.

Salant (1977) showed that heterogeneity in hazard rates provides a key to understanding
the differences between AID and ACD. His diagrammatic representation provides intuition
into the two key factors that affect the calculated averages. In Figure 18.1 the
vertical axis measures calendar time and the horizontal axis represents the
date of the survey. Stock sampling refers to sampling in the survey period from the
stock of individuals who are then in a given state. In contrast, flow sampling means
that we sample those who enter the state during a particular interval. The lengths of
spells in progress are shown as vertical lines. For illustration nine realizations of
spells are shown and four of these (S6, S7, S8, and S9) are in progress on the survey
date. Five spells (S1, S2, S3, S4, and S5) are completed during the 12-month survey
period. If $U_j$ denotes the length of the $j$th in-progress spell sampled by the survey,
then for our example, AID $= \frac{1}{4} \sum_j U_j$. If $t_i$ denotes the length of the
$i$th completed spell sampled by the survey, then ACD $= \frac{1}{5} \sum_i t_i$.

Now observe that the survey is more likely to capture longer spells than shorter
spells, and this leads to an upward bias that is the result of length-biased sampling.
This type of bias is likely to lead to AID > ACD. However, because the survey measures
only incomplete durations, the average of such incomplete durations is likely
to be shorter than the average of the completed durations. This is the phenomenon
of interruption bias. The answer to the question of which source of bias dominates
depends on the distribution of spell lengths, and this in turn depends on the
distribution of hazard rates. Heterogeneous hazard rates provide a key to understanding
the relationship between the two.

Figure 18.1: Length-biased sampling under stock sampling: examples.
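
The two opposing biases can be seen in a small simulation. The sketch below rests on assumptions of our own choosing: entry dates are uniform over a long window (a stationary inflow), completed durations are exponential with mean 1, and the function name and all numerical settings are hypothetical. It compares the mean elapsed duration in the stock (AID), the mean completed duration of all entrants (ACD), and the mean completed duration of the spells caught in progress.

```python
import random

def simulate_stock_sample(n_spells, mean_duration, window, survey_date, seed=0):
    """Draw spells with uniform entry dates; return elapsed and completed lengths
    of the spells in progress at survey_date, plus all completed lengths."""
    rng = random.Random(seed)
    elapsed, completed_in_stock, all_completed = [], [], []
    for _ in range(n_spells):
        start = rng.uniform(0.0, window)
        length = rng.expovariate(1.0 / mean_duration)
        all_completed.append(length)
        if start <= survey_date < start + length:   # spell is in progress at the survey
            elapsed.append(survey_date - start)
            completed_in_stock.append(length)
    return elapsed, completed_in_stock, all_completed

def mean(x):
    return sum(x) / len(x)

elapsed, stock_completed, flow_completed = simulate_stock_sample(
    n_spells=200_000, mean_duration=1.0, window=100.0, survey_date=90.0)
aid = mean(elapsed)                  # stock-based: average interrupted duration
acd = mean(flow_completed)           # flow-based: average completed duration
stock_mean = mean(stock_completed)   # completed length of spells found in the stock
```

With a constant hazard the two biases offset, so `aid` and `acd` should be close, while `stock_mean` should be roughly twice `acd` because long spells are oversampled.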

The key assumption is that of a stationary environment, which refers to a situation
in which the inflow into and the outflow from the state are equal. Let $f(u)$ denote the
density of interrupted spells and $g(t)$ denote the density of completed spells. Then the
distribution of $u$ is given by

$$
f(u) = \frac{G(u)}{\int_0^\infty G(u)\,du} = \frac{G(u)}{E[t]}, \tag{18.27}
$$

where

$$
G(u) = \int_u^\infty g(x)\,dx
$$

is the survivor function corresponding to the density $g(u)$, and $E[t]$ is the mean of
the distribution of completed durations. For a full derivation of this result and the
underlying assumptions, see Salant (1977) or Lancaster (1990, Section 5.3).

An implication of this result is that if $g(t)$ is exponential, so that the stochastic
process for the event is the Poisson process, then $f(u)$ is also exponential, and the
means of the two duration measures are equal.

Given (18.27), the general relationship between moments of the distributions of $u$
and $t$ can be derived. One useful result links the mean of $u$ to the mean and variance
of $t$:

$$
E[u] = \frac{1}{2}\left(E[t] + \frac{V[t]}{E[t]}\right). \tag{18.28}
$$
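
The link between the mean of $u$ and the first two moments of $t$ can be sketched in a few lines, assuming that $E[t^2]$ is finite so that the boundary term in the integration by parts vanishes:

```latex
\begin{aligned}
E[u] &= \int_0^\infty u\, f(u)\, du
      = \frac{1}{E[t]} \int_0^\infty u\, G(u)\, du \\
     &= \frac{1}{E[t]} \left\{ \left[ \tfrac{1}{2} u^2 G(u) \right]_0^\infty
        + \int_0^\infty \tfrac{1}{2} u^2 g(u)\, du \right\}
      = \frac{E[t^2]}{2\,E[t]} \\
     &= \frac{V[t] + (E[t])^2}{2\,E[t]}
      = \frac{1}{2} \left( E[t] + \frac{V[t]}{E[t]} \right).
\end{aligned}
```

The second line uses $dG(u)/du = -g(u)$, and the last line uses $E[t^2] = V[t] + (E[t])^2$.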

Another interesting result concerns the relationship between $E[t]$ and the mean
completed duration of the constant population with spells in progress (i.e., the
average across the stock of spells in progress). In line with intuition based on
length-biased sampling, the relation is

$$
E[t(S)] = E[t] + \frac{V[t]}{E[t]} > E[t], \tag{18.29}
$$

which says that the mean duration for the constant stock, denoted $E[t(S)]$, exceeds the
average expected duration of a new spell. If $g(t)$ is exponential, then
$E[t(S)] = 2E[t]$ and $E[u] = \frac{1}{2}E[t(S)]$; on average the sampled interrupted
spell will be halfway to completion.

What if the hazard rate is not constant? If the hazard rate is increasing in spell
length (i.e., positive state dependence) then $E[u] < E[t]$, and if it is decreasing
(i.e., negative state dependence) then $E[u] > E[t]$.
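
These comparisons can be checked numerically for a Weibull duration with survivor function $\exp(-\mu t^{\alpha})$, whose hazard rises when $\alpha > 1$ and falls when $\alpha < 1$. The sketch below is illustrative only (function names and the choice $\mu = 1$ are ours). It uses the moment relations $E[u] = \frac{1}{2}(E[t] + V[t]/E[t])$ and $E[t(S)] = E[t] + V[t]/E[t]$ together with the standard Weibull moments.

```python
import math

def mean_interrupted(mean_t, var_t):
    """E[u]: mean elapsed duration of spells in the stock."""
    return 0.5 * (mean_t + var_t / mean_t)

def mean_stock_completed(mean_t, var_t):
    """E[t(S)]: mean completed duration of spells in the stock."""
    return mean_t + var_t / mean_t

def weibull_moments(mu, alpha):
    """Mean and variance of t when S(t) = exp(-mu * t**alpha)."""
    m1 = math.gamma(1.0 + 1.0 / alpha) * mu ** (-1.0 / alpha)
    m2 = math.gamma(1.0 + 2.0 / alpha) * mu ** (-2.0 / alpha)
    return m1, m2 - m1 ** 2

# alpha = 0.5: falling hazard; alpha = 1: constant hazard; alpha = 2: rising hazard.
cases = {}
for alpha in (0.5, 1.0, 2.0):
    m, v = weibull_moments(1.0, alpha)
    cases[alpha] = (m, mean_interrupted(m, v), mean_stock_completed(m, v))
```

Each entry of `cases` holds $(E[t], E[u], E[t(S)])$; the exponential case $\alpha = 1$ reproduces $E[u] = E[t]$ and $E[t(S)] = 2E[t]$.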

Although these results have been obtained under the assumption of a constant
population, they have proved very useful in interpreting and clarifying the connections
among various average measures of duration that are commonly employed. The results
given here are valid regardless of the reason for spell occurrence. They also motivate
a more careful investigation of the shape of the hazard function.


18.7.  Specification  Testing 

Tests  of  specification  in  duration  models  take  several  different  forms,  including  the 
following: 

•  inclusion  and  exclusion  tests  for  covariates, 

•  tests  of  functional  forms  of  the  survival  function, 

•  tests  of  unobserved  heterogeneity,  and 

•  joint  tests  of  state  dependence  and  unobserved  heterogeneity. 

The  first  type  of  specification  test  does  not  raise  new  problems  and  can  be  handled 
by  Wald-type  tests. 

Tests of restrictions on functional form are the same as tests of unobserved
heterogeneity if the restriction is the absence of unobserved heterogeneity. Because the
latter can bias the estimation of the hazard rate, as shown in Section 18.2, diagnostic
testing for unobserved heterogeneity is desirable.

The standard formulation for this is to test whether the heterogeneity (variance)
parameter is zero. If this hypothesis is tested using the restricted model that assumes
zero heterogeneity, a score test is appropriate. The use of the likelihood ratio or Wald
test based on the unrestricted model will be problematic if the hypothesis is a boundary
hypothesis. For example, in the Weibull-gamma model (18.9), the restriction
$1/\delta = 0$ will specialize the model to the Weibull, but this is a boundary
hypothesis. The usual one-degree-of-freedom test statistic then has a weighted
chi-squared distribution under the null, rather than the standard chi-squared
distribution.
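
To see what the weighting does to inference, consider the common textbook case in which a single parameter lies on the boundary and all other parameters are interior, so that the null distribution is an equal-weight mixture of a point mass at zero and a chi-squared(1) variate. That specific weighting is an assumption of this sketch, not a result derived in this section; the function names are ours.

```python
import math

def chi2_1_sf(x):
    """Upper-tail probability of a chi-square(1) variate."""
    return math.erfc(math.sqrt(x / 2.0))

def boundary_pvalue(stat):
    """P-value under an equal-weight mixture of chi-square(0) and chi-square(1)."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_1_sf(stat)

naive = chi2_1_sf(2.706)           # p-value if the boundary is ignored
adjusted = boundary_pvalue(2.706)  # p-value under the assumed mixture
```

Under this assumption the adjusted p-value is half the naive one, so ignoring the boundary makes the test conservative.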


18.7.1.  Hypothesis  Tests 

One type of specification test is a score test of unobserved heterogeneity based on the
exponential null model. Because of possible confounding between heterogeneity and
duration dependence it is desirable to carry out a joint rather than a separate test. This
can be done using the framework of a locally heterogeneous Weibull model (Lancaster,
1985).

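A locally heterogeneous density is generated by taking a Taylor expansion around
$v = 1$ of the Weibull survivor function with multiplicative heterogeneity $v$, where
$v$ has an arbitrary density, yielding

$$
\begin{aligned}
S(t|v) &= e^{-\mu t^{\alpha} v} = e^{-\varepsilon v} \\
&= e^{-\varepsilon}\left[1 + (-\varepsilon)(v - 1) + (\varepsilon^2/2)(v - 1)^2 + O(\varepsilon^3)\right],
\end{aligned}
$$

where $\varepsilon = \mu t^{\alpha}$. From the second line

$$
E[e^{-\varepsilon v}] = e^{-\varepsilon} E[1 + (\varepsilon^2 \sigma^2/2)] = S_m(t),
$$

where the term $\sigma^2$ is the variance of the heterogeneity distribution.
Then

$$
\begin{aligned}
f_m(t) &= -\frac{dS_m(t)}{dt} \\
&= \alpha\mu t^{\alpha-1} e^{-\varepsilon} E[1 + (\varepsilon^2\sigma^2/2)] - e^{-\varepsilon}\left[2\varepsilon(\alpha\mu t^{\alpha-1})\sigma^2/2\right] \\
&= \alpha\mu t^{\alpha-1} e^{-\varepsilon}\left[1 + \sigma^2(\varepsilon^2 - 2\varepsilon)/2\right].
\end{aligned}
$$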


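Using the last result and allowing for censored observations, the log-likelihood is given
by

$$
\begin{aligned}
\ln L(\alpha, \beta, \sigma^2) &= \sum_{i=1}^{N} \ln\left\{[f_m(t_i)]^{\delta_i}[S_m(t_i)]^{1-\delta_i}\right\} \\
&= \sum_{i=1}^{N} \Big\{ \delta_i\left[\ln\alpha + (\alpha - 1)\ln t_i + \ln\mu_i + \ln\left(1 + \sigma^2(\varepsilon_i^2 - 2\varepsilon_i)/2\right)\right] - \varepsilon_i \\
&\qquad\quad + (1 - \delta_i)\ln\left(1 + \sigma^2\varepsilon_i^2/2\right) \Big\},
\end{aligned}
$$

where $\delta_i$ is the censoring indicator, which takes the value one for uncensored
durations and zero otherwise, $\ln\mu_i = \beta_0 + x_i'\beta_1$, and
$\varepsilon_i = \mu_i t_i^{\alpha}$ is the generalized error (Section 18.7.2).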

The null hypothesis of interest is $H_0: \sigma^2 = 0$ and $\alpha = 1$. This is a joint
test of zero unobserved heterogeneity and the exponential distribution specification. Let
$\theta' = (\theta_1', \theta_2')$, $\theta_1' = (\sigma^2, \alpha)$, and
$\theta_2' = (\beta_0, \beta_1')$, and let $\theta_0' = (0, 1, \beta_0, \beta_1')$ denote
the restricted vector.

For simplicity consider only the case of uncensored data. Then the joint score test
statistic is

$$
LM_{HD} = d\, s' \begin{bmatrix} \Psi'(1) & 1 \\ 1 & 1 \end{bmatrix} s, \tag{18.30}
$$

where $s' = \left[\frac{1}{2}\sum_i(\varepsilon_i^2 - 2\varepsilon_i),\ \sum_i\left(1 + (1 - \varepsilon_i)\ln t_i\right)\right]$,
$\Psi'(r)$ denotes the first derivative of the digamma function
$\Psi(r) = d\ln\Gamma(r)/dr$, and $d = 1/(N(\Psi'(1) - 1))$. To implement the test,
$LM_{HD}$ is evaluated at the null (i.e., replacing all quantities by their
estimates under the null of exponential distribution). This test statistic has an
asymptotic $\chi^2(2)$ distribution (Jaggia and Trivedi, 1994).

Notice that the matrix of the quadratic form in the $LM_{HD}$ statistic is not diagonal.
That is, the two components of the joint test are correlated. A separate test of
heterogeneity (duration dependence) has power against duration dependence
(heterogeneity). More explicitly, suppose we consider two separate score tests for
heterogeneity and duration dependence. They are

$$
LM_H = \frac{1}{4N}\left(\sum_i (\varepsilon_i^2 - 2\varepsilon_i)\right)^2, \tag{18.31}
$$

$$
LM_D = \frac{1}{N\Psi'(1)}\left(\sum_i \left(1 + (1 - \varepsilon_i)\ln t_i\right)\right)^2, \tag{18.32}
$$

each of which has a $\chi^2(1)$ distribution under the null. The separate test of zero
unobserved heterogeneity has power against the other null hypothesis because the tests
are correlated; see (18.30). Consequently, inferring the direction of misspecification on
the basis of a separate test can be misleading.
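
The point can be illustrated numerically. The following sketch makes several assumptions of our own: uncensored data, no regressors (so the exponential null is fitted by the sample mean), simulated Weibull durations with shape 2 and no heterogeneity, and scalings of the separate statistics chosen to be consistent with the joint quadratic form, that is, $LM_H = s_1^2/N$ and $LM_D = s_2^2/(N\Psi'(1))$, where $s_1$ and $s_2$ are the two score components. It uses $\Psi'(1) = \pi^2/6$.

```python
import math
import random

TRIGAMMA_1 = math.pi ** 2 / 6.0  # first derivative of the digamma function at 1

def score_tests(t):
    """Return (LM_H, LM_D, LM_HD) for uncensored durations under an
    exponential null with no regressors."""
    n = len(t)
    mu_hat = n / sum(t)                      # exponential MLE of the hazard
    eps = [mu_hat * ti for ti in t]          # generalized errors under the null
    s1 = 0.5 * sum(e * e - 2.0 * e for e in eps)                        # heterogeneity score
    s2 = sum(1.0 + (1.0 - e) * math.log(ti) for e, ti in zip(eps, t))   # duration dependence score
    d = 1.0 / (n * (TRIGAMMA_1 - 1.0))
    lm_hd = d * (TRIGAMMA_1 * s1 * s1 + 2.0 * s1 * s2 + s2 * s2)
    lm_h = s1 * s1 / n
    lm_d = s2 * s2 / (n * TRIGAMMA_1)
    return lm_h, lm_d, lm_hd

rng = random.Random(2024)
# Weibull durations with scale 1 and shape 2: rising hazard, no heterogeneity.
weibull_data = [rng.weibullvariate(1.0, 2.0) for _ in range(500)]
lm_h, lm_d, lm_hd = score_tests(weibull_data)
```

In this design the only departure from the null is duration dependence, yet `lm_h` should also be far above the 5% critical value of 3.84, which is the sense in which a separate heterogeneity test can point in the wrong direction.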

Because the specification of unobserved heterogeneity and state dependence are
closely related, testing hypotheses about them separately can produce misleading
results (Jaggia and Trivedi, 1994). Formally speaking, tests of state dependence in the
presence of incorrectly neglected heterogeneity are biased, and the reverse is also true.
Jaggia (1991c) reanalyzes strike duration data that have been analyzed in a misleading
manner in the econometrics literature. Jaggia and Trivedi (1994) develop some joint
tests for a class of parametric models. See also Bera and Yoon (1993), who consider
the more general issue of hypothesis testing when the model is misspecified.

Useful as these tests are in simple parametric models, the starting point of an
investigation might be a Weibull, Weibull-gamma, or proportional hazard model. In such
cases testing for unobserved heterogeneity, or any other specification error, can be
carried out using the integrated hazard function because in the absence of heterogeneity
the integrated hazard is a unit exponential random variable. We now discuss some
graphical methods for evaluating the fit of the model based on the integrated hazard.


18.7.2.  Graphical  Tools  for  Detecting  Misspecification 

In Section 8.7.2 we developed the concept of generalized residuals. In nonlinear
models a clear-cut choice of such a measure is difficult. In the present context there is
a good choice.


Generalized Residuals

A useful type of test is a nonparametric graphical test of fit of the duration model.
The test uses the generalized residual, which is defined as a certain function of data
and estimated parameters. For a correctly specified model the residuals should behave
approximately like an iid sample from a known distribution. The integrated hazard
turns out to have such a property and hence functions as an ingredient for a
residual-based specification test. In the context of duration models where, from
Section 17.3.1,

$$
S(t|\mu) = \exp[-\Lambda(t|\mu)],
$$

$$
f(t|\mu) = \lambda(t|\mu)\exp[-\Lambda(t|\mu)],
$$

consider the distribution of the generalized residual

$$
\varepsilon = \Lambda(t|\mu) = -\ln(S(t|\mu)). \tag{18.33}
$$

The Jacobian of this transformation is

$$
|J| = dt/d\varepsilon = \frac{1}{d\Lambda(t|\mu)/dt} = 1/\lambda(t|\mu).
$$

Given $f(t|\mu)$, the transformation in (18.33), and the Jacobian of transformation, the
density of $\varepsilon$ is given by

$$
\left[\lambda(t|\mu)\exp(-\varepsilon)\right]\left[1/\lambda(t|\mu)\right] = \exp(-\varepsilon), \tag{18.34}
$$

which does not depend on $\mu$; the density is the unit exponential distribution. This
result was referred to in Sections 17.3.1 and 17.6.7.


Diagnostic  Test  Based  on  Integrated  Hazard 

A diagnostic test can be constructed by exploiting the unit exponential property of the
generalized residual $\varepsilon$ under the null of correct specification. The survivor
function of the generalized residual is $S(\varepsilon) = \exp(-\varepsilon)$. Hence
$-\ln S(\varepsilon) = \Lambda(\varepsilon) = \varepsilon$. For a correctly specified
model, a graphical comparison of the estimated integrated hazard
$-\ln\hat{S}(\hat{\varepsilon})$ with $\hat{\varepsilon}$ should yield an approximately
linear positive relationship with 45° slope. If the plot deviates significantly from the
45° line a misspecification could be indicated.

For example, the estimated integrated hazard for the Weibull model is
$\hat{\varepsilon} = \hat{\mu} t^{\hat{\alpha}}$. Its empirical survivor function is
$\hat{S}(\hat{\varepsilon}) = N^{-1}(\text{number of sample observations} > \hat{\varepsilon})$.

A small formalization of this is to regress $-\ln\hat{S}(\hat{\varepsilon})$ on
$\hat{\varepsilon}$ and an intercept and test whether the intercept is zero and the slope
equals one.
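
The regression check can be sketched as follows for the Weibull case. Several choices here are assumptions made for illustration: the data are simulated from a Weibull with known (made-up) parameters, the residuals are computed at those true values rather than at estimates, and the empirical survivor function counts observations at or above each residual so that its logarithm is finite at the largest residual.

```python
import math
import random

def weibull_residuals(t, mu, alpha):
    """Generalized residuals: the Weibull integrated hazard mu * t**alpha."""
    return [mu * ti ** alpha for ti in t]

def neg_log_empirical_survivor(eps):
    """Sorted residuals and -ln of the fraction of residuals at or above each one."""
    n = len(eps)
    ordered = sorted(eps)
    # The k-th smallest residual (k = 0, 1, ...) has n - k residuals at or above it.
    return ordered, [-math.log((n - k) / n) for k in range(n)]

def ols_line(x, y):
    """Intercept and slope from a least-squares fit of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

rng = random.Random(7)
mu, alpha = 0.5, 1.5   # illustrative parameter values
# If e is unit exponential, t = (e / mu)**(1 / alpha) has survivor exp(-mu * t**alpha).
t = [(rng.expovariate(1.0) / mu) ** (1.0 / alpha) for _ in range(2000)]
x, y = neg_log_empirical_survivor(weibull_residuals(t, mu, alpha))
intercept, slope = ols_line(x, y)
```

For a correctly specified model the fitted line should be close to the 45° line; a formal test would also need standard errors that account for the dependence among the ordered residuals, which this sketch does not attempt.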

The technique may be applied to any parametric model for which the integrated hazard
expression is available. For example, the generalized error for the Weibull-gamma
mixture (easily specialized to an exponential-gamma mixture by setting $\alpha = 1$)
is $\varepsilon = k\ln[(k + \mu t^{\alpha})/k]$. To apply the test, compute
$\hat{\varepsilon}$ given estimates of $(\mu, \alpha, k)$, and then plot
$\hat{\varepsilon}$ against $-\ln\hat{S}(\hat{\varepsilon})$.


Censored  Data 


In the case of censored observations the observed duration is $t = \min[T, L]$, where
$L$ denotes the right-censoring limit. If the duration exceeds $L$ it is censored at $L$.
Then the generalized error $\varepsilon(t)$ is not unit exponential distributed. The
following derivation leads to a relationship that suggests an adjustment for censoring:

$$
\begin{aligned}
E[\varepsilon(T) \mid T > L] &= \int_{\varepsilon(L)}^{\infty} \varepsilon\, \frac{f(\varepsilon)}{S(\varepsilon(L))}\, d\varepsilon \\
&= \frac{1}{e^{-\varepsilon(L)}} \int_{\varepsilon(L)}^{\infty} \varepsilon e^{-\varepsilon}\, d\varepsilon \\
&= e^{\varepsilon(L)}\left[\varepsilon(L) e^{-\varepsilon(L)} + e^{-\varepsilon(L)}\right] \\
&= 1 + \varepsilon(L),
\end{aligned} \tag{18.35}
$$

upon integration by parts and simplification.

This suggests that one might estimate the generalized error as
$\tilde{\varepsilon}(t) = \hat{\varepsilon}(t)$ if data are not censored, and as
$\tilde{\varepsilon}(t) = 1 + \hat{\varepsilon}(L)$ if the observations are censored.
Available results suggest that this technique works reasonably well in the censored
exponential model when censoring is not too heavy (Jaggia and Trivedi, 1994;
Jaggia, 1997).
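
A minimal sketch of the adjustment follows, assuming a Weibull integrated hazard $\mu t^{\alpha}$ (exponential when $\alpha = 1$). The function name and the input values are hypothetical.

```python
def adjusted_residuals(durations, censored, mu, alpha=1.0):
    """Generalized residuals, with 1 added for right-censored observations."""
    out = []
    for t, is_censored in zip(durations, censored):
        eps = mu * t ** alpha          # integrated hazard at the observed duration
        out.append(eps + 1.0 if is_censored else eps)
    return out

# Hypothetical inputs: two completed spells and one spell censored at L = 10.
res = adjusted_residuals([2.0, 5.0, 10.0], [False, False, True], mu=0.2)
```

The censored spell receives its integrated hazard at the censoring point plus one, which is the conditional mean of the unobserved residual.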


18.7.3.  Conditional  Moment  Tests 

The conditional moment framework (see Section 8.2) applied to the generalized
residuals provides a fruitful approach to specification testing. The idea can be
illustrated in the context of tests of unobserved heterogeneity.

The integrated hazard function was shown previously to be distributed as a unit
exponential random variable with mean 1 and variance 1. In this case the conditional
second-moment restriction of interest is $E[(\varepsilon - 1)^2] = V[\varepsilon] = 1$,
or equivalently

$$
E[\varepsilon^2 - 2] = 0.
$$

Higher order moment restrictions can also be generated and tested jointly or separately.
For details see Jaggia (1991a).

18.8.  Unobserved  Heterogeneity  Example: 
Unemployment  Duration 

In this section, we rework the empirical example of Section 17.11 under the assumption
that unobserved heterogeneity is present and can be parameterized within an
analytically tractable parametric model.

Figure 18.2: Unemployment duration: generalized residuals from the exponential model.
U.S. data from 1986-92 on 3343 spells, some incomplete.


As discussed in Section 18.7.2, we can use a graphical tool to examine the possible
presence of unobserved heterogeneity by looking at the estimated fit of the model. For
a correctly specified model, the residuals should follow the unit exponential
distribution. One can evaluate the model fit informally by computing and plotting the
empirical cumulated hazard function against the generalized residual. For a correctly
specified model the plot should exhibit an approximate straight line with slope
one.

Figures  18.2  and  18.3  show  the  generalized  residual  plots  for  the  exponential  model 
without  and  with  (gamma)  heterogeneity,  respectively.  As  we  can  see  from  the  two 
graphs,  the  fit  of  the  model  improves  only  marginally  after  we  introduce  unobserved 
heterogeneity. 

Figure 18.3: Unemployment duration: generalized residuals from the exponential-gamma
model. Same data as Figure 18.2.

Table 18.1. Unemployment Duration: Exponential Model with Gamma and IG Heterogeneity

              Exponential-Gamma         Exponential-IG
Variable      Coeff.         t          Coeff.         t
RR             0.501       0.817         0.504       0.821
DR            -0.882      -1.118        -0.807      -1.032
UI            -1.585      -6.043        -1.545      -5.994
RRUI           1.091       1.725         1.057       1.686
DRUI           0.057       0.055        -0.013      -0.012
LNWAGE         0.379       3.184         0.373       3.156
CONS          -4.095      -4.507        -4.097      -4.545
σ²             0.232       3.178         0.207       2.925
-ln L        2695.35                   2696.48

This result can be verified by the actual estimates shown in Table 18.1, which
also presents the estimates of the exponential model with inverse-Gaussian (IG)
heterogeneity. Although there is evidence of significant unobserved heterogeneity, the
estimates of coefficients in these two settings do not differ much from those
obtained earlier without unobserved heterogeneity. Unobserved heterogeneity would be
expected to have a large impact on the duration dependence parameter, but that
parameter is absent from the exponential model.

However,  a  more  interesting  case  arises  when  we  consider  a  model  with  duration 
dependence  and  unobserved  heterogeneity.  Without  presuming  that  it  is  the  “correct” 
model,  we  consider  the  Weibull  distribution-inverse  Gaussian  mixture  model.  For  ease 
of  comparison,  we  present  these  estimates  in  Table  18.2  along  with  the  estimates  when 
unobserved  heterogeneity  is  neglected. 

The introduction of unobserved heterogeneity has a substantial impact on the duration
dependence parameter, which increases from 1.129 in Table 17.8 to 1.753 in
Table 18.2. The latter implies a more steeply rising hazard rate out of unemployment
than was the case when unobserved heterogeneity was ignored. Recall from Section
18.2.4 that one of the consequences of neglected heterogeneity in a proportional
hazards model is to underestimate the hazard rate; so the aforementioned empirical
finding is consistent with theory. Second, note that the evidence for unobserved
heterogeneity is very strong; the estimated variance parameter $\sigma^2$ has a t-ratio
exceeding 11. Third, the fit of the model, as reflected in the log-likelihood, has also
improved (from -2687.6 to -2616.6). Although there is not much qualitative change in the
estimates of the coefficients, the effects of the significant coefficients (UI, LNWAGE,
and CONS) have become more pronounced after unobserved heterogeneity is
introduced.
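
The improvement in fit can also be put on a penalized-likelihood footing. The sketch below uses the reported values of -ln L (2687.6 for the Weibull, 2616.6 for the Weibull-IG) and N = 3343 spells. The parameter counts are an assumption: they count only the rows shown in the tables (seven coefficients plus the duration dependence parameter, plus the variance parameter for the mixture), and the criteria take the forms -2 ln L + 2k and -2 ln L + k ln N.

```python
import math

def aic(neg_loglik, k):
    """Akaike criterion: -2 ln L + 2k."""
    return 2.0 * neg_loglik + 2.0 * k

def bic(neg_loglik, k, n):
    """Bayesian (Schwarz) criterion: -2 ln L + k ln N."""
    return 2.0 * neg_loglik + k * math.log(n)

N = 3343  # number of spells reported for these data
# (-ln L, assumed parameter count): only the rows shown in the tables are counted.
models = {"Weibull": (2687.6, 8), "Weibull-IG": (2616.6, 9)}
table = {name: (aic(nll, k), bic(nll, k, N)) for name, (nll, k) in models.items()}
best_by_bic = min(table, key=lambda name: table[name][1])
```

Because the two models differ by one parameter while -ln L falls by 71, the ranking does not depend on the exact parameter count assumed here.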

The improvement in the fit of the model notwithstanding, the new mixture model
could still be misspecified. Once again we use the graphical device as an informal
specification test. Figures 18.4 and 18.5 plot the generalized residuals from the Weibull
model without and with unobserved heterogeneity, respectively. The plots suggest that
the mixture model, despite being more general than the exponential-IG model, appears to
be misspecified. To reiterate, although a simpler model that allows for neither
duration dependence nor unobserved heterogeneity shows little graphical evidence of
misspecification, an “improved” specification that generalizes the model in both
directions appears to be misspecified, a result similar to that of Jaggia (1991c). The
apparent puzzle may be resolved by the argument that the interaction between
heterogeneity and duration dependence accounts for the result. The Weibull model assumes
monotonic hazards. However, McCall (1996) provides evidence based on the same data that
a bathtub-shaped hazard function is more appropriate. He specifies a polynomial
baseline hazard function that is less restrictive than the monotonic function used here.
Thus a reasonable interpretation of our results is that a model that simultaneously
allows for both unobserved heterogeneity and duration dependence makes it easier to
detect misspecification than a model that ignores both.

Table 18.2. Unemployment Duration: Weibull Model with and without IG Heterogeneity

              Weibull-IG                Weibull
Variable      Coeff.         t          Coeff.         t
RR             0.736       0.812         0.448        0.70
DR            -1.073      -0.933        -0.427       -0.53
UI            -2.575      -6.698        -1.496       -5.67
RRUI           1.734       1.857         1.105        1.57
DRUI          -0.061      -0.039        -0.299       -0.28
LNWAGE         0.576       3.259         0.37         2.99
CONS          -5.303      -3.953        -4.358       -4.74
α              1.753      44.19          1.129       51.44
σ²             6.377      11.149           -            -
-ln L        2616.6                    2687.6

Figure 18.4: Unemployment duration: generalized residuals from the Weibull model. Same
data as Figure 18.2. (Panel title: Weibull Model Residuals; horizontal axis: Generalized
(Cox-Snell) Residual.)

Figure 18.5: Unemployment duration: generalized residuals from the Weibull-Inverse
Gaussian model. Same data as Figure 18.2. (Panel title: Weibull-IG Model Residuals;
horizontal axis: Generalized (Cox-Snell) Residual.)

Finally, we implement a parametric test for the presence of unobserved heterogeneity.
The purpose is to illustrate some of the theory discussed in Section 18.7. The
score test for neglected heterogeneity developed in Section 18.7.1 assumed uncensored
data. Because the data used here include right-censored observations we implement the
score test for the censored sample developed by Jaggia (1997).

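We wish to test for zero unobserved heterogeneity, $H_0: \sigma^2 = 0$, in the
exponential duration model. Let the parameter set be denoted by
$\theta = (\sigma^2, \beta)$ and let $s(\theta_0)$ and $\mathcal{I}(\theta_0)$ be,
respectively, the score and the information matrix calculated under the null. Using the
log-likelihood derived in Section 18.7.1, we can write
$s(\theta_0) = (s_1(\theta_0), s_2(\theta_0))$, where

$$
s_1(\theta_0) = \left.\frac{\partial \ln L}{\partial \sigma^2}\right|_{H_0}
= \frac{1}{2}\sum_i \left(\varepsilon_i^2 - 2\delta_i\varepsilon_i\right)
\quad \text{and} \quad
\mathcal{I}(\theta_0) = -E\left[\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right]_{H_0}.
$$

The score test for unobserved heterogeneity is then given by

$$
LM = s_1'(\tilde{\theta}_0)\,\mathcal{I}^{11}(\tilde{\theta}_0)\,s_1(\tilde{\theta}_0) \sim \chi^2(1), \tag{18.36}
$$

where $\mathcal{I}^{11} = [\mathcal{I}_{11} - \mathcal{I}_{12}(\mathcal{I}_{22})^{-1}\mathcal{I}_{21}]^{-1}$
is the first diagonal component of the partitioned inverse of $\mathcal{I}(\theta)$,
given in Jaggia (1997), and the tilde superscript is used to denote restricted maximum
likelihood estimates.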

For our sample, we found that LM = 44.25, which far exceeds the critical value
of $\chi^2(1)$, and hence we reject the null of $\sigma^2 = 0$. This result is consistent
with that from the Weibull-gamma and Weibull-IG models where a significant improvement
in the fit of the model resulted from introduction of unobserved heterogeneity. As
previously noted, this test also has power against misspecified duration dependence.


18.9.  Practical  Considerations

The issue of interaction between the hazard function and unobserved heterogeneity has
generated a huge literature. One point of view that is well documented states
essentially that if the hazard function is well specified then the precise parametric
specification of the heterogeneity distribution is relatively innocuous (Manton et al.,
1986). This view implies that rather than parametrically modeling unobserved
heterogeneity we can simply use robust variance estimates, given that the hazard function
is well specified. Other studies suggest that parametric specification of the
heterogeneity distribution is not innocuous (Heckman and Singer, 1984a) and that it is
desirable to use a nonparametric specification. Some highly influential work has
advocated use of a discrete hazard model with a very flexible specification of the hazard
function, combined with a parametric assumption about heterogeneity (Meyer, 1990; Han and
Hausman, 1990). Finally, as a compromise between all the foregoing positions, some
researchers use the Han-Hausman discrete-time approach, or a high-order polynomial
hazard function, and combine it with the Heckman-Singer approach of nonparametric
heterogeneity. However, as Baker and Melino (2000) have pointed out, this may lead to
overparameterization that is far from innocuous. Hence it seems sensible to approach
this issue with caution, and use parsimonious models in preference to models saturated
with heterogeneity parameters.

The  Cox  PH  model  has  a  central  place  in  the  biometrics  literature.  When  there  is  no 
intrinsic  interest  in  the  baseline  hazard  function  then  this  seems  an  attractive  choice  of 
functional form. It is often a good place to start modeling. However, unobserved heterogeneity
is important in most econometric specifications and should not be ignored.

Many  statistical  packages  offer  a  choice  of  standard  parametric  duration  models  that 
can  be  combined  with  any  of  the  standard  (gamma,  inverse-Gaussian,  or  log-normal) 
heterogeneity  (“frailty”)  specifications.  Although  this  is  a  very  convenient  option  to 
use,  discrete  hazard  models  hold  greater  appeal  as  they  provide  greater  flexibility  and 
a  better  match  with  economic  data. 

The  implementation  of  the  EM  algorithm  for  the  latent  class  model  often  suffers 
from  slow  computational  speed.  Direct  maximization  of  the  likelihood  is  often  both 
feasible  and  efficient. 


18.10.  Bibliographic  Notes 

18.2  There  are  many  papers  that  discuss  the  specification  of  the  heterogeneity  distribution 
and consequences of misspecification. Vaupel et al. (1979) provide a good discussion
of the properties of the gamma model. Hougaard (1984) considers several alternatives
to the gamma. Hougaard (1995) gives a survey of heterogeneity models.
Heckman  and  Singer  (1984a)  advocate  nonparametric  specification  and  emphasize 
the  sensitivity  to  misspecification.  Manton  et  al.  (1986)  attempt  to  disentangle  the 
relative  importance  of  misspecifying  the  hazard  and  heterogeneity,  suggesting  that 
the  former  is  critical. 

18.3  Van  den  Berg  (2001)  provides  a  thorough  and  accessible  treatment  of  and  further 
references  on  the  identification  of  the  MPH  model. 
18.4 Han and Hausman (1990) and Meyer (1990) offer good empirical examples that combine
flexible hazard specifications with parametric assumptions about heterogeneity.

18.5  The  paper  by  Heckman  and  Singer  (1984a)  is  an  early  discussion  of  the  discrete 
heterogeneity  model.  The  finite  mixture  model  of  unobserved  heterogeneity  is  also 
commonly  referred  to  as  the  “nonparametric  heterogeneity”  model.  Baker  and  Melino 
(2000) describe a Monte Carlo study of duration dependence and nonparametric heterogeneity.
They consider models with very flexible specification of duration dependence
with nonparametric heterogeneity. Their results suggest that, when both are
present, the strategy of having many finite mixture components in the likelihood generates
large biases and unreliable results. Using the BIC or the Hannan-Quinn criterion,
which penalizes overparameterization, can be helpful.

18.6 Lancaster (1990) and Salant (1977) are excellent references on length-biased sampling.
Lancaster provides foundational material on renewal theory that forms the basis
of several key results. Also see Taylor and Karlin (1994).

18.7  There  are  many  papers  on  specification  testing  in  duration  models,  most  of  them 
handling  the  easier  case  of  no  censoring.  Kiefer  (1988)  provides  an  overview.  Jaggia 
(1991a)  offers  a  brief  but  clear  introduction  to  the  conditional  moment  approach  to 
specification  testing  (which  is  also  summarized  in  Greene  (2003)).  As  yet  untried 
in  the  context  of  duration  models  is  a  very  general,  but  computationally  demanding, 
approach  to  specification  testing  due  to  Andrews  (1997).  Model  selection  issues  for 
finite  mixture  models  are  discussed  in  Cameron  and  Trivedi  (1998,  chapter  6),  in  the 
context  of  count  models.  A  good  introduction  to  model  diagnostics  based  on  different 
types  of  residuals  for  duration  models  is  given  in  Hosmer  and  Lemeshow  (1999, 
pp.  196-240). 

18.8  Lancaster’s  (1979)  classic  empirical  paper  analyzes  unemployment  duration  in  the 
context of a Weibull-gamma mixture model. Jaggia (1991c) studies misspecification
in a strike duration model using a generalized gamma model that nests several
popular  specifications.  His  paper  also  highlights  the  difficulty  of  making  inferences 
from  overly  restrictive  models.  A  number  of  other  applications  of  duration  models 
are  covered  in  Chapter  19. 


- Exercises - 

18-1 (Adapted from Sapra, 2002) The analysis of Section 18.2 shows the effects
of unobserved heterogeneity on the unconditional or averaged hazard function.
The result that neglected heterogeneity leads to underestimation of the slope of
the average hazard function is emphasized. Let the conditional hazard function
be $\lambda_c(t \mid v) = v\lambda_0(t)$, where $\lambda_0$ denotes the baseline or unconditional hazard function.
Show that (i) the unconditional hazard $\lambda_u(t) < \lambda_0(t)$ and (ii) $d\lambda_u(t)/dt < 0$ in
each of the following examples.

(a) $v \sim \mathrm{Uniform}[0, 1]$ and $\lambda_0(t) = 1 \;\forall\, t$.

(b) $v$ follows a unit exponential distribution with pdf $g(v) = e^{-v}$ and $\lambda_0(t) = p \exp(\gamma t)$, $p > 0$, $\gamma < 0$.
18-2 Reconsider the Weibull-gamma model of Section 18.2.3 after replacing the
gamma distributed heterogeneity assumption by the assumption that heterogeneity
is distributed according to the log-normal distribution with unit mean.

(a)  Verify  that  in  this  case  an  analytical  expression  for  the  unconditional  hazard 
function  is  not  obtainable. 

(b) Substitute the integral expression for unconditional hazard into the log-likelihood
given in Section 17.6.3. Using the simulation-based maximum
likelihood  approach  of  Section  12.4,  describe  an  estimation  algorithm  that 
details  the  various  steps  involved  in  likelihood  maximization. 

18-3 Consider the exponential-gamma mixture. This model is a special case of an
MPH model. The survivor function, conditional on a multiplicative heterogeneity
factor $v$, for the exponential model is $S(t \mid v) = \exp(-\mu t v)$, $\mu > 0$. The unconditional
survivor function is given by the average survivor function. Averaging
is across the heterogeneous population using $g(v)$, the density of $v$, as the
weighting function, so $S(t) = \int_0^\infty S(t \mid v) g(v)\,dv$. Assume that $v$ is (two-parameter)
gamma distributed with $g(v) = \delta^k v^{k-1} \exp(-\delta v)/\Gamma(k)$.

(a) Show that, given gamma heterogeneity, $S(t) = (1 + \mu t/\delta)^{-k}$.

(b) Derive expressions for the unconditional duration density function $f(t)$ and
the unconditional hazard function $\lambda(t)$. These general expressions can be
specialized by setting the mean of $v$ at 1; that is, set $k = \delta$, which leads to the
exponential-gamma mixture. Compare the mean and variance properties of
this mixture distribution with those of the original exponential distribution.

(c) Suppose that the random variable $v$ has a two-point distribution such that
with probability $\pi$ it takes the value $v_1$ and with probability $(1 - \pi)$ it takes the
value $v_2$. What are the implications of this assumption for the specification
of the unconditional survivor function? Explain your answer.

18-4 Using the sample of the McCall data set from the empirical exercise in the previous
chapter, reestimate the Weibull model for those transiting to full-time employment
(CENSOR1 = 1) under the assumption that unobserved heterogeneity
(also called frailty in some computer packages, which may also have a subcommand
for specifying it) has a gamma distribution.

(a)  Using  generalized  residuals  as  in  Section  18.7.2  test  the  hypothesis  of 
model  misspecification. 

(b) Does the new model display a duration dependence property? Does it provide
a better fit to the data? Explain the results by reference to the interaction
between  unobserved  heterogeneity  and  duration  dependence. 

(c) Repeat the exercise of part (a) under the assumption of log-normal heterogeneity.
Are the results about duration dependence significantly different
from  those  for  the  gamma  heterogeneity? 
CHAPTER 19


Models  of  Multiple  Hazards 


19.1.  Introduction 

This  chapter  deals  with  several  different  duration  models  that  can  be  interpreted 
broadly as multivariate models, a category that covers both parallel and repeated transitions.
Any transition model that involves more than one destination state can be regarded
as a multivariate model because the analysis will involve joint distribution of
two  or  more  durations.  The  models  we  consider  arise  in  a  variety  of  ways  and  apply 
to  several  different  types  of  data.  Despite  their  differences,  they  are  grouped  in  this 
chapter  for  reasons  of  organizational  convenience. 

To  be  concrete  consider  some  examples.  A  familiar  model  from  labor  economics 
involves  a  transition  from  unemployment  to  employment  or  out  of  the  labor  force.  The 
first  transition  can  be  further  broken  into  return  to  the  old  job  or  to  a  new  job.  These 
destinations  are  mutually  exclusive.  An  unemployment  spell  may  end  by  a  transition 
to any one of the destinations. A variant of this example considers an unemployed individual
who could find either a new full-time or part-time job or remain unemployed.
Thus  there  are  three  possible  states  (destinations).  The  models  of  Chapters  17  and  18 
dealt  with  transitions  between  two  states.  One  can  still  use  the  two-state  methods  to 
handle  such  data.  For  example,  state  1  could  be  that  of  full-time  employment  and  state 
0  could  be  any  other  state.  This  would,  as  before,  involve  modeling  one  hazard  rate. 
However,  one  could  also  characterize  this  situation  in  terms  of  a  model  with  three 
states and two transitions and hence two hazard functions, one specific to each destination
state. More generally, there will be a number of failure types and we may wish
to  model  the  transition  from  a  given  state  to  any  one  of  the  failure  types.  In  this  chapter 
we  wish  to  extend  the  conceptual  tools  developed  in  the  previous  two  chapters  to  deal 
with  multiple  hazards  (failures)  or  a  multivariate  duration  model. 

The  important  issues  are  as  follows: 

1. How does one model the relation between covariates and failures of different types?

2. How does one model interaction between failure types under a specific set of study conditions?

3. How does one estimate failure rates for certain types of failures given the “removal” of some or all other failure types?

A  multivariate  duration  model  involves  simultaneous  modeling  of  all  transitions, 
that  is,  joint  specification  and  estimation  of  two  or  more  hazard  functions.  There  are 
several  possible  frameworks  for  analyzing  multivariate  duration  data;  the  competing 
risks  framework  is  one  of  the  most  popular.  McCall  (1996)  provides  an  empirical 
application  of  the  competing  risks  framework  to  unemployment  data  with  focus  on 
the  role  of  unemployment  insurance.  Using  an  approach  similar  to  McCall’s,  Deng, 
Quigley,  and  Van  Order  (2000)  study  the  transitions  of  mortgage  holders  to  the  states 
of  prepayment  or  termination  of  mortgages. 

What  is  the  motivation  for  and  the  gains  from  joint  modeling  of  hazards?  If  the 
different  hazards  are  essentially  independent  then  separate  and  joint  modeling  will 
produce  the  same  results.  However,  different  hazards  may  be  linked;  for  example,  there 
may  be  present  a  common  unobserved  heterogeneity  term  in  each  hazard  function. 
Alternatively,  each  hazard  may  include  an  unobserved  heterogeneity  term  with  one  or 
more  common  shared  components,  leading  to  correlated  hazards. 

A second class of examples involves a case of parallel events in which one analyzes
the joint distribution of durations to destinations. For example, the pair $(T_1, T_2)$
could be the duration of unemployment and duration without health insurance. Here
the  motivation  for  joint  estimation  of  the  hazards  could  be  similar  to  that  previously 
outlined. 

A  third  example  involves  joint  distribution  of  lengths  of  repeat  spells  in  the  same 
state  (e.g.,  unemployment,  or  without  health  insurance).  That  is,  for  a  given  individual, 
one  wants  to  simultaneously  model  the  hazards  of  terminating  a  spell.  If  the  spells  in 
question  are  independent,  then  they  can  be  analyzed  by  single-spell  methods  of  earlier 
chapters.  If  the  researcher  wants  to  study  the  dependence  structure  of  the  transitions, 
then  joint  modeling  of  spells  in  a  given  state  is  appropriate.  New  models  and  methods 
are  called  for  when  the  spells  are  dependent.  This  last  example  is  potentially  more 
complex than the preceding ones because of possible dependence between events separated
by time intervals. For example, the length and type of a previous spell, or more
generally the past history of spells, may affect the probability and length of a succeeding
spell; or the unobserved characteristics of the individual may persist over successive
spells. Such serially correlated unobserved heterogeneity creates a link between
repeat spells. Even the occurrence probability of an event may depend on previous
occurrence of the same event. Heckman and Borjas (1980) characterize several structural
types of state dependence for an individual using concepts such as occurrence
dependence  and  (Markovian)  lagged  duration  dependence. 

Corresponding to these different data situations are a variety of models in the literature.
However, though they might appear to be a disparate selection they are linked
by several common threads. After introducing the basic concepts, in Section 19.2 we
examine the popular competing risks model. In Section 19.3 we consider a multivariate
model based on marginal distributions of a set of survival times and introduce the
copula approach to joint modeling of survival times. Multiple-spell modeling is considered
in Section 19.4.
19.2. Competing Risks

First, we introduce some concepts that are used in the competing risks model
(CRM) and in other multivariate formulations. Often these are extensions of concepts
already introduced in Chapter 17. The basic CRM formulation is applicable to modeling
time in one state when exit is to a number of competing states, such as different
causes of death. The CRM is attractive because it is relatively straightforward to implement
if the model is a PH model.


19.2.1.  Basic  Concepts 

We  now  consider  the  CRM  in  which  there  are  m  latent  duration  or  failure  times,  one 
for  each  distinct  competing  cause  of  failure. 


Latent  Durations 

The setup of the model is as follows. Each subject has an underlying failure time,
which is subject to censoring. Failure time may be one of $m$ different types, given by
the set $J = \{1, \ldots, m\}$. We may think of this as a situation with $m$ distinct causes of
transition from a given state (“death”). However, the occurrence of a failure of one
kind of event removes the individual from risks of other kinds of events. Therefore,
given censoring of the remaining $(m - 1)$ durations for each individual, we observe at
most one complete duration.

In a CRM with $m$ types of failures, there are $m + 1$ states $\{0, 1, \ldots, m\}$, where
0 represents the initial state and $\{1, \ldots, m\}$ are possible destination states. For the
$i$th individual the data vector is of the form $(x_i, t_i, d_{1i}, \ldots, d_{mi}, d_{ci})$, where $x_i$ is a
vector of weakly exogenous covariates that measure the characteristics of $i$,
$t_i = \min(t_{1i}, \ldots, t_{mi}, t_{ci})$, where $t_{ki}$ denotes the time to transition to the $k$th destination,
$t_{ci}$ denotes the time of censoring, and $d_{ji} = 1(t_{ji} = t_i)$, $j = 1, \ldots, m, c$, are dummy
variables that take the value one if $t_{ji} = t_i$. Because we only observe one of the $t_{ji}$, the
remaining are interpreted as latent variables.
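
As a minimal sketch of this data structure, the following Python fragment builds the observed duration and the indicator variables from simulated latent durations. The exponential latent durations, the rates, the fixed censoring time, and the function name `observe` are illustrative assumptions and do not come from the text.

```python
import numpy as np

def observe(latent, censor):
    """Map latent durations to observed competing risks data.

    latent : (n, m) array of latent durations t_ji, one column per destination
    censor : (n,) array of censoring times t_ci
    Returns the observed duration t_i and an (n, m + 1) array of indicators
    (d_1i, ..., d_mi, d_ci), exactly one of which equals one in each row.
    """
    all_times = np.column_stack([latent, censor])   # treat censoring as an extra risk
    t = all_times.min(axis=1)                       # t_i = min(t_1i, ..., t_mi, t_ci)
    first = all_times.argmin(axis=1)                # which duration is the shortest
    d = np.zeros(all_times.shape, dtype=int)
    d[np.arange(len(t)), first] = 1                 # d_ji = 1(t_ji = t_i)
    return t, d

rng = np.random.default_rng(0)
n, rates = 5, np.array([0.5, 1.0])                  # hypothetical rates, m = 2
latent = rng.exponential(1.0 / rates, size=(n, 2))  # independent latent durations
censor = np.full(n, 2.0)                            # assumed fixed censoring time
t, d = observe(latent, censor)
```

Only the shortest of the latent durations is recorded for each individual; the other entries of `latent` are never seen by the analyst, which is why they are treated as latent variables.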

Censoring  may  be  regarded  as  a  competing  risk.  It  operates  on  individuals  according 
to  a  probability  distribution.  In  this  chapter  the  censoring  variable  is  assumed  to  be 
independent of the $(t_1, \ldots, t_m)$.

Unobserved characteristics of $i$ are subsumed under unobserved heterogeneity, denoted
as $v$. If $v$ varies with cause of exit, then we write it as $v_j$, $j = 1, \ldots, m$.


Competing  Causes 

A  standard  example  of  competing  risks  is  death  from  competing  causes.  Consider 
an individual who has had a kidney transplant operation and is “at risk” of transiting
to the healthy state, or to rejection, or to some other unhealthy condition such
as a liver complication. Succumbing to any one condition means that transition to
other states is not possible. So in an $m$-event setup, each event provides one complete
duration and $m - 1$ censored durations. Thus we have a situation of “competing
risks” in which there is competition to determine the transplant patient’s destination
state.

Although discrete-time models are often required in empirical applications, our exposition
of the joint hazard formulation uses the continuous-time framework and generally
follows the exposition given in Mealli and Pudney (1996). We also assume that
we  have  single-spell  data. 

The model provides the joint distribution of the spell duration, denoted $\tau$, and
the exit route $r$, which is an integer variable that takes one of the values in the set
$\{1, 2, \ldots, m\}$.

We ignore censoring for simplicity and assume that there exist latent variables
$(t_1, \ldots, t_m)$, one for each destination, that correspond to the spell duration for each
possible exit route by which the spell may terminate if there were no other risk factors
that might cause the spell to end sooner. Destination-specific covariates are denoted by
$x_j$ ($j = 1, \ldots, m$). We observe one duration, $\tau$, where

$$\tau = \min(t_1, \ldots, t_m) = \min_j (t_j), \quad t_j > 0, \tag{19.1}$$

at the termination of the spell; that is, only the shortest duration is observed and the
rest are censored. Censoring owing to factors other than exit is not considered. Then

$$\Pr[\tau > t] = \Pr[t_1 > t, \ldots, t_m > t] = S_\tau(t), \tag{19.2}$$


which is the joint survivor function. If the risks are independent then

$$\Pr[\tau > t] = \Pr[t_1 > t] \times \Pr[t_2 > t] \times \cdots \times \Pr[t_m > t]. \tag{19.3}$$

The corresponding exit route $r$ is given by

$$r = \arg\min_j (t_j). \tag{19.4}$$


Let $g_j(t)\,dt$ denote the probability of succumbing to risk $j$ in the interval $(t, t + dt)$;
then the total hazard rate applicable to all causes is

$$\lambda_\tau(t) = -\frac{d}{dt} \ln S_\tau(t) = \sum_{j=1}^{m} g_j(t).$$

In biostatistics this is referred to as the total force of mortality (David and
Moeschberger, 1978). If risks are independent, then the hazard rate for a specific cause
$j$ is $\lambda_j(t) = g_j(t)$. This means that the probability of failure from cause $j$ in $(t, t + dt)$,
conditional on survival to $t$, is the same whether $j$ is one of the risks or the only risk.
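
The additivity of hazards under independence can be checked numerically. The sketch below assumes Weibull latent durations with hazard $\gamma\alpha t^{\alpha-1}$ and survivor function $\exp(-\gamma t^{\alpha})$; this functional form and the parameter values are hypothetical choices made only for illustration. It compares a finite-difference derivative of the log joint survivor function with the sum of the cause-specific hazards.

```python
import numpy as np

def weibull_hazard(t, gamma, alpha):
    # lambda(t) = gamma * alpha * t**(alpha - 1)
    return gamma * alpha * t ** (alpha - 1.0)

def weibull_survivor(t, gamma, alpha):
    # S(t) = exp(-gamma * t**alpha)
    return np.exp(-gamma * t ** alpha)

params = [(0.5, 1.5), (0.2, 0.8)]   # hypothetical (gamma_j, alpha_j) pairs, m = 2
t, h = 1.3, 1e-6                    # evaluation point and finite-difference step

def log_joint_survivor(s):
    # With independent risks the joint survivor is the product of the marginals
    return sum(np.log(weibull_survivor(s, g, a)) for g, a in params)

# Total hazard as minus the derivative of the log joint survivor function
total_hazard = -(log_joint_survivor(t + h) - log_joint_survivor(t - h)) / (2 * h)
sum_of_hazards = sum(weibull_hazard(t, g, a) for g, a in params)
```

The two quantities agree up to finite-difference error, which is the sense in which the total force of mortality is the sum of the cause-specific hazards when risks are independent.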
The probability of surviving risk $j$ in the interval $(T_1, T_2)$, conditional on surviving
to $T_1$, is determined by the integrated hazard over that interval,

$$\int_{T_1}^{T_2} \lambda_j(t)\,dt = \int_0^{T_2} \lambda_j(t)\,dt - \int_0^{T_1} \lambda_j(t)\,dt = -\ln S_j(T_2) + \ln S_j(T_1) = -\ln \frac{\Pr[t_j > T_2]}{\Pr[t_j > T_1]}, \tag{19.5}$$

or equivalently

$$\exp\left(-\int_{T_1}^{T_2} \lambda_j(t)\,dt\right) = \frac{\Pr[t_j > T_2]}{\Pr[t_j > T_1]}. \tag{19.6}$$

One minus the left-hand side expression is referred to as the net probability of death
from cause $j$ in the interval $(T_1, T_2)$. The expression in (19.6) is useful for building up
the likelihood function for estimation.
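
As a simple illustration, suppose (purely as an assumption for this example) that the hazard for cause $j$ is constant, $\lambda_j(t) = \lambda_j$. Then the conditional survival probability and the net probability of death over $(T_1, T_2)$ are

$$\exp\left(-\int_{T_1}^{T_2} \lambda_j\,dt\right) = \exp[-\lambda_j (T_2 - T_1)], \qquad 1 - \exp[-\lambda_j (T_2 - T_1)],$$

so that both depend only on the length of the interval and not on its location.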


Independent  Risks 


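We can now explicitly bring into the picture covariates that affect the hazard rate. We
assume independent risks (as opposed to correlated risks) and consider the distribution
of $t_j$. The hazard rate for failure of the $j$th type is defined by

$$\lambda_j(t_j \mid x_j) = \lim_{\Delta t_j \to 0} \frac{\Pr[t_j \le T < t_j + \Delta t_j \mid T \ge t_j, x_j]}{\Delta t_j},$$

and the integrated hazard $\Lambda_j(t_j \mid x_j)$ for the $j$th type risk is defined by

$$\Lambda_j(t_j \mid x_j) = \int_0^{t_j} \lambda_j(s \mid x_j)\,ds.$$

Then the duration density is

$$f_j(t_j \mid x_j, \beta_j) = \lambda_j(t_j \mid x_j, \beta_j)\, S_j(t_j \mid x_j, \beta_j) = \lambda_j(t_j \mid x_j, \beta_j) \exp[-\Lambda_j(t_j \mid x_j, \beta_j)],$$

using the relation between survivor and integrated hazard functions. Defining
$x = [x_1, \ldots, x_m]'$ and $\beta = [\beta_1, \ldots, \beta_m]'$ gives the joint density of $\tau$ and $r$:

$$\begin{aligned}
f(\tau, r \mid x, \beta) &= f_r(\tau \mid x_r, \beta_r) \prod_{j \ne r} \exp[-\Lambda_j(\tau \mid x_j, \beta_j)] \\
&= \lambda_r(\tau \mid x_r, \beta_r) \exp[-\Lambda_r(\tau \mid x_r, \beta_r)] \times \prod_{j \ne r} \exp[-\Lambda_j(\tau \mid x_j, \beta_j)] \\
&= \lambda_r(\tau \mid x_r, \beta_r) \prod_{j=1}^{m} \exp[-\Lambda_j(\tau \mid x_j, \beta_j)].
\end{aligned} \tag{19.7}$$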

The  first  line  follows  from  the  product  of  conditional  and  marginal  probabilities.  The 
second  term  on  the  right-hand  side  is  the  product  of  survival  probabilities  for  all  exit 
routes  other  than  r,  which  uses  the  independence  of  risks  assumption. 
Equation (19.7) implies that

$$\begin{aligned}
f(\tau, r \mid x, \beta) &= \lambda_r(\tau \mid x_r, \beta_r) \exp\left[-\sum_{j=1}^{m} \Lambda_j(\tau \mid x_j, \beta_j)\right] \\
&= \lambda_r(\tau \mid x_r, \beta_r) \exp[-\Lambda_a(\tau \mid x, \beta)],
\end{aligned} \tag{19.8}$$

where $\Lambda_a(\tau \mid x, \beta) = \sum_{j=1}^{m} \Lambda_j(\tau \mid x_j, \beta_j)$ is the aggregate or overall integrated hazard.
This last equation simply says that the total hazard of leaving the origin state is the
sum of hazards for all destinations. The overall survivor function is

$$S(\tau) = \exp(-\Lambda_a(\tau)). \tag{19.9}$$

The  likelihood  function  given  independent  risks  is  the  product  over  all  observations 
of  terms  like  (19.7).  This  likelihood  can  be  written  explicitly  if  all  relevant  functional 
forms  are  specified.  Many  issues  that  were  previously  relevant,  such  as  flexibility  of 
functional  form,  unobserved  heterogeneity,  and  so  forth,  remain  relevant  in  the  context 
of  CRM.  Instead  of  keeping  the  discussion  at  a  general  level,  we  now  consider  specific 
functional  forms.  The  proportional  hazard  specification  is  popular  in  the  literature  and 
will  be  used  here. 
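
To make the structure of this likelihood concrete, the following sketch evaluates the log-likelihood built from terms like (19.7) under one possible choice of functional form. The Weibull PH hazards, the covariate vector common to all destinations, the coding of exit routes from 0, and the function name `crm_loglik` are all assumptions made for illustration.

```python
import numpy as np

def crm_loglik(theta, tau, r, x):
    """Log-likelihood of an independent competing risks model built from (19.7).

    Assumed (hypothetical) functional form, Weibull PH hazards:
        lambda_j(t | x) = alpha_j * t**(alpha_j - 1) * exp(x'beta_j)
        Lambda_j(t | x) = t**alpha_j * exp(x'beta_j)
    theta : dict with 'alpha' of shape (m,) and 'beta' of shape (m, k)
    tau   : (n,) observed durations
    r     : (n,) exit routes coded 0, ..., m - 1
    x     : (n, k) covariates, assumed common to all destinations
    """
    alpha, beta = theta["alpha"], theta["beta"]
    xb = x @ beta.T                                    # (n, m) indices x'beta_j
    log_haz = np.log(alpha) + (alpha - 1.0) * np.log(tau)[:, None] + xb
    int_haz = tau[:, None] ** alpha * np.exp(xb)       # (n, m) integrated hazards
    rows = np.arange(len(tau))
    # log f(tau, r) = log lambda_r(tau) - sum_j Lambda_j(tau)
    return np.sum(log_haz[rows, r] - int_haz.sum(axis=1))

theta = {"alpha": np.array([1.0, 1.0]), "beta": np.zeros((2, 1))}
tau = np.array([1.0, 2.0])
r = np.array([0, 1])
x = np.ones((2, 1))
ll = crm_loglik(theta, tau, r, x)
```

Because the log density is the log hazard of the observed exit route minus the sum of all integrated hazards, the log-likelihood separates into one piece per destination whenever the destinations share no parameters.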


19.2.2.  CRM  with  Proportional  Hazards 

The  goal  here  is  to  derive  the  joint  density  of  spell  length  and  reason  for  exit,  and  this 
can  be  done  by  aggregating  the  integrated  hazard  over  reasons  for  exit. 

Consider  PH  models  of  the  form 


$$\lambda_j(t; x) = \lambda_{0j}(t) \exp[x'(t)\beta_j], \quad j = 1, \ldots, m,$$

where both the baseline hazard $\lambda_{0j}$ and $\beta_j$ are specific to type $j$ hazard, and
$t_{j1} < \cdots < t_{jk_j}$ denote the $k_j$ ordered failures of type $j$. For example, if $m = 2$, then $k_1$ refers
to the number of individuals who registered a failure of type 1, and $k_2$ to the number
of individuals who registered a failure of type 2.

The likelihood function for the Cox CRM is then

$$L(\beta_1, \ldots, \beta_m) = \prod_{j=1}^{m} \prod_{l=1}^{k_j} \frac{\exp[x_{jl}'(t_{jl})\beta_j]}{\sum_{i \in R(t_{jl})} \exp[x_i'(t_{jl})\beta_j]} = \prod_{j=1}^{m} L_j(\beta_j), \tag{19.10}$$

where

$$L_j(\beta_j) = \prod_{l=1}^{k_j} \frac{\exp[x_{jl}'(t_{jl})\beta_j]}{\sum_{i \in R(t_{jl})} \exp[x_i'(t_{jl})\beta_j]}. \tag{19.11}$$

Notice the following four features of this likelihood: (1) $L_j(\beta_j)$ is the partial likelihood
developed in Section 17.8.2. The baseline hazard function is absent, and the
asymptotic distribution results stated previously also apply. (2) $L(\beta_1, \ldots, \beta_m)$ can be
jointly maximized by maximizing each individual factor $L_j(\beta_j)$, given the independence
of risks; hence joint and separate maximizations are equivalent. Estimation and
comparison of the $\beta_j$s can be made by applying standard asymptotic techniques to
each individual factor in the $m$-term likelihood. (3) The ideas of Sections 17.7 and 17.8
can be extended directly. If a discrete-time (dummy variable) formulation is used for
each type of hazard, then the identifiable components of the hazard function can be estimated
for each type of hazard jointly with the $\beta_j$. (4) Unobserved heterogeneity can
be introduced exactly as in the single-spell, two-state proportional hazards model in
Chapter 18.
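
Feature (2) can be sketched in code: the log of each factor $L_j(\beta_j)$ is computed by treating spells that end for any other reason as censored, and the joint log-likelihood is the sum of these pieces. The sketch assumes time-constant covariates and no tied durations; the function names and the small data set are hypothetical.

```python
import numpy as np

def cause_specific_partial_loglik(beta_j, t, x, cause, j):
    """Log of L_j(beta_j) in (19.11), assuming constant covariates and no ties.

    t     : (n,) observed durations
    x     : (n, k) covariates
    cause : (n,) failure type of each spell
    j     : the failure type whose partial likelihood is evaluated
    Spells that end for a reason other than j only contribute to risk sets.
    """
    xb = x @ beta_j
    loglik = 0.0
    for l in np.flatnonzero(cause == j):       # loop over failures of type j
        at_risk = t >= t[l]                    # risk set at this failure time
        loglik += xb[l] - np.log(np.exp(xb[at_risk]).sum())
    return loglik

def crm_partial_loglik(betas, t, x, cause):
    # log L(beta_1, ..., beta_m) is the sum over j of log L_j(beta_j), as in (19.10)
    return sum(cause_specific_partial_loglik(b, t, x, cause, j)
               for j, b in betas.items())

t = np.array([1.0, 2.0, 3.0])
x = np.array([[0.0], [1.0], [0.0]])
cause = np.array([1, 2, 1])
betas = {1: np.array([0.0]), 2: np.array([0.0])}
value = crm_partial_loglik(betas, t, x, cause)
```

Since each $\beta_j$ enters only its own term of the sum, maximizing the sum over all coefficients gives the same estimates as maximizing each term separately.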


19.2.3.  Identification  of  CRM 

Cox (1962a) and Tsiatis (1975) showed that when the CRM has no covariates, the
model is not identified. More precisely, this means that any CRM with dependent risks
is observationally equivalent to a CRM with independent risks. However, Heckman
and Honoré (1989) showed that under certain assumptions a CRM that has the mixed
PH form with covariates is identified. Van den Berg (2001, pp. 3438-3441) provides an
exposition  of  the  underlying  assumptions.  Assumptions  additional  to  those  discussed 
in  Chapter  17  are  needed.  For  example,  the  covariates  must  show  “sufficient  variation” 
and  should  not  be  perfectly  collinear.  We  also  require  that  the  baseline  hazards  for 
different  risks  should  not  be  perfectly  related. 


19.2.4.  Interpretation  of  Regression  Coefficients 


In the proportional hazards type formulation of CRM, the impact of a change in a
covariate on the hazard rate for transition from a given state is analogous to that in the PH
model in Chapter 17, but direct interpretation of the regression coefficients faces a
problem similar to that discussed for the multinomial logit in Section
15.4.3.

However,  one  may  also  be  interested  in  the  impact  of  change  in  a  covariate  on  the 
probability  of  exit  via  route  r.  This  is  harder  to  calculate.  To  see  this  note  that  the 
expression  for  the  probability  of  exiting  a  given  state  via  route  r  is  given  by 


$$\Pr[r \mid \tau, x, \beta] = \frac{\lambda_r(\tau \mid x_r, \beta_r)}{\sum_{j=1}^{m} \lambda_j(\tau \mid x_j, \beta_j)}. \tag{19.12}$$

Because covariates appear in both the numerator and the denominator, and moreover
the denominator is the sum of all hazards, the sign of the partial derivative
$\partial \Pr[r \mid \tau, x, \beta]/\partial x_{rk}$ depends on all the parameters in the model. It is then not true
that the sign of $\beta_{rk}$ is also the sign of the partial. (The situation is exactly analogous
to that discussed in Chapter 15 on multinomial models.) However, the following result
is available if the competing risk is of the proportional hazard type (Thomas, 1996,
p. 31). If $\beta_{rk} > \beta_{jk}$, $\forall j \ne r$, then the sign of $\partial \Pr[r \mid \tau, x, \beta]/\partial x_{rk}$ is positive. In
words, an increase in $x_{rk}$ will increase the conditional probability of exit via route $r$
if its estimated coefficient in $\lambda_r(\cdot)$ is larger than the corresponding coefficients in all
other hazard functions.
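
A small numerical sketch may help. It assumes PH hazards with a covariate vector common to all destinations, in which case differentiating (19.12) gives an effect on the probability of route $r$ equal to that probability times the gap between the route's coefficient and the probability-weighted average coefficient. The coefficient values, baseline hazards, and function names are hypothetical.

```python
import numpy as np

def exit_probabilities(x, betas, base):
    """Pr[r | tau, x, beta] from (19.12) for PH hazards with a common x.

    betas : (m, k) coefficients, one row per exit route
    base  : (m,) baseline hazards evaluated at the observed spell length
    """
    hazards = base * np.exp(betas @ x)
    return hazards / hazards.sum()

def marginal_effects(x, betas, base, k):
    # d Pr[r] / d x_k = Pr[r] * (beta_rk - sum_j Pr[j] * beta_jk)
    p = exit_probabilities(x, betas, base)
    return p * (betas[:, k] - p @ betas[:, k])

betas = np.array([[0.2], [0.5], [0.9]])     # hypothetical coefficients, m = 3
base = np.array([1.0, 1.0, 1.0])            # hypothetical baseline hazards
effects = marginal_effects(np.array([0.0]), betas, base, k=0)
```

In this example every coefficient is positive, yet the effect on the first route's exit probability is negative because its coefficient lies below the weighted average. The third route, whose coefficient exceeds all the others, has a positive effect, in line with the sufficient condition stated above.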
19.2.5. CRM with Unobserved Heterogeneity


If the competing risks are of the proportional hazards type, then the methods of the previous
chapter can be extended to include unobserved heterogeneity. A general specification
of unobserved heterogeneity allows for a state-specific random component. Let
$v = (v_1, \ldots, v_m)$ be the vector of unobserved multiplicative heterogeneity terms that are
assumed to have a joint distribution function $G(v)$; then,

$$\begin{aligned}
f(\tau, r \mid x, \beta, v) &= \lambda_r(\tau \mid x_r, \beta_r, v_r) \exp\left[-\sum_{j=1}^{m} \Lambda_j(\tau \mid x_j, \beta_j, v_j)\right] \\
&= \lambda_r(\tau \mid x_r, \beta_r)\, v_r \exp\left[-\sum_{j=1}^{m} \Lambda_j(\tau \mid x_j, \beta_j)\, v_j\right],
\end{aligned}$$

where the second line follows from the assumption of multiplicative heterogeneity.

This is an example of a competing risks model with state-specific random effects.
The distribution marginal with respect to $v$ is obtained by integrating out $v$:

$$f(\tau, r \mid x, \beta) = \int \cdots \int \lambda_r(\tau \mid x_r, \beta_r)\, v_r \exp\left[-\sum_{j=1}^{m} \Lambda_j(\tau \mid x_j, \beta_j)\, v_j\right] dG(v),$$

which involves an $m$-fold integral.

A  manageable  case  is  one  in  which  the  m  elements  of  v  are  independent  gamma 
distributed random variables. In this case the m-fold integral decomposes into a product
of m integrals. An example is the case in which we have a Weibull-gamma
mixture  for  each  cause-specific  hazard  function.  In  this  case  the  competing  risks  are 
independent. 
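
To see the decomposition, suppose (as a normalization assumed here for illustration) that each $v_j$ is gamma distributed with unit mean and variance $\sigma_j^2$, so that $\mathrm{E}[\exp(-s v_j)] = (1 + \sigma_j^2 s)^{-1/\sigma_j^2}$ and $\mathrm{E}[v_j \exp(-s v_j)] = (1 + \sigma_j^2 s)^{-1/\sigma_j^2 - 1}$. Writing $\Lambda_j$ for $\Lambda_j(\tau \mid x_j, \beta_j)$, independence of the $v_j$ then gives

$$f(\tau, r \mid x, \beta) = \lambda_r(\tau \mid x_r, \beta_r) \left[1 + \sigma_r^2 \Lambda_r\right]^{-1/\sigma_r^2 - 1} \prod_{j \ne r} \left[1 + \sigma_j^2 \Lambda_j\right]^{-1/\sigma_j^2},$$

so each factor involves only one univariate integral, and each of these has a closed form.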

If we allow the elements of $v$ to be correlated, then we get a more interesting case in
which the competing risks are dependent. Indeed, this is a very widely used “trick” for
generating dependence among competing risks. Specifically, suppose we have a multivariate
log-normal distribution for $v$, that is, $[\ln v_1 \ldots \ln v_m]' \sim \mathcal{N}[0, \Sigma]$. This has
two consequences. First, it induces dependence in the competing risks through heterogeneity;
second, it makes computation of maximum likelihood estimates considerably
more  difficult.  The  reason  for  the  latter  is  that  the  m-fold  integral  does  not  have  an 
analytic  expression.  Consequently,  Monte  Carlo  integration  will  have  to  be  used.  If 
m  equals  two  or  three  as  in  many  applied  examples,  this  is  still  manageable  but  far 
from  trivial.  To  reduce  the  dimensionality  of  the  integral  it  may  be  useful  to  restrict 
the  structure  of  the  covariance  matrix.  For  example,  we  may  use  a  factor  structure  in 
which each term $v_j$ may be specified to be a linear function of (say) two iid random
variables,  with  unknown  weights  (factor  loadings).  For  identifiability,  normalization 
restrictions  on  the  weight  coefficients  may  be  necessary. 
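The Monte Carlo integration mentioned above can be sketched for $m = 2$. This is an illustrative sketch under our own assumptions: $(\ln \nu_1, \ln \nu_2)$ is bivariate normal with zero means, a common standard deviation `sd`, and correlation `rho`; only the survivor part $\exp[-(\nu_1 \Lambda_1 + \nu_2 \Lambda_2)]$ is averaged; and the integrated hazards are set to made-up constants.

```python
import math
import random

def simulated_survivor(cum_hazards, sd, rho, draws=100_000, seed=2024):
    """Monte Carlo estimate of E[exp(-(v1*L1 + v2*L2))], (ln v1, ln v2) bivariate normal."""
    rng = random.Random(seed)
    l1, l2 = cum_hazards
    total = 0.0
    for _ in range(draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        ln_v1 = sd * z1
        # Cholesky factor of a 2x2 correlation matrix induces correlation rho
        ln_v2 = sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        total += math.exp(-(math.exp(ln_v1) * l1 + math.exp(ln_v2) * l2))
    return total / draws

# Hypothetical integrated hazards Lambda_1 = Lambda_2 = 1
joint_dependent = simulated_survivor((1.0, 1.0), sd=1.0, rho=0.8)
joint_independent = simulated_survivor((1.0, 1.0), sd=1.0, rho=0.0)
marginal = simulated_survivor((1.0, 0.0), sd=1.0, rho=0.0)
```

With `rho = 0` the simulated joint survivor probability is close to the product of the marginals, whereas positive correlation in the heterogeneity terms raises it above that product. In an actual likelihood the same draws would be reused across parameter values, a design choice we do not show here.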


19.2.6.  CRM  with  Dependent  Competing  Risks 

The independent CRM has an important computational advantage over the model in
which dependence is induced through heterogeneity variables correlated across
competing hazards. However, the latter yields valuable additional information about the
structure of heterogeneity, such as the association parameter(s). Nonetheless, there
remains the practical issue of how restrictive a specification of correlated heterogeneity
one should choose. For exposition let us view the problem in a bivariate regression-like
setting using the following setup similar to that in (17.20):


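$$
\begin{aligned}
\ln\left[\int_0^{t_1} \lambda_1(u)\,du\right] &= -\mathbf{x}'\boldsymbol{\beta}_1 - \nu_1 + \varepsilon_1, \\
\ln\left[\int_0^{t_2} \lambda_2(u)\,du\right] &= -\mathbf{x}'\boldsymbol{\beta}_2 - \nu_2 + \varepsilon_2.
\end{aligned}
$$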


Now we could assume $\nu_1 = \nu_2 = \nu$, that is, exactly the same unobserved heterogeneity
term in both hazard models. The assumption is that the same unobserved factors
affect both spells but their impact may differ. This amounts to perfectly correlated
heterogeneity across the two hazards. Less restrictively, we could assume that, for
example, $\nu_1$ and $\nu_2$ are correlated and estimate an association parameter. We can think of
these as one- and two-factor models of heterogeneity, respectively. Whether the more
restrictive approach is empirically desirable depends on the context. For example, if
the two hazards pertain to the same individual, and we think of $\nu_1$ and $\nu_2$ as reflecting
individual-specific factors, then the one-factor model has justification. If, however, we
think of the two factors as hazard-specific, then the two-factor model is more appealing.
There is some theoretical and Monte Carlo evidence that the use of the one-factor
model when the two-factor model is the correct specification causes significant
distortions (Lindeboom and Van den Berg, 1994).


19.3.  Joint  Duration  Distributions 

In  this  section  we  consider  the  case  of  nonmutually  exclusive  or  parallel  spells  that  are 
dependent.  Survival  times  are  assumed  to  be  continuous.  The  exposition  is  at  a  general 
level  and  for  simplicity  it  is  restricted  to  the  case  where  the  spells  are  not  censored  and 
have  parametric  distributions. 

In applied work on jointly distributed survival times a natural starting point is
a particular functional form for the joint survival function or the joint density
function. Are there some “standard” functional forms available? Or is there a
general method for generating the multivariate counterparts of the models considered
in the previous chapters? We consider these issues in the following.


19.3.1.  Extending  Survival  Concepts  to  a  Multivariate  Setting 

It  is  helpful  to  begin  by  extending  the  definitions  and  concepts  of  the  two  previous 
chapters  to  the  multivariate  case. 

A multivariate survival function $S(\mathbf{t})$ is defined by

$$
\begin{aligned}
S(\mathbf{t}) &= S(t_1, \ldots, t_q) \\
&= \Pr[T_1 > t_1, \ldots, T_q > t_q],
\end{aligned}
\tag{19.13}
$$

where $T_1, \ldots, T_q$ are $q$ survival times with univariate survival functions $S_j(t_j)$. By
definition,

$$
\begin{aligned}
S_j(t_j) &= \Pr[T_j > t_j] \\
&= \Pr[T_1 > 0, \ldots, T_j > t_j, \ldots, T_q > 0] \\
&= S(0, \ldots, t_j, \ldots, 0).
\end{aligned}
\tag{19.14}
$$

Unlike the case of the univariate survival function,

$$
S(t_1, \ldots, t_q) \neq 1 - F(t_1, \ldots, t_q).
$$

For example, $S(t_1, t_2) = 1 - F_1(t_1) - F_2(t_2) + F(t_1, t_2)$, where $F_1$ and $F_2$ denote the marginal cdfs.
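A small numerical check makes the distinction concrete. The sketch below assumes two independent exponential durations with made-up rates `a` and `b`, chosen only because the joint cdf and joint survivor function are then available in closed form; the function names are ours.

```python
import math

def joint_cdf(t1, t2, a, b):
    """Joint cdf of two independent exponential durations with rates a and b."""
    return (1.0 - math.exp(-a * t1)) * (1.0 - math.exp(-b * t2))

def joint_survivor_from_cdf(t1, t2, a, b):
    """Bivariate identity S(t1, t2) = 1 - F1(t1) - F2(t2) + F(t1, t2)."""
    f1 = 1.0 - math.exp(-a * t1)
    f2 = 1.0 - math.exp(-b * t2)
    return 1.0 - f1 - f2 + joint_cdf(t1, t2, a, b)

t1, t2, a, b = 0.7, 1.4, 0.5, 1.2
via_identity = joint_survivor_from_cdf(t1, t2, a, b)
direct = math.exp(-(a * t1 + b * t2))   # Pr[T1 > t1, T2 > t2] under independence
naive = 1.0 - joint_cdf(t1, t2, a, b)   # the univariate rule, wrongly applied
```

Here `via_identity` reproduces `direct`, whereas `naive` is much larger because $1 - F(t_1, t_2)$ is the probability that at least one duration exceeds its threshold.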

The joint density of $(t_1, \ldots, t_q)$ is denoted by $f(t_1, \ldots, t_q)$; if $F(t_1, \ldots, t_q)$ is
continuous then

$$
f(t_1, \ldots, t_q) = (-1)^q \frac{\partial^q S(t_1, \ldots, t_q)}{\partial t_1 \cdots \partial t_q}.
\tag{19.15}
$$


Analogous to the univariate case the joint hazard function is $\lambda(t_1, \ldots, t_q)$ and is
defined by

$$
\lambda(t_1, \ldots, t_q) = \frac{f(t_1, \ldots, t_q)}{S(t_1, \ldots, t_q)}.
\tag{19.16}
$$

The joint integrated hazard $\Lambda(t_1, \ldots, t_q)$ is the $q$-fold integral of $\lambda(t_1, \ldots, t_q)$.
However, there is no simple relationship between $\Lambda(t_1, \ldots, t_q)$ and $S(t_1, \ldots, t_q)$ analogous
to the univariate case.

Given these definitions, is it possible to derive joint survival functions? Clayton and
Cuzick (1985) consider a bivariate model that illustrates the definitions given here. The
starting point in their analysis is an assumption about the “cross-hazard ratio”
function, defined as a function of two conditional hazard functions of $t_1$, one given $T_2 = t_2$ and
the other given $T_2 > t_2$. This leads to a nonlinear, second-order partial differential equation whose
solution generates a joint survival function in which the cross-hazard ratio function
plays an important role. We refer to the original sources for detail but note that this
approach requires assumptions that may be difficult to extend to dimensions higher
than two.


19.3.2.  Bivariate  Distributions  Based  on  Marginals 

This  section  briefly  reviews  some  approaches  for  generating  bivariate  duration  models. 
The  approach  builds  on  assumptions  about  marginal  survival  functions.  This  may  be 
useful  if  the  researcher  has  a  good  feel  for  such  marginal  distributions  and  wants  to  use 
them  as  building  blocks.  Of  course,  choice  of  the  building  blocks  places  restrictions 
on  the  form  of  the  resulting  joint  distribution. 

One approach, which is due to Marshall and Olkin (1990), considers a model with
multiplicative unobserved heterogeneity in the marginal distributions of both failure
times in the following way. Let $f_i(t_i \mid \mathbf{x}_i, \nu)$, $i = 1, 2$, denote the marginal distributions
of $t_1, t_2$, given covariates $\mathbf{x}_1, \mathbf{x}_2$; here $\nu$ is the common unobserved heterogeneity term
in the two marginals and is the source of association between the two hazards. In
survival analysis such a model might be referred to as a “shared frailty” model; $\nu$ is the
(only) source of correlation between $t_1$ and $t_2$. Assume that $\nu$, $\nu > 0$, has a probability
distribution with density $g(\nu)$. The bivariate distribution of $t_1, t_2$ is formally defined as

$$
f(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^{\infty} f_1(t_1 \mid \mathbf{x}_1, \nu)\, f_2(t_2 \mid \mathbf{x}_2, \nu)\, g(\nu)\, d\nu,
\tag{19.17}
$$

where distribution parameters are suppressed for notational simplicity.

This  bivariate  distribution  generated  as  a  mixture  may  or  may  not  have  a  closed- 
form  solution,  so  without  a  specific  parametric  specification  one  cannot  say  whether 
the  result  will  be  computationally  convenient  to  use.  It  is  also  the  case  that  the  resulting 
bivariate  distribution  will  restrict  the  correlation  between  t\  and  t2  to  be  positive.  In 
some  cases  this  may  not  be  desirable. 

This general approach, applicable to any type of data, can be specialized to the
present case by replacing the marginal distributions with marginal survivor functions
and deriving the joint survivor function by integrating out the variable $\nu$; thus,

$$
S(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^{\infty} S_1(t_1 \mid \mathbf{x}_1, \nu)\, S_2(t_2 \mid \mathbf{x}_2, \nu)\, g(\nu)\, d\nu.
\tag{19.18}
$$

An example of the application of this idea is provided by Clayton and Cuzick (1985),
who use such a formulation to obtain a bivariate survivor function under the
assumption of marginal proportional hazards with gamma heterogeneity.
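The integral in (19.18) can be evaluated numerically and compared with a closed form in the gamma case. The sketch below rests on our own simplifying assumptions: conditional survivor functions of the form $S_i(t_i \mid \nu) = \exp(-\nu \Lambda_i)$, gamma heterogeneity normalized to mean one and variance `sigma2`, and made-up values `lam1` and `lam2` for the integrated hazards. It also checks that the resulting joint survivor function has the Clayton copula form applied to the implied marginal survivor functions.

```python
import math

def gamma_density(v, sigma2):
    """Gamma density with mean 1 and variance sigma2 (shape 1/sigma2, scale sigma2)."""
    shape, scale = 1.0 / sigma2, sigma2
    return v ** (shape - 1.0) * math.exp(-v / scale) / (math.gamma(shape) * scale ** shape)

def mixed_survivor(cum_hazard_sum, sigma2, upper=60.0, steps=60_000):
    """Midpoint-rule approximation to the integral of exp(-v * L) g(v) over v > 0."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        total += math.exp(-v * cum_hazard_sum) * gamma_density(v, sigma2)
    return total * h

def clayton(u, v, theta):
    """Clayton copula, here applied to marginal survivor functions."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

sigma2 = 0.5
lam1, lam2 = 0.8, 1.3   # hypothetical integrated hazards, including exp(x'beta)

joint_numeric = mixed_survivor(lam1 + lam2, sigma2)
joint_closed = (1.0 + sigma2 * (lam1 + lam2)) ** (-1.0 / sigma2)

# Marginal survivor functions implied by the same mixing distribution
s1 = (1.0 + sigma2 * lam1) ** (-1.0 / sigma2)
s2 = (1.0 + sigma2 * lam2) ** (-1.0 / sigma2)
joint_copula = clayton(s1, s2, sigma2)
```

Under these assumptions the three quantities coincide up to numerical error, with the heterogeneity variance playing the role of the Clayton dependence parameter.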

As illustrated, this approach for generating a bivariate survivor model is somewhat
restrictive. One source of restriction is the assumption of one-factor unobserved
heterogeneity. In principle this restriction is easy to remove. For example, we could replace $\nu$
by $(\nu_1, \nu_2)$, $\nu_1 > 0$, $\nu_2 > 0$, which represents a vector of two correlated elements, one
specific to each survivor function, with a joint probability distribution $g(\nu_1, \nu_2)$. Then

$$
S(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^{\infty}\!\!\int_0^{\infty} S_1(t_1 \mid \mathbf{x}_1, \nu_1)\, S_2(t_2 \mid \mathbf{x}_2, \nu_2)\, g(\nu_1, \nu_2)\, d\nu_1\, d\nu_2.
\tag{19.19}
$$

For concreteness suppose that

$$
\begin{aligned}
\nu_1 &= \omega_{11}\varepsilon_1 + \omega_{12}\varepsilon_2, \\
\nu_2 &= \omega_{21}\varepsilon_1 + \omega_{22}\varepsilon_2,
\end{aligned}
$$

with random components $\varepsilon_j$, $j = 1, 2$,
where $\{\omega_{ij}, i, j = 1, 2\}$ are unknown parameters, frequently referred to as “factor
loadings.” This says that the heterogeneity components $(\nu_1, \nu_2)$ are correlated linear
combinations of the iid random components $\varepsilon_1$ and $\varepsilon_2$ if the factor loadings are not zero.
Other popular assumptions in empirical work are (i) that $(\ln \varepsilon_1, \ln \varepsilon_2)$ have a
standard bivariate normal distribution or (ii) that $\nu_1, \nu_2$ have a discrete (finite-mixture)
distribution. So the model (19.19) has a bivariate mixture form. Additional
identifying restrictions (e.g., the normalization $\omega_{11} = 1$) are also necessary. The Pearson
correlation coefficient between $\nu_1$ and $\nu_2$, $\mathrm{Cov}[\nu_1, \nu_2] / (\mathrm{V}[\nu_1]\mathrm{V}[\nu_2])^{1/2}$, depends on
$\{\omega_{ij}, \sigma_j, i, j = 1, 2\}$, and it is straightforward to verify that here this quantity would
not have the usual $-1$ and $+1$ as the lower and upper bounds. (Also note that the
corresponding association parameter for failure times is $\mathrm{Cov}[t_1, t_2] / (\mathrm{V}[t_1]\mathrm{V}[t_2])^{1/2}$,
which is distinct from that given.) Van den Berg (1997) derives sharp bounds on
$\mathrm{Cor}[t_1, t_2 \mid \mathbf{x}]$, specifically $-1/3 < \mathrm{Cor}[t_1, t_2 \mid \mathbf{x}] < 1/2$, for a mixed proportional
hazard model with constant baseline hazard, and shows that these bounds do not depend
on the covariates $\mathbf{x}$ nor on the distribution of heterogeneity. If the baseline hazard is not
constant, the correlation bounds also depend on it.
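To see how the factor loadings determine the correlation between $\nu_1$ and $\nu_2$, consider the following minimal sketch. It assumes, as our own reading of the setup, that $\varepsilon_1$ and $\varepsilon_2$ are independent with variances `var1` and `var2`; the function name and numerical values are hypothetical.

```python
import math

def heterogeneity_correlation(w11, w12, w21, w22, var1, var2):
    """Pearson correlation of v1 = w11*e1 + w12*e2 and v2 = w21*e1 + w22*e2,
    assuming e1 and e2 are independent with variances var1 and var2."""
    cov = w11 * w21 * var1 + w12 * w22 * var2
    v1 = w11 ** 2 * var1 + w12 ** 2 * var2
    v2 = w21 ** 2 * var1 + w22 ** 2 * var2
    return cov / math.sqrt(v1 * v2)

# No cross loadings: the two heterogeneity terms are uncorrelated
uncorrelated = heterogeneity_correlation(1.0, 0.0, 0.0, 1.0, 1.0, 1.0)

# Symmetric cross loadings of 0.5 with unit variances
moderate = heterogeneity_correlation(1.0, 0.5, 0.5, 1.0, 1.0, 1.0)

# Both terms load only on the first factor: a one-factor model
one_factor = heterogeneity_correlation(1.0, 0.0, 1.0, 0.0, 1.0, 1.0)
```

If the loadings are restricted to be nonnegative, for instance to keep $\nu_1$ and $\nu_2$ positive when the $\varepsilon_j$ are positive, the covariance term cannot be negative, so the correlation lies between 0 and 1 rather than between $-1$ and $+1$.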

The factor loading specification has computational advantages relative to that in
which the unobserved heterogeneity components enter in an unrestricted manner.
Although a one-factor model is likely to be too restrictive, an unrestricted model gives
rise to a potentially high dimensional integral. From a computational viewpoint, the
resulting distribution may or may not be easy to handle, depending in part on whether or
not the integration produces a closed-form expression for the joint survivor function.
If it does not, a simulation-based approach will be needed for estimation. At present
estimation of such a model would require going beyond standard packages.

The factor loading specification does place restrictions on the model (Van den Berg,
2001; Lindeboom and Van den Berg, 1994). For example, if one of the marginal
models does not indicate the presence of unobserved heterogeneity, then $\mathrm{Cov}[\nu_1, \nu_2]$ must
be zero; if $\mathrm{V}[\nu_1] > 0$ and $\mathrm{V}[\nu_2] > 0$, then $\mathrm{Cov}[\nu_1, \nu_2] \neq 0$. Hence if $\mathrm{Cov}[\nu_1, \nu_2] = 0$,
then one of the marginals has no unobserved heterogeneity.

From an applied perspective an attractive multivariate survivor function should be
flexible. The approach just outlined has some limitations, and alternative approaches
have been proposed. One that holds some promise is the
use of copula functions. Hougaard (2000, pp. 435-437) provides an introduction in the
context of survival analysis.


19.3.3.  The  Copula  Approach 

Copulas,  originally  introduced  by  Sklar  in  a  1959  article  in  French  (see  also  Sklar, 
1973),  have  been  suggested  as  a  useful  method  for  deriving  joint  distributions  given  the 
marginals,  especially  when  one  wants  to  work  with  nonnormal  distributions.  Although 
we  introduce  this  idea  in  the  context  of  joint  survival  models,  where  it  has  found  ready 
applications,  it  can  also  be  used  to  study  the  joint  distributions  of  any  set  of  discrete, 
continuous,  or  mixed  discrete/continuous  variables. 

The  approaches  already  discussed  (e.g.,  the  Marshall-Olkin  method)  generate 
dependence  between  variables  through  unobserved  heterogeneity  components.  This 
seems  attractive  in  most  applications  because  it  is  impossible  for  observed  covariates 
to  cover  all  relevant  aspects  of  an  economic  event. 


Properties  of  Copulas 

To define a copula we begin with possibly dependent uniform random variables
$U_1, \ldots, U_q$ on the $[0, 1]$ interval. The dependence relationship is described through
their joint cdf

$$
C(u_1, \ldots, u_q) = \Pr[U_1 \le u_1, \ldots, U_q \le u_q],
\tag{19.20}
$$

where the function $C(\cdot)$ is the copula, and $u_j$ is a particular realization of $U_j$, $j = 1, \ldots, q$.

The right-hand side is the joint cdf, $F(\cdot)$, and the $q$ arguments of the copula can be
replaced by $q$ marginal cdfs $F_1(\cdot), \ldots, F_q(\cdot)$. That is,

$$
C(F_1(u_1), \ldots, F_q(u_q)) = F(u_1, \ldots, u_q)
$$

defines a joint cdf. With a copula-based construction of a joint cdf we select a set of
marginals and combine them to generate a joint cdf. A given copula is a functional
form for combining selected marginals and different choices of $C(\cdot)$ lead to different
joint cdfs. Sklar’s Theorem established that any multivariate distribution function
can be written in the form (19.20) and that given continuous marginals the copula
representation is unique.

As specialized to a multivariate survival function, Sklar’s Theorem says that a
$q$-dimensional multivariate survival function $S(t_1, \ldots, t_q)$ has a corresponding copula
representation $C(S_1(t_1), \ldots, S_q(t_q))$.

Consider the case $q = 2$. Then,

$$
\begin{aligned}
F(t_1, t_2) &= \Pr[T_1 \le t_1, T_2 \le t_2] \\
&= 1 - \Pr[T_1 > t_1] - \Pr[T_2 > t_2] + \Pr[T_1 > t_1, T_2 > t_2]
\end{aligned}
$$

and

$$
\begin{aligned}
S(t_1, t_2) &= \Pr[T_1 > t_1, T_2 > t_2] \\
&= 1 - F_1(t_1) - F_2(t_2) + F(t_1, t_2) \\
&= S_1(t_1) + S_2(t_2) - 1 + C(1 - S_1(t_1), 1 - S_2(t_2)),
\end{aligned}
$$

where $C(\cdot)$ is called the survival copula. Notice that $S(t_1, t_2)$ is now a function of
the marginal survival functions only.

Copulas have a certain symmetry property that allows one to work with copulas
or survival copulas (Nelsen, 1999). Joe (1997) defines a bivariate copula associated
with $F(\cdot)$, denoted by $C(u, v)$, as a two-dimensional probability distribution function
defined on the unit square $[0, 1]^2$, with univariate marginals uniform on $[0, 1]$. For all
$(u, v) \in [0, 1]^2$, $C(u, 0) = C(0, v) = 0$, $C(u, 1) = u$, and $C(1, v) = v$. In the context
of survival copulas we replace $u$ by the first marginal survivor function $S_1(t_1)$ and $v$ by the
second marginal survivor function $S_2(t_2)$. In this notation Sklar’s Theorem states that
there exists a copula function $C$ such that

$$
F(u, v) = C(F_U(u), F_V(v)),
\tag{19.21}
$$

where $F(u, v) = \Pr[U \le u, V \le v]$ is a bivariate distribution function of random
variables $U$ and $V$, and $F_U(u)$ and $F_V(v)$ denote the marginal distribution functions.

If $F$ is continuous, and if the univariate marginals have corresponding quantile
functions $F_U^{-1}$ and $F_V^{-1}$, then the unique copula in Equation (19.21) can be expressed as

$$
C(u_1, u_2) = F(F_U^{-1}(u_1), F_V^{-1}(u_2)).
$$
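The quantile-function construction can be illustrated with a joint cdf that has a closed form. The sketch below uses Gumbel's bivariate logistic distribution, $F(x, y) = 1/(1 + e^{-x} + e^{-y})$, purely as a convenient example of our own choosing; its marginals are standard logistic, so the quantile function is $\ln[p/(1 - p)]$. The implied copula turns out to coincide with the Clayton form with dependence parameter equal to one.

```python
import math

def joint_cdf(x, y):
    """Gumbel's bivariate logistic cdf, used only as a convenient closed form."""
    return 1.0 / (1.0 + math.exp(-x) + math.exp(-y))

def logistic_quantile(p):
    """Inverse of the standard logistic cdf 1 / (1 + exp(-x))."""
    return math.log(p / (1.0 - p))

def implied_copula(u1, u2):
    """C(u1, u2) = F(F_U^{-1}(u1), F_V^{-1}(u2)) for the joint cdf above."""
    return joint_cdf(logistic_quantile(u1), logistic_quantile(u2))

def clayton(u1, u2, theta):
    """Clayton copula, included for comparison."""
    return (u1 ** -theta + u2 ** -theta - 1.0) ** (-1.0 / theta)

value = implied_copula(0.3, 0.6)
# Setting the second argument close to 1 should return the first argument,
# reflecting the uniform marginals of a copula.
margin_check = implied_copula(0.3, 1.0 - 1e-12)
```

The arguments of `implied_copula` are probabilities, not durations, which is what allows the same function to be combined with any other pair of continuous marginals.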

The copula approach involves specifying marginal distributions of each random
variable along with a function (copula) that binds them together. The copula function
can be parameterized to include measures of dependence between the marginal
distributions. If no dependence is detected, the two marginals are independent, and
estimation can be performed on each variable separately. However, if dependence is present,
improved estimates may be obtained by recovering a joint distribution by way of a
copula function. Since a copula can capture dependence structures regardless of the
form of the margins, a copula approach to modeling related variables is potentially
very useful to econometricians. Fréchet bounds make it possible to study the extent
of dependence permitted by any copula. Despite apparent differences we see that the
mixture approach of Section 19.3.2 for deriving the bivariate survival function leading
to (19.19) is fundamentally similar to that based on the copula approach as both begin
with marginals.

We now consider an example with $q$ durations $(T_1, \ldots, T_q)$ that are conditionally
independent given common neglected unobserved heterogeneity $\nu$; covariates are
excluded for simplicity. Then the conditional joint survivor function is

$$
\begin{aligned}
\Pr[T_1 > t_1, \ldots, T_q > t_q \mid \nu] &= \Pr[T_1 > t_1 \mid \nu] \times \cdots \times \Pr[T_q > t_q \mid \nu] \\
&= S_1(t_1 \mid \nu) \cdots S_q(t_q \mid \nu)
\end{aligned}
$$

and the multivariate survival function is defined as

$$
\Pr[T_1 > t_1, \ldots, T_q > t_q] = \mathrm{E}_{\nu}\left[S_1(t_1 \mid \nu) \cdots S_q(t_q \mid \nu)\right].
\tag{19.22}
$$


Measuring  Dependence 

The functional form of a copula does not depend on the form of the univariate
margins. Copulas are usually specified with parameters that generate a measure of the
dependence between the univariate margins. Usually dependence is parameterized as
a scalar measure. Here we concentrate on bivariate copulas for simplicity.

The copula representation for discrete random variables is not necessarily unique
(Joe, 1997, p. 14). This is not a major problem in practical application where the
concern is to approximate the unknown joint distribution. The key modeling issue is to
choose a sufficiently flexible parametric form for the copula function.

The dependence parameters from copulas can be difficult to interpret because they
are not necessarily in the $[0, 1]$ interval. Therefore, it is customary to convert the
dependence parameter to a familiar measure of association such as Kendall’s tau or
Spearman’s rho; see Joe (1997). Schweizer and Wolff (1981) showed that Spearman’s
correlation coefficient can be expressed solely in terms of the copula function;
thus,

$$
\rho(t_1, t_2) = 12 \int_0^1\!\!\int_0^1 \{C(u, v) - uv\}\, du\, dv.
$$
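The Spearman formula lends itself to direct numerical evaluation. The following sketch applies a midpoint rule on the unit square to the FGM copula with a made-up dependence parameter of 0.6; for this copula the integral can also be done by hand and equals the dependence parameter divided by three. The function names and grid size are our own choices.

```python
def fgm(u, v, theta):
    """Farlie-Gumbel-Morgenstern copula."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))

def spearman_rho(copula, grid=400):
    """Midpoint-rule approximation to 12 times the double integral of C(u, v) - u*v."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) * h
        for j in range(grid):
            v = (j + 0.5) * h
            total += copula(u, v) - u * v
    return 12.0 * total * h * h

rho_fgm = spearman_rho(lambda u, v: fgm(u, v, 0.6))
rho_product = spearman_rho(lambda u, v: u * v)   # independence benchmark
```

Because the FGM parameter is restricted to lie between $-1$ and $+1$, this calculation also shows that the FGM copula can represent only modest dependence, with Spearman's rho between $-1/3$ and $1/3$.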


Table 19.1. Some Standard Copula Functions

Copula Type   Function $C(u, v)$                                                      $\theta$-Domain
Product       $uv$                                                                    n.a.^a
FGM^b         $uv(1 + \theta(1 - u)(1 - v))$                                          $-1 \le \theta \le +1$
Normal^c      $\Phi[\Phi^{-1}(u), \Phi^{-1}(v); \theta]$                              $-1 < \theta < +1$
Clayton       $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$                           $\theta \in (0, \infty)$
Frank         $-\theta^{-1} \ln\{[\eta - (1 - e^{-\theta u})(1 - e^{-\theta v})]/\eta\}$,   $\theta \in (-\infty, \infty)$
              $\eta = 1 - e^{-\theta}$

a n.a., not applicable.
b Farlie-Gumbel-Morgenstern copula.
c $\Phi$ denotes bivariate normal cdf.

Consider any bivariate joint cdf $F(t_1, t_2)$ with univariate marginal cdfs $F_1(t_1)$ and
$F_2(t_2)$. By definition, $0 \le F_1(t_1), F_2(t_2) \le 1$, because each marginal distribution takes
a value in the range $[0, 1]$. The joint cdf is bounded below and above by the Fréchet
lower and upper bounds, $F^-$ and $F^+$, defined as

$$
\begin{aligned}
F(t_1, t_2) &\ge F^-(t_1, t_2) = \max[F_1(t_1) + F_2(t_2) - 1, 0], \\
F(t_1, t_2) &\le F^+(t_1, t_2) = \min[F_1(t_1), F_2(t_2)].
\end{aligned}
$$

Since copulas are joint cdfs, they are also subject to the Fréchet bounds. Knowledge
of Fréchet bounds is important in selecting an appropriate copula. Every copula places
bounds on permissible values for its dependence parameter $\theta$. A desirable feature of
a bivariate copula is that as $\theta$ approaches the lower (upper) bound of its permissible
range, the copula approaches the Fréchet lower (upper) bound. However, the parametric
form of a copula may impose restrictions such that one or both Fréchet bounds are
not included in the permissible range. Therefore, a particular copula may be a better
choice for one data set than for another.
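The role of the Fréchet bounds can be illustrated with the Clayton copula of Table 19.1. The sketch below, with grid points and parameter values of our own choosing, checks that the copula stays within the bounds and evaluates it near the two ends of its permissible range.

```python
def frechet_lower(u, v):
    """Frechet lower bound max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)

def frechet_upper(u, v):
    """Frechet upper bound min(u, v)."""
    return min(u, v)

def clayton(u, v, theta):
    """Clayton copula for theta in (0, infinity)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Interior grid of the unit square
points = [(i / 10.0, j / 10.0) for i in range(1, 10) for j in range(1, 10)]

within_bounds = all(
    frechet_lower(u, v) - 1e-12 <= clayton(u, v, theta) <= frechet_upper(u, v) + 1e-12
    for theta in (0.5, 2.0, 10.0)
    for (u, v) in points
)

near_independence = clayton(0.3, 0.6, 1e-4)   # theta close to its lower limit
near_upper_bound = clayton(0.3, 0.6, 50.0)    # large theta
```

In this example the Clayton copula approaches the upper bound $\min(u, v)$ for large $\theta$ but only the product copula $uv$ as $\theta$ approaches zero, so with $\theta > 0$ it cannot represent negative dependence.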


Examples 

Table 19.1 gives examples of some bivariate copula functions that have been used in
the literature. Joe (1997) discusses the properties of these copulas.

The Normal and the Frank copulas include both Fréchet bounds in their permissible
ranges. The Clayton copula belongs to the Archimedean family, with the
representation $C(u, v) = \phi(\phi^{-1}(1 - u) + \phi^{-1}(1 - v))$; see Smith (2003).

Suppose we want to choose the Clayton copula to model the bivariate survival times
$(t_1, t_2)$. Then the bivariate distribution, expressed in terms of marginal survival models
$S_1(t_1)$ and $S_2(t_2)$, will be

$$
\left(S_1(t_1)^{-\theta} + S_2(t_2)^{-\theta} - 1\right)^{-1/\theta}.
$$

We assume that the marginal survival functions are specified up to unknown
parameters. As before these marginal survival functions can be written to capture dependence
on covariates and unobserved heterogeneity. For example, these could be based on the
proportional hazards model. For estimation we can apply maximum likelihood based
on the resulting bivariate copula.

This approach is not without limitations. Two in particular are noteworthy. First,
extension to three or more dimensions is not trivial. Second, one needs not only to
choose a particular functional form for the copula but also to be aware of its potential
restrictiveness in capturing dependence for a given data set. For example, only positive
correlation may be supported.


Likelihoods  Derived  from  Copulas 

To fit a model derived from a copula (defined in terms of the cdfs) the first step is
to select a copula and the second is to derive the likelihood (defined in terms of the
pdfs) from it. Having chosen a copula, consider the derivation of the likelihood for the
special case of a bivariate model with uncensored failure times $(t_1, t_2)$. Define $f_j(t_j) =
\partial F_j(t_j)/\partial t_j$ and $C_j(F_1, F_2) = \partial C(F_1, F_2)/\partial F_j$ for $j = 1, 2$, and define
$C_{12}(F_1, F_2) = \partial^2 C(F_1, F_2)/\partial F_1 \partial F_2$.
Then the probability density

$$
f(t_1, t_2) = f_1(t_1)\, f_2(t_2)\, C_{12}(F_1(t_1), F_2(t_2)),
\tag{19.23}
$$

where $f(t_1, t_2) = \partial^2 F(t_1, t_2)/\partial t_1 \partial t_2$, is used to construct the likelihood function. If
censored observations are present in the data, the likelihood must be appropriately
modified (Frees and Valdez, 1998, pp. 15-16; Georges et al., 2001).
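The construction in (19.23) can be sketched for one specific choice. The example below assumes a Clayton copula applied to the marginal cdfs and exponential marginals with rates `rate1` and `rate2`; these choices, the function names, and the three data points are ours and serve only as an illustration. A finite-difference approximation is included to check the analytic cross-partial derivative $C_{12}$.

```python
import math

def clayton_cdf(u, v, theta):
    """Clayton copula C(u, v)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def clayton_density(u, v, theta):
    """Cross-partial derivative C12(u, v) of the Clayton copula."""
    core = u ** -theta + v ** -theta - 1.0
    return (1.0 + theta) * (u * v) ** (-theta - 1.0) * core ** (-2.0 - 1.0 / theta)

def log_likelihood(data, rate1, rate2, theta):
    """Sum of ln f(t1, t2) = ln f1 + ln f2 + ln C12(F1, F2), uncensored data only."""
    total = 0.0
    for t1, t2 in data:
        F1 = 1.0 - math.exp(-rate1 * t1)
        F2 = 1.0 - math.exp(-rate2 * t2)
        f1 = rate1 * math.exp(-rate1 * t1)
        f2 = rate2 * math.exp(-rate2 * t2)
        total += math.log(f1) + math.log(f2) + math.log(clayton_density(F1, F2, theta))
    return total

def finite_difference_c12(u, v, theta, h=1e-4):
    """Central-difference approximation to the cross-partial derivative of C."""
    return (clayton_cdf(u + h, v + h, theta) - clayton_cdf(u + h, v - h, theta)
            - clayton_cdf(u - h, v + h, theta) + clayton_cdf(u - h, v - h, theta)) / (4.0 * h * h)

data = [(0.5, 0.8), (1.2, 0.9), (2.0, 2.5)]   # made-up uncensored durations
value = log_likelihood(data, rate1=1.0, rate2=0.8, theta=1.5)
```

In practice `log_likelihood` would be passed to a numerical optimizer over the marginal and dependence parameters, and censored observations would need the modified contributions discussed in the references cited above.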

Using  different  copulas  generates  nonnested  models.  As  in  other  similar  instances, 
penalized  log-likelihood  values  can  be  used  to  choose  among  them. 


19.4.  Multiple  Spells 

A distinction between parallel states and recurrent states, introduced early in this
chapter, is helpful. Parallel states involve parallel events such as being employed and having
health insurance; recurrent states involve sequential events such as the first birth, the
second birth, and so forth. The term multiple spells refers to the durations between
recurrent spells of the same event. Joint modeling of such data has similarities with joint
modeling of parallel states as both involve multivariate concepts, but there are also
important differences because sequential events may generate dynamic dependence in
hazards.

Consider some examples of recurrent events. Individuals in the labor market
may experience a succession of transitions between employment and unemployment.
Young workers, for example, may record a succession of spells of unemployment.
Newman and McCulloch (1984) consider the timing of births within a hazard
framework. If one wants to model the hazard rate for each birth in a series of births,
consideration must be given to the correlation between interbirth durations. Trivedi and
Alexander (1989) analyze multiple spells of youth unemployment in Australia. In the
literature on fertility, the duration between successive births is of interest (Heckman,
Hotz, and Walker, 1985). Mealli and Pudney (1996) analyze the positive association
between the duration in employment and pensionable status using data from a
retirement survey in the United Kingdom. Engle and Russell (1998) study the time series
of durations between successive transactions of a particular stock traded on the stock
market. Stevens (1999) analyzes the persistence of poverty over individuals’ lifetimes
taking account of multiple spells of poverty.

The aforementioned examples have several noteworthy features. First, whether the
hazard rate of an event depends on the occurrence of a previous event is an
important modeling issue. Second, the form of dependence is of interest. The duration
of a previous spell may enter as a covariate in determining the hazard of a later event;
the occurrence of a previous event may affect the baseline hazard for a later spell; and,
finally, unobserved heterogeneity may show serial dependence. Each of these raises an
important modeling issue.

Multiple spells generate longitudinal or panel data that can potentially help to
resolve the important identification issue concerning the influence of dynamic
dependence (“the hand of past”) relative to that of heterogeneity in the hazard function.
Under some assumptions multiple observations make it easier to control for heterogeneity
and to make inferences about dynamic dependence.

In general, survival models with unobserved heterogeneity and dependence between
spells can be expected to be difficult to estimate. However, multiple-spell data create
opportunities to study issues that can be studied only if panel data are available.
Occurrence dependence, lagged duration dependence, and serially correlated unobserved
heterogeneity are examples. Both lagged duration and occurrence dependence refer to
dependence of the termination probability of the spell in progress on either the number
or the duration of previous spells. Given such dependence, it is not appropriate to study
spells individually, ignoring their interdependence.

In considering the choice of a suitable econometric framework for multiple spells,
one possibility is to model dependence using joint survival functions, as discussed
in the preceding section. This approach takes care of the multivariate nature of the
data. A second possibility is to use the panel data framework with the time subscript
replaced by the spell subscript, without ignoring the possibility that calendar time still
may have relevance. Spell dependence introduces issues that will be discussed under
the topic of dynamic panel models in Sections 22.5 and 23.6. In both these cases an
important difference arises from the possibility of censoring because of panel attrition
or because the most recent spell is incomplete.


19.4.1.  A  Model  with  Two  Spells 

A proportional hazards model with two spells can illustrate a number of features of
multiple-spell models. In econometrics such models have been analyzed by Honoré
(1993) and Horowitz and Lee (2003).

Honoré (1993) considers a proportional hazards model of the form

$$
\lambda_s(t \mid \mathbf{x}, \nu) = \lambda_{0,s}(t)\, \phi(\mathbf{x}, \boldsymbol{\beta})\, \nu, \qquad s = 1, 2.
\tag{19.24}
$$

Note that in this specification the baseline hazard is spell-specific, but the heterogeneity
component, which enters multiplicatively (a key assumption), is not; that is, $\nu$
represents the fixed or permanent characteristics of an individual, and hence we have a fixed
effects model. Under conditions similar to those for the mixed PH discussed in
Chapter 18, he shows that the model is identified. He also shows that neither the assumptions
about the distribution of $\nu$ nor the presence of the covariates is essential for
identification.
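A small simulation shows how a common $\nu$ in (19.24) ties the two spells together. The sketch rests on assumptions of our own: constant spell-specific baseline hazards `base1` and `base2`, no covariates (so $\phi$ is set to one), and gamma distributed $\nu$ with mean one. Correlation is computed for log durations, for which it does not depend on the baseline levels.

```python
import math
import random

def simulate_two_spells(n, base1, base2, gamma_shape, seed=7):
    """Draw (t1, t2) with hazards base_s * v, where v is shared across the two spells."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        v = rng.gammavariate(gamma_shape, 1.0 / gamma_shape)  # mean-one heterogeneity
        t1 = rng.expovariate(base1 * v)   # exponential duration given v, spell 1
        t2 = rng.expovariate(base2 * v)   # exponential duration given v, spell 2
        pairs.append((t1, t2))
    return pairs

def correlation(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

pairs = simulate_two_spells(20_000, base1=0.5, base2=2.0, gamma_shape=2.0)
log_t1 = [math.log(t1) for t1, _ in pairs]
log_t2 = [math.log(t2) for _, t2 in pairs]
corr_logs = correlation(log_t1, log_t2)
```

With a gamma shape of 2 the correlation of the log durations should be roughly 0.28 in large samples, well below one even though the heterogeneity term is identical across spells, because each spell also has its own idiosyncratic exponential variation.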

In a second model Honoré considers spell-specific multiplicative heterogeneity
components $\nu_1$ and $\nu_2$, with a joint bivariate pdf $g(\nu_1, \nu_2)$. The correlation between $\nu_1$
and $\nu_2$ could reflect serially correlated heterogeneity. This is a random effects model.
The joint survival function $S(t_1, t_2 \mid \mathbf{x})$ is derived by the bivariate mixing approach as
shown in (19.19) using the mixing distribution $g(\nu_1, \nu_2)$. If the marginal survival
functions are identified, then the joint survival function is also identified. The identification
conditions are essentially those for identifiability of the PH model.

Honoré also considers the lagged duration dependence specification of the
second-spell model under the assumption that the duration of the first spell, denoted
$t_1$, enters the hazard for the second spell multiplicatively. He provides sufficient
conditions for identifiability of the parameters in the second-spell conditional model, given
covariates and $t_1$. These conditions are not discussed here. However, under these
conditions, a multiple-spells version of the proportional hazards model has the form

$$
\begin{aligned}
\lambda_1(t_1 \mid \mathbf{x}_1, \nu_1) &= \lambda_{0,1}(t_1)\, \phi(\mathbf{x}_1, \boldsymbol{\beta}_1)\, \nu_1, \\
\lambda_2(t_2 \mid \mathbf{x}_2^{\dagger}, \nu_2) &= \lambda_{0,2}(t_2)\, \phi(\mathbf{x}_2^{\dagger}, \boldsymbol{\beta}_2)\, \nu_2,
\end{aligned}
\tag{19.25}
$$

where $\mathbf{x}_2^{\dagger} = (\mathbf{x}_2, t_1)$ is the augmented vector of covariates. Note that there is an
endogeneity problem here if $\nu_1$ and $\nu_2$ are correlated since, in that case, $t_1$ and $\nu_2$ cannot be
independent.

The  previous  occurrence  of  a  spell  may  not  simply  shift  the  hazard  function  in  the 
succeeding  spell.  It  may  also  alter  the  specification  of  the  hazard  by  bringing  in  new 
covariates.  For  example,  an  unemployment  spell  may  induce  enrollment  into  a  training 
program,  which  plausibly  could  impact  the  hazard  of  a  later  spell  of  unemployment. 
If  the  training  variable  were  treated  as  weakly  exogenous,  identification  of  the  model 
would  be  under  threat.  This  point  is  relevant  even  for  the  analysis  of  a  single-spell 
model:  The  assumption  that  covariates  and  unobserved  heterogeneity  are  uncorrelated 
is  not  innocuous. 

In some cases it may be desirable to model not only multiple spells in one state but
also those in other related states. For example, there may be two states, employed or
not employed, and we may be interested not just in how the length of the last unemployment
spell affects the length of the current unemployment spell but also in the effect of the
intervening employment spell on the hazard out of unemployment. Further, we might
observe data on individuals when they are in one state but not another. For example,
administrative data may cover people when on welfare but not when off welfare.


19.4.2.  A  More  General  Model  of  Multiple  Spells 

To illustrate the potential computational complexity of multiple-spell models, we
describe briefly the model of Mealli and Pudney (1996).

Let $\tau = (\tau_1, \ldots, \tau_k)$ denote the $k$-dimensional vector of complete spells, $r_{j-1}$ the
index of origin state, and $r_j$ the index of destination state. Assume independence of
durations across spells after controlling for possible lagged duration dependence. Let
$\lambda_{r_j}(\tau_j \mid \mathbf{x}_j, \boldsymbol{\beta}_{r_j})$ denote the destination-specific hazard function, and let $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_k]$,
$\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k]$.

The  joint  density  of  spells  and  exit  routes  is  given  by 

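$$
\begin{aligned}
f(\tau_1, r_1, \tau_2, r_2, \ldots, \tau_k \mid \mathbf{x}_1, \ldots, \mathbf{x}_k, r_0, \boldsymbol{\beta})
&= f(\tau_1, r_1 \mid \mathbf{x}_1, r_0; \boldsymbol{\beta}) \cdots f(\tau_{k-1}, r_{k-1} \mid \mathbf{x}_{k-1}, r_0, r_1, \ldots, r_{k-2}, \boldsymbol{\beta}) \\
&\quad \times S(\tau_k \mid \mathbf{x}_k, r_0, r_1, \ldots, r_{k-1}, \boldsymbol{\beta}) \\
&= \prod_{j=1}^{k-1} \lambda_{r_j}(\tau_j \mid \mathbf{x}_j, \boldsymbol{\beta}_{r_j}) \exp\left( -\sum_{j=1}^{k} \Lambda_0(\tau_j \mid \mathbf{x}_j, \boldsymbol{\beta}) \right),
\end{aligned} \tag{19.26}
$$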

where it has been assumed that the $k$th spell is censored (in progress) and we use
relationships (17.4) and (17.6). The covariates include some that vary across spells
and possibly lagged durations. This formulation may be compared with the single-spell
CRM formulation (19.7).

Mealli and Pudney (1996) build an elaborate model using this formulation as
the basis. Because they allow for unobserved heterogeneity with even more complex
structure than that considered in this chapter, their computational procedure is
also more complicated. They use the method of simulated maximum likelihood (see
Section 12.4).


19.5.  Competing  Risks  Example:  Unemployment  Duration 

The duration examples used in Chapters 17 and 18 focused on the time in an
unemployment spell, ignoring the destination state after transition. Here we implement a
competing risks analysis of the same data used in McCall (1996). The data distinguish
three different destination states: full-time employment in the first postdisplacement
job, part-time employment in the first postdisplacement job, and either full-time or
part-time status in the first postdisplacement job the employee had left by the time of
the survey. One can therefore relax the assumption that the hazard function does not
depend on the destination state and consider instead the competing risks formulation
in which independent competing risks determine the duration of unemployment.

For the McCall data set there are 1073, 339, and 574 transitions, respectively, to
each of the three states mentioned. The third destination state lacks a clear
interpretation, so the results for that case are not discussed in detail. For each transition we
estimated four parametric duration models, exponential and Weibull, with and without
inverse-Gaussian heterogeneity. Gamma heterogeneity was also considered but this
model was computationally unstable. Because of the assumption of independent
competing risks, estimation can be carried out one equation at a time. Selected extracts
of the computer output, with focus only on a limited number of variables as in
Chapters 17 and 18, are given in Tables 19.2 and 19.3.

Table 19.2. Unemployment Duration: Competing and Independent Risk Estimates of
Exponential Model with and without IG Frailty

                      No Heterogeneity                  IG Heterogeneity
Coefficient      Risk 1    Risk 2    Risk 3      Risk 1    Risk 2    Risk 3
Transitions       1,073       339       574       1,073       339       574

RR                 .472     -.092     -.600        .504     -.185     -.562
                 (.601)    (.976)    (.725)      (.614)   (1.025)    (.744)
DR                -.575     -.959     1.122       -.806    -1.051     1.078
                 (.762)   (1.247)    (.901)      (.781)   (1.295)    (.921)
UI               -1.424    -1.047     -.966      -1.544    -1.092     -.963
                 (.249)    (.524)    (.449)      (.258)    (.544)    (.456)
RRUI               .966     -.669     -.432       1.057     -.742     -.482
                 (.612)   (1.192)   (1.014)      (.627)    (1.23)   (1.033)
DRUI              -.198     1.987     2.102       -.012      2.18     2.158
                (1.019)   (1.727)   (1.303)     (1.041)   (1.788)   (1.323)
LNWAGE             .351     -.257      .003        .373     -.321     -.007
                 (.116)    (.179)    (.145)      (.118)    (.191)    (.147)
TENURE                0      .005     -.047       .0006      .007     -.047
                 (.006)    (.013)    (.012)      (.007)    (.014)    (.012)

-ln L                    5,693.63                        5,687.64
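
The one-equation-at-a-time logic can be illustrated with a minimal sketch. It assumes constant (exponential) destination-specific hazards and no covariates or heterogeneity, so it is far simpler than the models behind Table 19.2. Under independent competing risks the log-likelihood for destination j treats exits to every other destination as censored, and the MLE of each hazard is the number of exits to j divided by total time at risk. The hazard values, sample size, and function names below are hypothetical and are not taken from the McCall data.

```python
import random

def simulate_spells(rates, n, seed=0):
    """Each spell ends at the smallest of independent exponential latent durations."""
    rng = random.Random(seed)
    spells = []
    for _ in range(n):
        latent = [rng.expovariate(rate) for rate in rates]
        duration = min(latent)
        spells.append((duration, latent.index(duration)))  # (duration, destination)
    return spells

def exponential_crm_mle(spells, n_risks):
    """Destination-specific MLE: exits to destination j divided by total time at risk."""
    total_time = sum(duration for duration, _ in spells)
    exits = [0] * n_risks
    for _, destination in spells:
        exits[destination] += 1
    return [count / total_time for count in exits]

true_rates = [0.5, 0.2, 0.3]  # hypothetical hazards for three destinations
spells = simulate_spells(true_rates, n=20000, seed=42)
estimates = exponential_crm_mle(spells, n_risks=3)
print([round(rate, 3) for rate in estimates])
```

Each estimate uses the same denominator, total time at risk, which is why the three hazards can be estimated separately without loss of information in this simple case.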

19.5.1.  Estimates  under  Competing  Risks  Framework 

Pairwise comparison of exponential models with and without heterogeneity shows that an
improvement in the log-likelihood results from the introduction of unobserved
heterogeneity. This result is similar to the pattern reported in Section 18.8. However, the
Weibull model without heterogeneity has a significantly higher log-likelihood than the
exponential model, -5,666 against -5,693. The Weibull model with inverse-Gaussian
heterogeneity has the highest log-likelihood, -5,543, and seems to be the best of the
four models. This should not be interpreted to mean that it is a satisfactory model
for inference; that issue remains open. Henceforth we shall discuss the results in
Table 19.3.
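
The size of these log-likelihood differences can be made concrete with a small arithmetic sketch. It uses only the -ln L values reported in Tables 19.2 and 19.3 and computes twice the gain in log-likelihood for each nested pair. The dictionary keys are labels chosen here for illustration. The statistics are best read descriptively: a null of no heterogeneity places the frailty variance on the boundary of the parameter space, so the usual chi-square reference distribution need not apply.

```python
# -ln L values as reported in Tables 19.2 and 19.3
neg_loglik = {
    "exponential": 5693.63,
    "exponential_ig": 5687.64,
    "weibull": 5666.13,
    "weibull_ig": 5543.33,
}

def lr_statistic(restricted, unrestricted):
    """Twice the gain in log-likelihood from moving to the less restricted model."""
    return 2.0 * (neg_loglik[restricted] - neg_loglik[unrestricted])

comparisons = [
    ("exponential", "exponential_ig"),
    ("exponential", "weibull"),
    ("weibull", "weibull_ig"),
]
lr = {pair: lr_statistic(*pair) for pair in comparisons}
for (restricted, unrestricted), value in lr.items():
    print(f"{restricted} vs {unrestricted}: LR = {value:.2f}")
```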

Introduction of unobserved heterogeneity in the Weibull model leads to a substantial
increase in the estimate of the hazard function slope coefficient in all three hazard
functions. This coefficient increases from 1.29 to 1.75 for risk 1, and from 1.08 to 1.65 for
risk 2. In other words, the introduction of unobserved heterogeneity leads to a stronger
indication of positive duration dependence, that is, a more steeply rising hazard out of
unemployment. These changes are along the lines predicted by the analysis of Section 18.5.
In the Weibull model the impact of adding unobserved heterogeneity on the coefficient of
unemployment insurance (UI) is also quite substantial: the coefficient becomes considerably
larger in absolute magnitude. The coefficients of RR, DR, RRUI, and DRUI remain
imprecisely determined. The coefficient of LNWAGE is significant and positive in the first
hazard function, but not in the second. That is, an increase in LNWAGE accelerates the
transition out of unemployment of those seeking full-time employment but has a
negligible impact on those who transit to part-time employment. This exemplifies how the
competing risks framework may allow us to distinguish between the different roles of a
variable in different hazard functions.

Table 19.3. Unemployment Duration: Competing and Independent Risk Estimates of Weibull Model with and
without IG Frailty

                      No Heterogeneity                  IG Heterogeneity
                 Risk 1    Risk 2    Risk 3      Risk 1    Risk 2    Risk 3
Hazard slope       1.29      1.08      1.17        1.75      1.65      1.79
                 (.022)    (.033)    (.028)       (.04)     (.06)    (.048)

-ln L                    5,666.13                        5,543.33

Figure 19.1: Unemployment duration: estimated baseline survival functions from the Cox
Competing Risks model. U.S. data from 1986-92 on 3343 spells, some incomplete.

Also consider the Cox model specification of the competing risks model given in
Section 19.2. In this specification unobserved heterogeneity is ignored and the
baseline hazard is not parametrically specified, but it can be estimated as explained in
Section 17.8.3. The point estimates, comparable to those for the exponential model
in Table 19.2, are given in the last three columns of Table 19.3, but the standard
errors are much larger, as the Cox specification is less restrictive than the exponential.
The estimated coefficient of unemployment insurance is closer to that in the
exponential model than to that in the Weibull-IG model; the latter is almost twice as large.
The LNWAGE coefficient is also larger in the Weibull-IG model. However, given that
unobserved heterogeneity is ignored, identification of the baseline hazard is not
possible. Figures 19.1 and 19.2 show, respectively, the computed baseline survival functions
and the cumulated hazard functions for the three destinations, but these are better
interpreted as reflecting some unknown mixture of unobserved heterogeneity and duration
dependence. These estimates show that the baseline survival function for those
transiting to full-time employment lies below the other two, and that for
those transiting to part-time employment it is the flattest and the highest.
Correspondingly, the cumulated hazard function for those transiting to full-time employment is
the steepest of the three.
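
To see how destination-specific cumulated hazards of this kind are built up, the following sketch computes a covariate-free Nelson-Aalen estimate for each destination. This is an illustrative analogue under stated assumptions, not the Cox baseline estimator of Section 17.8.3: there are no regressors, durations are simulated with hypothetical constant hazards, and there are no ties. At each exit to the destination of interest the estimate jumps by one over the number still at risk, and exits to other destinations only shrink the risk set.

```python
import random

def cause_specific_cumulative_hazard(spells, cause):
    """Nelson-Aalen steps for exits to `cause`; other exits only shrink the risk set."""
    at_risk = len(spells)
    cumulative = 0.0
    path = []
    for duration, destination in sorted(spells):
        if destination == cause:
            cumulative += 1.0 / at_risk
            path.append((duration, cumulative))
        at_risk -= 1
    return path

def value_at(path, t):
    """Evaluate the step function at time t."""
    value = 0.0
    for duration, cumulative in path:
        if duration > t:
            break
        value = cumulative
    return value

rng = random.Random(7)
rates = [0.5, 0.2, 0.3]  # hypothetical constant hazards, one per destination
spells = []
for _ in range(20000):
    latent = [rng.expovariate(rate) for rate in rates]
    spells.append((min(latent), latent.index(min(latent))))

hazards_at_one = [
    value_at(cause_specific_cumulative_hazard(spells, j), 1.0) for j in range(3)
]
print([round(h, 3) for h in hazards_at_one])
```

With constant hazards the cumulated hazard at duration 1 should be close to the hazard itself, so the destination with the largest hazard has the steepest curve.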

The discussion and analysis presented here are only illustrative, not final in any sense.
Indeed, there remain good reasons to suggest that the Weibull hazard function is a
misspecification. McCall’s (1996) analysis of the same data set allows for a more flexible
polynomial hazard function and comes up with evidence supporting a bathtub-shaped
hazard, which implies a decreasing hazard at low durations, then a fairly constant hazard,
and eventually a rising hazard at high durations. The monotonic Weibull hazard
function does not capture this possibility. The experience of other researchers modeling
unemployment duration using U.S. data has revealed that when the hazard
function is flexibly specified, the introduction of unobserved heterogeneity does not have a
large impact on the results (Meyer, 1990; Han and Hausman, 1990). The fact that we
do not see that result here should motivate the use of a more flexible specification such
as the one analyzed in Section 17.10.

Figure 19.2: Unemployment duration: estimated baseline cumulative hazards from the Cox
Competing Risks model. Same data as Figure 19.1.


19.6.  Practical  Considerations 

In fitting multivariate survival models it is practical to begin with marginal models
before undertaking simultaneous estimation. Such a strategy can be helpful in
assessing the statistical adequacy of the initial specification.

At the time of this writing, the statistical implementation of multivariate survival
and hazard models will in most cases require one’s own programming. This task can
be partially eased by supporting software, such as optimization routines for
maximizing or minimizing user-defined functions, written in the programming
languages offered by many statistical packages and programming platforms.

The CRM with independent risks reduces to estimation of a series of survival
models for which practical information was given in Section 17.12. Programs for
general multivariate CRM are not easy to find in commercial software. Some multivariate
survival models with special dependence structure are supported. For example, STATA
supports computation of the shared frailty model. A shared frailty model is a random
effects model where the components of unobserved heterogeneity are common to, or
shared among, groups of individuals or spells and are randomly distributed across
groups.
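
The dependence that a shared frailty induces can be seen in a small simulation. The sketch below is not the STATA implementation and makes its own assumptions: two spells per group, a gamma frailty with mean one, exponential durations whose hazard is the frailty itself, and no covariates. Function names and parameter values are hypothetical. Correlation is measured on log durations because the durations themselves are heavy tailed under this mixture.

```python
import math
import random

def simulate_pairs(n_groups, shape, shared, seed=0):
    """Two exponential spells per group; the gamma frailty has mean 1, variance 1/shape."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_groups):
        v1 = rng.gammavariate(shape, 1.0 / shape)
        v2 = v1 if shared else rng.gammavariate(shape, 1.0 / shape)
        pairs.append((rng.expovariate(v1), rng.expovariate(v2)))
    return pairs

def log_duration_correlation(pairs):
    """Sample correlation between the log durations of the two spells."""
    first = [math.log(a) for a, _ in pairs]
    second = [math.log(b) for _, b in pairs]
    n = len(pairs)
    mean_1, mean_2 = sum(first) / n, sum(second) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    var_1 = sum((a - mean_1) ** 2 for a in first)
    var_2 = sum((b - mean_2) ** 2 for b in second)
    return cov / math.sqrt(var_1 * var_2)

shared_corr = log_duration_correlation(simulate_pairs(10000, 2.0, shared=True, seed=1))
independent_corr = log_duration_correlation(simulate_pairs(10000, 2.0, shared=False, seed=1))
print(round(shared_corr, 3), round(independent_corr, 3))
```

A smaller shape parameter gives a larger frailty variance and hence stronger within-group dependence, which is the sense in which the frailty distribution governs the dependence structure.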

If the main interest is in modeling the dependence structure among durations, the
copula approach, because it does not require numerical integration, is potentially
attractive relative to maximum simulated likelihood for the bivariate case. For
dimensions higher than two, as in the case of multiple-spell models, it is feasible but there
are relatively few examples in the published literature. Marginal models can be fitted
and tested using standard univariate survival models, and the dependence parameter
can be estimated in a sequential second-stage procedure. Even if all parameters are
to be estimated simultaneously, the estimated marginal models provide a useful set of
starting values for the iterative computation. We are unaware of statistical software
that supports the estimation of these models.

19.7.  Bibliographic  Notes 

19.2  Han  and  Hausman  (1990)  give  an  empirical  example  of  CRM  in  which  the  specification  is 
generalized  to  allow  for  unobserved  heterogeneity.  Within  the  framework  of  the  CRM  with 
state-specific  random  effects,  McCall  (1996)  analyzes  the  impact  of  some  policy  variables 
on  the  behavior  of  the  insured  unemployed  seeking  part-time  work  using  the  CRM  model 
with  correlated  risks.  In  Butler,  Anderson,  and  Burkhauser  (1989)  the  hazards  of  accepting 
a  job  and  of  dying  are  modeled  using  a  CRM  with  correlated  risks. 

19.3 Sklar’s pioneering article on copulas appeared in 1959 in French, but Sklar (1973) is a
good substitute in English. Radulovic and Wegkamp (undated) provide a proof of Sklar’s
Theorem. A very helpful guided tour of the copula literature with an annotated
bibliography is given by Frees and Valdez (1998).

19.4 Multiple spells are studied by Mealli and Pudney (1996) and by Flinn and Heckman
(1982). Mealli and Pudney (1996) analyze transitions among pensionable jobs,
nonpensionable jobs, and other labor market states using simulation-based estimation methods.


- Exercises - 

19-1 (Adapted from Sapra, 2000; 2001.) This problem involves an example that
illustrates the Cox-Tsiatis nonidentification result for competing risks mentioned
in Section 19.2. Consider the following dependent competing risks model in
which we observe $T = \min(T_1, T_2)$ and $\delta$, where $\delta = 1$ if $T = T_1$, and $\delta = 2$ if
$T = T_2$. Here $T_1$ and $T_2$ are latent durations of risks 1 and 2, respectively.
Suppose that the bivariate joint survivor function is $S(t_1, t_2) = \exp[-(\lambda_1 t_1 + \lambda_2 t_2)^{\alpha}]$,
$0 < \alpha < 1$, $\lambda_1, \lambda_2 > 0$. Construct an independent CRM that is equivalent to the
specified dependent competing risks model.

19-2 For the model specified in the preceding problem, write down the log-likelihood
function for each model in terms of hazard rates and integrated hazard rates, if
both $T$ and $\delta$ are observed. Examine the information matrix of the parameters,
and show that all the parameters are locally identified because it is nonsingular.

19-3 Consider two parallel durations, say the duration of unemployment, $T_1$, and the
duration of a spell without private health insurance, $T_2$. Assume that conditional
on unobserved heterogeneity the durations are independent and exponentially
distributed with means $\beta_0 + \beta_1 x$ and $\gamma_0 + \gamma_1 x$, respectively. Suppose that
multiplicative unobserved heterogeneity terms for the two duration models are $\nu_1$ and
$\nu_2$, with $\mathrm{E}[\nu_1] = \mathrm{E}[\nu_2] = 1$.

(a) For parameter values of your choice, write an algorithm to generate
correlated realizations for $(\nu_1, \nu_2)$ such that unconditionally on $(\nu_1, \nu_2)$, but
conditionally on $x$, the two durations will be correlated. You are free to
make distributional assumptions for the joint distribution of $(\nu_1, \nu_2)$ that are
appealing on grounds of mathematical convenience or other
considerations. Explain how you can control the extent of correlation between the
two durations.

(b)  Using  the  technique  for  obtaining  a  bivariate  joint  distribution  given  in 
Section  19.3.2,  derive  the  joint  distribution  of  durations. 

(c) Describe how you might extend the analysis of part (b) to allow for the
presence of right-censored durations.

19-4 Using the same subsample of the McCall data set as in Chapter 18, estimate
a two-state model with unemployment and employment as the two states
(i.e., ignoring the distinction between part-time and full-time employment as two
alternative destinations).

(a)  Fit  the  single-equation  Weibull  model  and  compare  the  results  with  those  for 
independent  CRM  with  the  Weibull  specification. 

(b) Evaluate the improvement in goodness of fit resulting from the CRM
specification.

(c)  Evaluate  and  compare  the  fitted  values  of  the  hazard  out  of  unemployment, 
evaluated  at  sample  averages  of  the  explanatory  variables,  from  the  single 
equation  and  the  CRM  models. 




CHAPTER  20 


Models  of  Count  Data 


20.1.  Introduction 

In many economic contexts the dependent or response variable of interest is a
nonnegative integer or count that we wish to explain or analyze in terms of a set of
regressors. Unlike the classical regression model, the response variable is discrete, with
a distribution that places probability mass at nonnegative integer values only. Several
models discussed earlier in the book, such as the binary outcome model and the
duration model, can be shown to be closely related to the count data regression model.
Regression models for counts, like other limited or discrete dependent variable models
such as the logit and probit, are nonlinear with many properties and special features
intimately connected to discreteness and nonlinearity.

Let us consider some examples from microeconometrics, beginning with sample
data that are independent cross-section observations. Fertility studies often model the
number of live births over a specified age interval of the mother, with interest in
analyzing its variation in terms of, say, mother’s schooling, age, and household income
(Winkelmann, 1995). In some models of family decisions the number of children may
appear as an explanatory variable with the acknowledgment that the variable is
endogenous. Accident analysis studies model airline safety as measured by the number
of accidents experienced by an airline over some period and seek to determine its
relationship to airline profitability and other measures of the financial health of the airline
(Rose, 1990). Recreational demand studies seek to place a value on natural resources
such as national forests by modeling the number of trips to a recreational site (Gurmu
and Trivedi, 1996). Health demand studies model data on the number of times that
individuals consume a health service, such as visits to a doctor or days in the hospital
in the past year (Cameron et al., 1988). If we wish to analyze the relation between this
variable and factors such as health status and health insurance, again a count regression
is relevant.

The main modeling approaches are presented in Sections 20.2-20.5. Section 20.2
details the Poisson regression model. Section 20.3 gives an application to data from the
famous RHIE. The Poisson regression model is often too restrictive and other, more
commonly used, fully parametric count models are presented in Section 20.4.
Less-used alternative parametric approaches for counts, such as discrete choice models, are
also presented in this section. The partially parametric approach of modeling the
conditional mean and conditional variance is detailed in Section 20.5. Multivariate count
models and models with endogenous regressors are given an introductory treatment in
Section 20.6. Section 20.7 illustrates various models by application to the RHIE data.
This is followed by a discussion of some practical issues. For pedagogical reasons
the Poisson regression model for cross-section data is presented in some detail. The
other models, many superior to Poisson, are presented in less detail for space reasons.
For more complete treatment see Cameron and Trivedi (1998) and the Bibliographic
Notes.

Table 20.1. Proportion of Zero Counts in Selected Empirical Studies

Study                          Variable                    Sample Size   Proportion of Zeros
Cameron et al. (1988)          Doctor visits                     5,190   0.798
Pohlmeier and Ulrich (1995)    Specialist visits                 5,096   0.678
Grootendorst (1995)            Prescription drugs                5,743   0.224
Deb and Trivedi (1997)         Number of hospital stays          4,406   0.806
Gurmu and Trivedi (1996)       Recreational trips                  659   0.632
Geil et al. (1997)             Hospitalizations                 30,590   0.899
Greene (1997)                  Major derogatory reports          1,319   0.803


20.2.  Basic  Count  Data  Regression 

In some cases, such as number of births, the count is the variable of ultimate
interest. In other cases, such as medical demand and results of research and development
expenditure, the variable of ultimate interest is continuous, often expenditures or
receipts measured in dollars, but the best data available are instead a count. In many
cases, the sample is concentrated on a few small discrete values, say 0, 1, and 2.
Table 20.1 illustrates this point by reference to the proportion of zero counts observed
in several published econometric models; this proportion can be as high as 90% in
some cases. Also, the data can be skewed to the right. Finally, the data are
intrinsically heteroskedastic with variance increasing with the mean.


20.2.1.  Poisson  Regression 

The Poisson is the starting point for count data analysis, though it is often inadequate.
In Sections 20.2.1-20.2.3 we present the Poisson regression model, which was
previously introduced in Section 5.2, and estimation by maximum likelihood,
interpretation of the estimated coefficients, and extensions to truncated and censored data. In
Section 20.2.3 we also present the quasi-MLE based on the Poisson distribution with
correctly specified conditional mean, but with possibly misspecified conditional
variance function. Limitations of the Poisson model, notably its property of equidispersion,
are presented in Section 20.2.4.

Table 20.2. Summary of Data Sets Used in Recent Patent-R&D Studies

Study                        Sample Size     Mean   Std. Error   Maximum Patents   Proportion of Zeros
Cincera (1997)                       181     60.8        721.6               925   <0.19
Crepon and Duguet (1997b)            698     11.6         na^a                na   0.441
Crepon and Duguet (1997a)            451     2.73        11.45                na   0.729
Hausman et al. (1984)                346     32.1        66.36               515   0.220
Wang et al. (1998)                    70    23.46        39.10               173   0.186

^a na, not available.

There is a qualification: In some cases a high proportion of zeros in the sample
may coexist with very large values of counts, creating a difficult modeling challenge.
Table 20.2 illustrates this feature using information from five studies that have
investigated the relationship between patent counts and research and development (R&D)
expenditure. Observe how large the maximum observed value of the count is relative
to the sample mean. The modeling challenge is to select a functional form that can
adequately capture the large mean and the high proportion of zeros. In many other
examples, such as number of births, virtually all the data are restricted to single digits,
and the mean number of events is quite low.

These  features  motivate  the  application  of  special  methods  and  models  for  count 
regression.  There  are  two  ways  to  proceed. 

The first approach is a fully parametric one that completely specifies the
distribution of the data, fully respecting the restriction of $y$ to nonnegative integer values. This
approach was taken in early applications, mostly in biostatistics, where count
regression was seen as an extension and generalization of a vast literature on the distribution
of independent and identically distributed counts. It was also taken in the influential
econometrics study by Hausman et al. (1984).

The second approach is a mean-variance approach, which specifies the
conditional mean to be nonnegative and specifies the conditional variance to be a function
of the conditional mean. This models well the nonnegativity and heteroskedasticity
but does not address the discreteness of the data. This approach, in a framework not
limited to only count data, was introduced by Nelder and Wedderburn (1972),
leading to the generalized linear model approach widely used in statistics (McCullagh and
Nelder, 1989). In econometrics this approach was introduced by Gourieroux, Monfort,
and Trognon (1984a,b) and is best viewed as a specialization of the generalized method
of moments.


20.2.2.  Poisson  MLE  and  QMLE 

The Poisson MLE and quasi-MLE (QMLE) were introduced and studied in Chapter 5
as an example of m-estimation. Here we give a more complete treatment.

The natural stochastic model for counts is a Poisson point process for the
occurrence of the event of interest. This implies a Poisson distribution for the number of
occurrences of the event, with density, or more formally probability mass function,

$$
\Pr[Y = y] = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots, \tag{20.1}
$$

where $\mu$ is the intensity or rate parameter. We refer to the distribution as $\mathcal{P}[\mu]$. The
first two moments are

$$
\begin{aligned}
\mathrm{E}[Y] &= \mu, \\
\mathrm{V}[Y] &= \mu.
\end{aligned} \tag{20.2}
$$

This shows the well-known equidispersion (equality of mean and variance) property
of the Poisson distribution.
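
Equidispersion can be checked numerically from the probability mass function (20.1). The sketch below sums the mass function over a finite grid of counts and computes the implied mean and variance. The value of the rate parameter and the truncation point are arbitrary choices for illustration; the truncation is harmless here because the omitted tail mass is negligible for a small rate.

```python
import math

def poisson_pmf(y, mu):
    """Probability mass function (20.1), evaluated on the log scale for stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def poisson_moments(mu, y_max=200):
    """Total mass, mean, and variance implied by the pmf on 0, 1, ..., y_max."""
    probs = [poisson_pmf(y, mu) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, variance

total, mean, variance = poisson_moments(3.5)  # 3.5 is an arbitrary illustrative rate
print(round(total, 6), round(mean, 6), round(variance, 6))
```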

By introducing the observation subscript $i$, attached to both $y$ and $\mu$, the iid
framework is extended to the regression case. The Poisson regression model is derived from
the Poisson distribution by parameterizing the relation between the mean parameter $\mu$
and covariates (regressors) $\mathbf{x}$. The standard assumption is to use the exponential mean
parameterization,

$$
\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta}), \quad i = 1, \ldots, N, \tag{20.3}
$$

where by assumption there are $K$ linearly independent covariates, usually including a
constant. Because $\mathrm{V}[y_i \mid \mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$, by (20.2) and (20.3), the Poisson regression is
intrinsically heteroskedastic.

Given (20.1) and (20.3) and the assumption that the observations $(y_i \mid \mathbf{x}_i)$ are
independent, the most natural estimator is maximum likelihood. The log-likelihood
function is

$$
\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left\{ y_i \mathbf{x}_i'\boldsymbol{\beta} - \exp(\mathbf{x}_i'\boldsymbol{\beta}) - \ln y_i! \right\}. \tag{20.4}
$$

The Poisson MLE, denoted $\hat{\boldsymbol{\beta}}_P$, is the solution to $K$ nonlinear equations corresponding
to the first-order condition for maximum likelihood,

$$
\sum_{i=1}^{N} \left( y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) \right) \mathbf{x}_i = \mathbf{0}. \tag{20.5}
$$

If $\mathbf{x}_i$ includes a constant term then the residuals $y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$ sum to zero by (20.5).
The log-likelihood function is globally concave; hence solving these equations by
a Gauss-Newton or Newton-Raphson iterative algorithm yields unique parameter
estimates.
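
A minimal Newton-Raphson sketch for the first-order conditions (20.5) follows. It is an illustration under stated assumptions rather than production code: the data are simulated with hypothetical coefficients, the iteration starts from zero with no step-size control, and the helper names (`solve`, `poisson_mle`, `draw_poisson`) are invented here. Each iteration adds to the current estimate the solution of the linear system formed by the information matrix, the sum of $\mu_i \mathbf{x}_i \mathbf{x}_i'$, and the score.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def poisson_mle(xs, ys, tol=1e-10, max_iter=50):
    """Newton-Raphson on the score (20.5) with conditional mean exp(x'beta)."""
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        mu = [math.exp(sum(b * v for b, v in zip(beta, x))) for x in xs]
        score = [sum((y - m) * x[j] for x, y, m in zip(xs, ys, mu)) for j in range(k)]
        info = [[sum(m * x[j] * x[l] for x, m in zip(xs, mu)) for l in range(k)]
                for j in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def draw_poisson(rng, mu):
    """Knuth's multiplication method; adequate for small mu."""
    limit, count, product = math.exp(-mu), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(123)
true_beta = [0.5, 0.3]  # hypothetical intercept and slope
xs = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(2000)]
ys = [draw_poisson(rng, math.exp(true_beta[0] + true_beta[1] * x[1])) for x in xs]
beta_hat = poisson_mle(xs, ys)
residual_sum = sum(y - math.exp(beta_hat[0] + beta_hat[1] * x[1]) for x, y in zip(xs, ys))
print([round(b, 3) for b in beta_hat], round(residual_sum, 8))
```

Because the regressors include a constant, the fitted residuals sum to zero up to numerical error, which gives a quick check that the first-order conditions have been solved.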

In the econometrics literature pseudo-ML (PML) or quasi-ML (QML) estimation
refers to estimating by ML, under misspecification of the specified density (Gourieroux
et al., 1984a). The terms PML and QML are often used interchangeably. The
distribution of the estimator is obtained under weaker assumptions about the data-generating
process than those that led to the specified likelihood function; see Section 5.7. In the
statistics literature QML often refers to nonlinear generalized least-squares estimation.
For the Poisson regression, QML in the latter sense is equivalent to standard maximum
likelihood.

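From (20.5), the Poisson PML estimator, $\hat{\boldsymbol{\beta}}_P$, has first-order conditions
$\sum_{i=1}^{N} (y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}))\mathbf{x}_i = \mathbf{0}$. As already noted, the summation on the left-hand side has
expectation zero if $\mathrm{E}[y_i \mid \mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. Hence the Poisson PML is consistent under the
weaker assumption of correct specification of the conditional mean; that is, the data
need not be Poisson distributed. Using results given in Section 5.2.3, the variance
matrix is of the sandwich form, with

$$
\mathrm{V}_{\mathrm{PML}}[\hat{\boldsymbol{\beta}}_P] = \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \left( \sum_{i=1}^{N} \omega_i \mathbf{x}_i \mathbf{x}_i' \right) \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \tag{20.6}
$$

and $\omega_i = \mathrm{V}[y_i \mid \mathbf{x}_i]$ is the conditional variance of $y_i$.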

By standard ML theory if the stronger assumption is made that the Poisson
regression is parametrically correctly specified, so that $\omega_i = \mu_i$, the estimator $\hat{\boldsymbol{\beta}}_P$ is
consistent for $\boldsymbol{\beta}$ and asymptotically normal with the sample covariance matrix

$$
\mathrm{V}[\hat{\boldsymbol{\beta}}_P] = \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1}, \tag{20.7}
$$

in the case where $\mu_i$ is of the exponential form (20.3).

The Poisson ML and PML estimators are identical but have different variances. The
empirical implementation of the more robust estimate (20.6) is presented in Section
20.5.1.
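
The difference between the two variance formulas is easiest to see in the intercept-only case, where everything is scalar. In the sketch below the regressor is a constant, so the first-order condition gives an estimated mean equal to the sample mean, and the two sums in (20.6) reduce to simple totals. Two assumptions are made for illustration: the counts are made-up data, and the unknown conditional variance is replaced by the squared residual, which is one common choice and not necessarily the implementation of Section 20.5.1.

```python
import math

# Hypothetical overdispersed counts, used only for illustration
counts = [0, 0, 0, 1, 0, 2, 0, 7, 0, 0, 3, 0, 12, 1, 0, 0, 5, 0, 0, 1]

n = len(counts)
y_bar = sum(counts) / n
beta_hat = math.log(y_bar)   # solves sum(y_i - exp(beta)) = 0 when x_i = 1
mu_hat = math.exp(beta_hat)  # fitted mean, equal to the sample mean

a = n * mu_hat                              # sum of mu_i x_i x_i' with x_i = 1
b = sum((y - mu_hat) ** 2 for y in counts)  # omega_i replaced by the squared residual

var_ml = 1.0 / a         # (20.7): appropriate only under equidispersion
var_robust = b / a ** 2  # (20.6): sandwich form
ratio = var_robust / var_ml
print(round(var_ml, 5), round(var_robust, 5), round(ratio, 4))
```

In this scalar case the ratio of the robust to the ML variance equals the sample variance divided by the sample mean, so the ML formula understates the variance whenever the data are overdispersed.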


20.2.3.  Interpretation  of  Regression  Coefficients 

For linear models, with $\mathrm{E}[y \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$, the coefficients $\boldsymbol{\beta}$ are readily interpreted as the
effect of a one-unit change in regressors on the conditional mean. For nonlinear
models this interpretation needs to be modified; see the general discussion given in
Section 5.2.4. For any model with exponential conditional mean, differentiation yields

$$
\frac{\partial \mathrm{E}[y \mid \mathbf{x}]}{\partial x_j} = \beta_j \exp(\mathbf{x}'\boldsymbol{\beta}), \tag{20.8}
$$

where the scalar $x_j$ denotes the $j$th regressor. For example, if $\hat{\beta}_j = 0.25$ and
$\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}) = 3$, then a one-unit change in the $j$th regressor increases the expectation
of $y$ by 0.75 units. This partial response depends on $\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$, which is expected to
vary across individuals. It is easy to see that $\beta_j$ measures the relative change in $\mathrm{E}[y \mid \mathbf{x}]$
induced by a unit change in $x_j$. If $x_j$ is measured on a logarithmic scale, $\beta_j$ is an
elasticity.

For purposes of reporting a single response value, a good candidate is an estimate of
the average response, $N^{-1}\sum_i \partial E[y_i|x_i]/\partial x_{ij} = \hat{\beta}_j \times N^{-1}\sum_i \exp(x_i'\hat{\beta})$.
For Poisson regression models with intercept included, this can be shown to simplify to
$\bar{y}\hat{\beta}_j$.

Another  consequence  of  (20.8)  is  that  if,  say,  ftj  is  twice  as  large  as  ftk,  then  the 
effect  of  changing  the  j  th  regressor  by  one  unit  is  twice  that  of  changing  the  kth  re¬ 
gressor  by  one  unit. 
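
The three interpretive claims in this subsection each follow from a one-line argument. The sketch below spells them out in the notation of (20.5) and (20.8); it assumes only an exponential conditional mean and, for the second result, that the first regressor is a constant.

```latex
% (i) beta_j is the relative change in E[y|x] per unit change in x_j:
\frac{\partial \ln E[y|x]}{\partial x_j}
  = \frac{1}{E[y|x]}\,\frac{\partial E[y|x]}{\partial x_j}
  = \frac{\beta_j \exp(x'\beta)}{\exp(x'\beta)} = \beta_j .

% (ii) With an intercept, the first-order condition for the constant term is
\sum_{i=1}^{N}\bigl(y_i - \exp(x_i'\hat{\beta})\bigr) = 0
  \;\Longrightarrow\;
  N^{-1}\sum_{i=1}^{N}\exp(x_i'\hat{\beta}) = \bar{y},
% so the average response equals
N^{-1}\sum_{i=1}^{N}\hat{\beta}_j \exp(x_i'\hat{\beta}) = \hat{\beta}_j\,\bar{y} .

% (iii) Ratios of marginal effects do not depend on x:
\frac{\partial E[y|x]/\partial x_j}{\partial E[y|x]/\partial x_k}
  = \frac{\beta_j \exp(x'\beta)}{\beta_k \exp(x'\beta)} = \frac{\beta_j}{\beta_k} .
```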


20.2.4.  Overdispersion

The Poisson regression model is usually too restrictive for count data, leading to
alternative models presented in Sections 20.3 and 20.4. The fundamental problem is that
the distribution is parameterized in terms of a single scalar parameter ($\mu$) so that all
moments of $y$ are a function of $\mu$. By contrast the normal distribution has separate
parameters for location ($\mu$) and scale ($\sigma^2$). For the same reason the one-parameter
exponential is too restrictive for duration data and more general two-parameter
distributions such as the Weibull are superior. Note that this complication does not arise
with binary data. Then the distribution is clearly the one-parameter Bernoulli, because
if the probability of success is $p$ then the probability of failure must be $1 - p$. For
binary data the issue is instead how to parameterize $p$ in terms of regressors.

One  way  this  restrictiveness  manifests  itself  is  that  in  many  applications  a  Poisson 
density  predicts  the  probability  of  a  zero  count  to  be  considerably  less  than  is  actually 
observed  in  the  sample.  This  is  termed  the  excess  zeros  problem,  as  there  are  more 
zeros  in  the  data  than  the  Poisson  predicts. 

A  second  and  more  obvious  deficiency  of  the  Poisson  model  is  that  for  count  data 
the  variance  usually  exceeds  the  mean,  a  feature  called  overdispersion.  The  Poisson 
model  instead  implies  equality  of  the  variance  and  the  mean  (see  (20.2)),  a  property 
called  equidispersion. 

Overdispersion has qualitatively similar consequences to the failure of the assumption
of homoskedasticity in the linear regression model. Provided the conditional mean
is correctly specified, that is, (20.3) holds, the Poisson MLE is still consistent. This is
clear from inspection of the first-order conditions (20.5), since the left-hand side of
(20.5) will have expected value of zero if $E[y_i|x_i] = \exp(x_i'\beta)$. This consistency
property applies more generally to the quasi-MLE when the specified density is in the
LEF. Both Poisson and normal are members of the LEF discussed earlier in Section 5.7.3.
It is nonetheless important to control for overdispersion. First, in more
complicated settings such as with truncation and censoring, overdispersion leads to
the more fundamental problem of inconsistency. Second, even in the simplest settings
large overdispersion leads to grossly deflated standard errors and grossly inflated
t-statistics in the usual ML output, and hence it is important to use the previously given
robust variance estimator. Third, if one wants to estimate probabilities of number of
events, rather than merely the conditional mean, these depend on additional parameters
of the dgp.

Overdispersion may signal the presence of a more basic misspecification, especially
in data settings that involve truncation and censoring if they are ignored in estimation.
In such a case the conditional mean is incorrectly specified and the simultaneous
presence of overdispersion then leads to inconsistency, not only inefficiency, of the
MLE.

A statistical test of overdispersion is therefore highly desirable after running a
Poisson regression. Most count models with overdispersion specify overdispersion to
be of the form

$$V[y_i|x_i] = \mu_i + \alpha\, g(\mu_i), \tag{20.9}$$

where $\alpha$ is an unknown parameter and $g(\cdot)$ is a known function, most commonly
$g(\mu) = \mu^2$ or $g(\mu) = \mu$. It is assumed that under both null and alternative
hypotheses the mean is correctly specified as, for example, $\exp(x_i'\beta)$, whereas under
the null hypothesis $\alpha = 0$ so that $V[y_i|x_i] = \mu_i$. A simple overdispersion test
statistic for $H_0: \alpha = 0$ versus $H_1: \alpha \neq 0$ or $H_1: \alpha > 0$ can be computed
by estimating the Poisson model, constructing fitted values $\hat{\mu}_i = \exp(x_i'\hat{\beta})$,
and running the auxiliary OLS regression (without constant)

$$\frac{(y_i - \hat{\mu}_i)^2 - y_i}{\hat{\mu}_i} = \alpha\,\frac{g(\hat{\mu}_i)}{\hat{\mu}_i} + u_i, \tag{20.10}$$

where $u_i$ is an error term. The reported t-statistic for $\alpha$ is asymptotically normal
under the null hypothesis of no overdispersion (Cameron and Trivedi, 1990) even though
here generated regressors are used. This test can also be used for underdispersion,
$\alpha < 0$, in which case the conditional variance is less than the conditional mean. See
also Gurmu and Trivedi (1992).
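
The auxiliary regression (20.10) is simple to compute once fitted means are available. The following is a minimal sketch under stated assumptions: the alternative is $g(\mu) = \mu^2$, the model has an intercept and one regressor, and the data are simulated. The helper names (`fit_poisson`, `overdispersion_t`, `simulate`) and parameter values are hypothetical.

```python
import math
import random

def poisson_draw(rng, lam):
    """Poisson variate via Knuth's multiplication method (adequate for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def fit_poisson(xs, ys, iterations=25):
    """Newton-Raphson for a Poisson regression with intercept and one regressor."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for y, m, x in zip(ys, mu, xs))
        s0, s1 = sum(mu), sum(m * x for m, x in zip(mu, xs))
        s2 = sum(m * x * x for m, x in zip(mu, xs))
        det = s0 * s2 - s1 * s1
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (-s1 * g0 + s0 * g1) / det
    return b0, b1

def overdispersion_t(ys, mu_hat):
    """OLS without constant of ((y - mu)^2 - y)/mu on g(mu)/mu = mu; returns (alpha_hat, t)."""
    z = [((y - m) ** 2 - y) / m for y, m in zip(ys, mu_hat)]
    w = mu_hat                      # regressor g(mu)/mu when g(mu) = mu^2
    sww = sum(v * v for v in w)
    a = sum(v * zi for v, zi in zip(w, z)) / sww
    rss = sum((zi - a * v) ** 2 for v, zi in zip(w, z))
    se = math.sqrt(rss / (len(z) - 1) / sww)
    return a, a / se

def simulate(rng, n, alpha):
    """Counts with mean exp(0.5 + x); alpha = 0 gives Poisson, alpha > 0 gives NB2-type data."""
    xs = [rng.random() for _ in range(n)]
    ys = []
    for x in xs:
        nu = rng.gammavariate(1 / alpha, alpha) if alpha > 0 else 1.0
        ys.append(poisson_draw(rng, math.exp(0.5 + x) * nu))
    return xs, ys

def run_test(seed, alpha):
    xs, ys = simulate(random.Random(seed), 2000, alpha)
    b0, b1 = fit_poisson(xs, ys)
    return overdispersion_t(ys, [math.exp(b0 + b1 * x) for x in xs])

a_nb, t_nb = run_test(seed=1, alpha=1.0)      # overdispersed data
a_pois, t_pois = run_test(seed=2, alpha=0.0)  # equidispersed data
print("overdispersed: alpha_hat=%.3f t=%.2f" % (a_nb, t_nb))
print("Poisson:       alpha_hat=%.3f t=%.2f" % (a_pois, t_pois))
```

For the linear alternative $g(\mu) = \mu$ the regressor in (20.10) becomes a column of ones, so the test reduces to a t-test on the mean of the dependent variable.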


20.3.  Count  Example:  Contacts  with  Medical  Doctor 

For illustration we use some of the data from the RAND Health Insurance Experiment
previously used by Deb and Trivedi (2002). They estimated a more complete set
of  models  and  carried  out  a  deeper  analysis  of  the  data  than  is  possible  or  desirable 
here.  The  experiment,  conducted  by  the  RAND  Corporation  from  1974  to  1982,  has 
been the longest running and largest controlled social experiment in medical care research.
The main goal of the experiment was to assess how the patient’s use of health
services  is  affected  by  types  of  randomly  assigned  health  insurance,  including  both 
fee-for-service  and  health  maintenance  organizations  (HMOs).  In  the  experiment  the 
data  were  collected  from  about  8,000  enrollees  in  2,823  families,  from  six  sites  across 
the  country.  Each  family  was  enrolled  in  one  of  14  different  health  insurance  plans  for 
either  three  or  five  years.  The  plans  ranged  from  free  care  to  95%  coinsurance  below  a 
maximum  dollar  expenditure  (MDE),  and  also  included  assignment  in  a  prepaid  group 
practice. 

The key point is that because insurance plans are randomly assigned, not freely
chosen by the participants, we do not face the problem of endogenous treatment when
estimating the treatment effect, which is the central causal parameter of interest in the
study.

Data were collected on each enrollee’s use of medical care services and health status
throughout the randomly assigned term of enrollment for either three or five years.
For  additional  details  of  the  data  see  Manning  et  al.  (1987),  Newhouse  et  al.  (1993), 
and  Deb  and  Trivedi  (2002).  The  sample  used  in  this  study  consists  of  individuals  in 
the  fee-for-service  plans  only. 

The  data  file  consists  of  utilization,  expenditures,  demographic  characteristics, 
health  status,  and  insurance  status  variables.  The  expenditure  data  were  analyzed  in 
Section  16.6.  The  coinsurance  rate  in  this  sample  assumes  four  different  values.  Yet, 
following  the  RAND  studies,  we  treat  it  as  a  continuous  variable.  The  final  sample 
consists of 20,186 observations; each observation represents data for an experimental
subject in a given year. For simplicity of exposition the resulting clustering in the data,
see Section 24.5, is ignored here.


Table 20.3.  Contacts with Medical Doctor: Frequency Distribution

Contacts              0     1     2     3     4     5     6     7     8     9     10
Relative Frequency    31.2  18.9  13.8  9.3   6.7   4.8   3.4   2.6   2.0   1.4   1.0

Contacts              11    12    13    14    15    16    >21   Max
Relative Frequency    0.9   0.6   0.5   0.4   0.3   0.3   1.0   77

In the present illustration the measure of utilization analyzed is the number of contacts
with a medical doctor (MDU). The relative frequency distribution of MDU, in
percentages, is given in Table 20.3. MDE denotes maximum dollar expenditure, the
medical expenditure liability limit set in the experiment above which the participant
would not be responsible for cost-sharing. Observe that about 31% of the observations
are zeros. The long right tail and variance greatly exceeding the mean indicate that
the counts are (unconditionally) overdispersed.

For the purposes of discussion here we consider the regression to be estimated by
Poisson ML and by Poisson PML. Other specifications are considered later. The included
covariates in all cases are those in Table 20.4.


Table 20.4.  Contacts with Medical Doctor: Variable Descriptions

Variable   Definition                                            Mean     Std. Dev.
MDU        Number of outpatient visits to an MD                  2.861    4.505
LC         ln(coinsurance + 1), 0 < coinsurance < 100            1.710    1.962
IDP        1 if individual deductible plan, 0 otherwise          0.220    0.414
LPI        ln(max(1, annual participation incentive payment))    4.709    2.697
FMDE       0 if IDP = 1;                                         3.153    3.641
           ln(max(1, MDE/(0.01 coinsurance))) otherwise
LINC       ln(family income)                                     8.708    1.228
LFAM       ln(family size)                                       1.248    0.539
AGE        Age in years                                          25.718   16.768
FEMALE     1 if person is female                                 0.517    0.500
CHILD      1 if age is less than 18                              0.402    0.490
FEMCHILD   FEMALE * CHILD                                        0.194    0.395
BLACK      1 if race of household head is black                  0.182    0.383
EDUCDEC    Education of the household head in years              11.967   2.806
PHYSLIM    1 if the person has a physical limitation             0.124    0.322
NDISEASE   Number of chronic diseases                            11.244   6.742
HLTHG      1 if self-rated health is good                        0.362    0.481
HLTHF      1 if self-rated health is fair                        0.077    0.267
HLTHP      1 if self-rated health is poor                        0.015    0.121

Omitted category is excellent self-rated health.


Table 20.5.  Contacts with Medical Doctor: Count Model Estimates

            Poisson               PPML       NB2-PML
Variable    Coeff.     t-ratio    t-ratio    Coeff.     t-ratio
LC          -0.0427    -7.030     -2.835     -0.0504    -3.228
IDP         -0.1613    -13.881    -5.773     -0.1475    -4.889
LPI          0.0128     6.999      2.912      0.0158     3.574
FMDE        -0.0206    -5.803     -2.319     -0.0213    -2.351
PHYSLIM      0.2684     21.711     8.240      0.2751     8.068
NDISEASE     0.0231     38.124     13.487     0.0259     15.324
HLTHG        0.0394     4.109      1.699      0.0065     0.275
HLTHF        0.2531     15.613     5.894      0.2368     5.425
HLTHP        0.5216     19.150     6.966      0.4256     6.205
α            -          -          -          1.1822     8.926
-ln L        60087                            42777


A selection of interesting coefficients and their t-ratios are given in Table 20.5,
along with log-likelihood and information criteria. To save space we do not reproduce
all  the  output.  The  coefficients  of  variables  associated  with  insurance  variables  (LC, 
IDP,  LPI,  and  FMDE)  are  clearly  of  interest  since  they  reflect  the  price  sensitivity 
of  utilization.  Also  of  interest  are  the  coefficients  of  the  five  health  status  variables 
(PHYSLIM,  NDISEASE,  HLTHG,  HLTHF,  and  HLTHP). 

Consider the coefficient of the coinsurance rate, here measured on the log scale,
LC. This variable is of major interest as it provides information about the price effect.
The higher the coinsurance rate, the greater will be the extent of cost sharing by the
patient, and hence the lower will be the average number of visits. The estimated
coefficient from the Poisson regression (see column 1 in Table 20.5) is negative (-0.0427),
with a robust t-ratio of -2.835 (PPML column), indicating that the price effect is
significantly negative as predicted by standard theory. The elasticity of the number of
doctor visits with respect to LC is -0.0427. However, care should be exercised in
interpreting this value as the coinsurance rate only takes a few values and does not vary
continuously. Subject to this qualification, the coefficient can be interpreted as an
elasticity. A similar value for log of income (LINC) is 0.174, indicating that an increase
in income raises the average number of visits.

How well does the Poisson regression fit the data? One simple way to judge this
is to compare the actual and fitted frequencies for different number of doctor visits.
Table 20.6 provides such a comparison for up to nine visits, ignoring the higher
frequencies that collectively account for less than 10% of the visits. To calculate the
fitted value $\Pr[y_i | x_i'\hat{\beta}]$ for $y_i = 0, 1, \ldots, 9$, we plug $\hat{\mu}_i$ into
(20.1) and then average over all the observations. Observe that the Poisson regression
seriously underpredicts the proportion of zero visits and overestimates the proportion of
positive number of visits up to seven. Thus we conclude that the Poisson regression is
deficient. This pattern in the lack of fit can be shown to be associated with the neglect of
overdispersion in the data (Cameron and Trivedi, 1998, chapter 4).
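
The fitted frequencies described above are averages of Poisson probabilities evaluated at each observation's fitted mean. The following is a minimal sketch of that calculation; the fitted means in `mu_hat` are hypothetical illustrative values, not estimates from the RAND data.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability Pr[y = k] with mean mu, computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def average_fitted_probs(mu_hat, max_count):
    """Average Pr[y = k | mu_i] over observations i, for k = 0, ..., max_count."""
    n = len(mu_hat)
    return [sum(poisson_pmf(k, m) for m in mu_hat) / n for k in range(max_count + 1)]

# Hypothetical fitted means exp(x_i' beta_hat) for six observations.
mu_hat = [0.8, 1.5, 2.2, 3.0, 4.1, 6.4]
fitted = average_fitted_probs(mu_hat, 9)

for k, p in enumerate(fitted):
    print(k, round(100 * p, 2))  # fitted relative frequency, in percent
```

Averaging over observations matters: the fitted proportion of zeros is the mean of $\exp(-\hat{\mu}_i)$, which exceeds $\exp(-\bar{\mu})$ whenever the fitted means differ across observations.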

Table 20.6.  Contacts with Medical Doctor: Observed and Fitted Frequencies

Contact frequency     0     1     2     3     4     5     6     7     8     9
Relative frequency    31.2  18.9  13.8  9.3   6.7   4.8   3.4   2.6   2.0   1.4
Poisson fitted        10.6  19.2  20.9  17.6  12.6  7.99  4.69  2.64  1.46  0.8
NB2 fitted            30.9  19.6  13.6  9.67  6.97  5.07  3.70  2.72  2.0   1.47


In the presence of neglected overdispersion it is to be expected that the t-ratios of
the Poisson MLE will be inflated. A comparison with the robust t-ratios in column 3
(PPML) of Table 20.5 shows that this is indeed so. For example, robustification causes
the t-ratio of LC to drop from -7.03 to -2.83. Tables 20.5 and 20.6 include results
for the NB2 model that are discussed in Section 20.7. The NB2 model is a better
parametric model for these data.


20.4.  Parametric  Count  Regression  Models 

Poisson  regression  is  often  too  restrictive.  In  this  section  we  present  a  number  of  more 
flexible  parametric  alternatives  to  the  Poisson. 

First,  overdispersion  in  count  data  may  be  due  to  unobserved  heterogeneity.  In  such 
a  case  counts  are  viewed  as  being  generated  by  a  Poisson  process  (in  which  case  the 
events  are  serially  independent),  but  the  researcher  is  unable  to  correctly  specify  the 
rate  parameter  of  this  process.  Instead,  the  rate  parameter  is  itself  a  random  variable. 
This  mixture  approach,  presented  in  Sections  20.4.1  and  20.4.2,  leads  to  the  widely 
used  negative  binomial  model. 

Second,  overdispersion,  and  in  some  cases  underdispersion,  may  arise  because  the 
process generating the first event may differ from that determining later events. For
example, an initial doctor consultation may be solely a patient’s choice, whereas
subsequent visits are also determined by the doctor. This leads to the modified count models
presented  in  Section  20.4.5. 

Third, overdispersion in count data may be due to failure of the assumption of
independence of events, which is implicit in the Poisson process. One can postulate
dependence so that, for example, the occurrence of one doctor visit makes subsequent
doctor visits more likely. (This approach has not been widely used in count
data analysis. In duration data analysis this is called true state dependence.) Particular
assumptions about unobserved heterogeneity or dependence again lead to the negative
binomial; see Winkelmann (1995). A discrete choice model that progressively models
$\Pr[y = j | y > j - 1]$ is presented in Section 20.4.6.

Fourth,  one  can  refer  to  the  extensive  and  rich  literature  on  univariate  iid  count 
distributions,  such  as  the  logarithmic  series  and  hypergeometric  distribution  (Johnson, 
Kotz,  and  Kemp,  1992).  New  regression  models  can  be  developed  by  letting  one  or 
more  distribution  parameters  be  a  specified  function  of  regressors.  Such  models  are 
not  presented  here.  The  approach  has  less  motivation  than  the  first  three  approaches 
and  the  resulting  models  may  not  be  any  better. 

Although overdispersion has been emphasized, underdispersion may also arise. For
example, a sample in which the counted outcome is largely 0 or 1, with a very small
number of 2s, and hence close to a binomial model, will show underdispersion. Members
of the Katz family of distributions, or other distributions based on the series
expansion methods such as those developed in Cameron and Johansson (1997), can be
used; see also Cameron and Trivedi (1998, chapter 12).


20.4.1.  Negative  Binomial  Model 

The negative binomial model, a specific example of a continuous mixture model, can
be obtained in many different ways. The following justification using a mixture
distribution is one of the oldest and has wide appeal.

Suppose the distribution of a random count $y$ is Poisson, conditional on the parameter
$\lambda$, so that $f(y|\lambda) = \exp(-\lambda)\lambda^{y}/y!$. Suppose now that the
parameter $\lambda$ is random, rather than being a completely deterministic function of
regressors $x$. In particular, let $\lambda = \mu\nu$, where $\mu$ is a deterministic function
of $x$, for example $\exp(x'\beta)$, and $\nu > 0$ is iid with density $g(\nu|\alpha)$. This is
an example of unobserved heterogeneity, as different observations may have different
$\lambda$ (heterogeneity) but part of this difference is due to a random (unobserved)
component $\nu$. Note that $E[\lambda|\mu] = \mu$ if $E[\nu] = 1$, so the interpretation of
the slope parameters stays as in the Poisson model.

The marginal density of $y$, unconditional on the random parameter $\nu$ but conditional
on the deterministic parameters $\mu$ and $\alpha$, is obtained by integrating out $\nu$.
This yields

$$h(y|\mu,\alpha) = \int f(y|\mu,\nu)\, g(\nu|\alpha)\, d\nu, \tag{20.11}$$

where $g(\nu|\alpha)$ is called the mixing distribution and $\alpha$ denotes the unknown
parameter of the mixing distribution. The integration defines an “average” distribution.
For some specific choices of $f(\cdot)$ and $g(\cdot)$, the integral will have an explicit or
closed-form solution.

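If $f(y|\lambda)$ is the Poisson density and
$g(\nu) = \nu^{\delta-1}e^{-\nu\delta}\delta^{\delta}/\Gamma(\delta)$, $\nu, \delta > 0$, is the
gamma density with $E[\nu] = 1$ and $V[\nu] = 1/\delta$, we obtain the negative binomial as a
mixture density as follows:

$$
\begin{aligned}
h[y|\mu,\delta]
 &= \int_0^{\infty} \frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!}\,
    \frac{\nu^{\delta-1}e^{-\nu\delta}\delta^{\delta}}{\Gamma(\delta)}\, d\nu \\
 &= \int_0^{\infty} \frac{e^{-(\mu+\delta)\nu}\,\mu^{y}\,\nu^{y+\delta-1}\,\delta^{\delta}}
    {y!\,\Gamma(\delta)}\, d\nu \\
 &= \frac{\mu^{y}\delta^{\delta}}{\Gamma(\delta)\,y!}
    \int_0^{\infty} e^{-(\mu+\delta)\nu}\,\nu^{y+\delta-1}\, d\nu \\
 &= \frac{\mu^{y}\delta^{\delta}\,\Gamma(y+\delta)}
    {\Gamma(\delta)\,y!\,(\mu+\delta)^{y+\delta}} \\
 &= \frac{\Gamma(\alpha^{-1}+y)}{\Gamma(\alpha^{-1})\,\Gamma(y+1)}
    \left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}
    \left(\frac{\mu}{\mu+\alpha^{-1}}\right)^{y},
\end{aligned}
\tag{20.12}
$$
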
where $\alpha = 1/\delta$, $\Gamma(\cdot)$ denotes the gamma integral, which specializes to a
factorial for an integer argument, and the fourth line follows after some algebra and use
of the definition of the gamma function. Special cases of the negative binomial include the
Poisson ($\alpha = 0$), which is the advantage of the reparameterization from $\delta$ to
$\alpha$, and the geometric ($\alpha = 1$).
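
The closed form in (20.12) can be checked numerically against the defining mixture integral. The following is a minimal sketch that assumes illustrative values $\mu = 2.5$ and $\alpha = 0.5$ and uses a simple midpoint rule; the function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Poisson density f(y | lambda)."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def gamma_density(v, delta):
    """Gamma mixing density with E[v] = 1 and V[v] = 1/delta."""
    return math.exp((delta - 1) * math.log(v) - delta * v
                    + delta * math.log(delta) - math.lgamma(delta))

def nb2_pmf(y, mu, alpha):
    """Closed-form negative binomial density: last line of (20.12)."""
    r = 1.0 / alpha
    return math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                    + r * math.log(r / (r + mu)) + y * math.log(mu / (mu + r)))

def mixture_pmf(y, mu, alpha, upper=40.0, steps=40000):
    """Midpoint-rule approximation to the integral in the first line of (20.12)."""
    delta = 1.0 / alpha
    h = upper / steps
    total = 0.0
    for s in range(steps):
        v = (s + 0.5) * h
        total += poisson_pmf(y, mu * v) * gamma_density(v, delta)
    return total * h

mu, alpha = 2.5, 0.5
max_gap = max(abs(mixture_pmf(y, mu, alpha) - nb2_pmf(y, mu, alpha)) for y in range(6))
print("largest discrepancy for y = 0..5:", max_gap)
```

The truncation point `upper` and the number of grid points are arbitrary choices that suit these parameter values; for $\delta < 1$ the gamma density is unbounded at zero and a more careful quadrature rule would be needed.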

As in the case of many mixture distributions, the negative binomial also has independent
justification; see Cameron and Trivedi (1998, chapter 4). It can arise in many
different  ways  and  one  need  not  always  think  of  it  as  a  mixture  distribution. 

The algebraic derivation of the negative binomial as a Poisson-gamma mixture
can be given a Bayesian interpretation. The prior distribution of $\mu$ is gamma, given
$\alpha$. Given the results on conjugate priors for exponential families in Section 13.2.4,
it is expected that the posterior distribution has a closed form. Therefore, the MLE and the
Bayesian posterior mean, under the further assumption of a vague prior on $\alpha$, would
coincide.

The first two moments of the negative binomial distribution are

$$
\begin{aligned}
E[y|\mu,\alpha] &= \mu, \\
V[y|\mu,\alpha] &= \mu(1+\alpha\mu).
\end{aligned}
\tag{20.13}
$$

The variance therefore exceeds the mean, since $\alpha > 0$ and $\mu > 0$. Indeed, it can be
shown easily that overdispersion always arises if $y|\lambda$ is Poisson and unobserved
heterogeneity is of the multiplicative form $\lambda = \mu\nu$, where $E[\nu] = 1$. Note also
that the overdispersion is of the form (20.9) discussed in Section 20.2.4.
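
The claim that multiplicative heterogeneity always produces overdispersion follows from the laws of iterated expectations and total variance. A short sketch of the argument, writing $\sigma_\nu^2 = V[\nu]$ (a symbol introduced here for illustration):

```latex
% Mean: iterate expectations over lambda = mu * nu with E[nu] = 1.
E[y|\mu] = E_{\nu}\bigl[E[y|\lambda]\bigr] = E_{\nu}[\mu\nu] = \mu .

% Variance: total variance decomposition, using V[y|lambda] = lambda for the Poisson.
V[y|\mu] = E_{\nu}\bigl[V[y|\lambda]\bigr] + V_{\nu}\bigl[E[y|\lambda]\bigr]
         = E_{\nu}[\mu\nu] + V_{\nu}[\mu\nu]
         = \mu + \mu^{2}\sigma_{\nu}^{2} \;>\; \mu
         \quad \text{whenever } \sigma_{\nu}^{2} > 0 .

% Gamma mixing with V[nu] = 1/delta = alpha reproduces (20.13):
V[y|\mu,\alpha] = \mu + \alpha\mu^{2} = \mu(1+\alpha\mu) .
```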

Two standard variants of the negative binomial are used in regression applications.
Both variants specify $\mu_i = \exp(x_i'\beta)$. The most common variant lets $\alpha$ be a
parameter to be estimated, in which case the conditional variance function,
$\mu + \alpha\mu^{2}$ from (20.13), is quadratic in the mean.

The other variant of the negative binomial model has a linear variance function,
$V[y|\mu,\gamma] = (1+\gamma)\mu$, obtained by replacing $\alpha$ by $\gamma/\mu$ throughout
(20.12). Sometimes this variant is called negative binomial 1 (NB1) in contrast to the
variant with a quadratic variance function which has been called the negative binomial 2
(NB2) model (Cameron and Trivedi, 1998). The log-likelihood is easily obtained from
(20.12). Both variants of the model are easily estimated by ML, with details given in, for
example, Cameron and Trivedi (1998). In both variants the coefficients have the same
interpretation since $E[y|x] = \exp(x'\beta)$.
The NB2 variant is the most often used, as in the application in Section 20.7.

The  NB2  model  has  been  found  to  be  very  useful  in  applied  work.  It  appears  to 
have  the  flexibility  necessary  for  providing  a  good  fit  to  many  types  of  count  data.  It 
does  so  in  part  because  the  quadratic  variance  specification  is  a  good  approximation 
in  many  empirical  situations.  An  unfortunate  consequence  of  the  fact  that  NB2  often 
provides  a  good  fit  is  that  if  the  Poisson  assumption  fails  there  is  a  tendency  to  jump 
to  the  negative  binomial  alternative,  ignoring  other  possibilities.  Such  a  mechanical 
approach should be avoided because poor performance of the Poisson can also be due
to a poor specification of the conditional mean function, whereas the
negative binomial model maintains the same conditional mean.

The negative binomial model is less robust to distributional misspecification than
the Poisson. Even if the conditional mean is correctly specified, the MLE in negative
binomial models is inconsistent when the distribution is misspecified, except for the
special case of the NB2 model, where the MLE for $\beta$ (but not $\alpha$) is still consistent.

For mixture models for counts, the Poisson is the natural choice for the initial density
$f(y|\mu,\nu)$ in (20.12) since a Poisson process is a natural model for counts. The
choice of the gamma for the mixing distribution $g(\nu)$ in (20.12) is more arbitrary. Its
use raises issues discussed in Sections 18.2-18.4. Other possible choices include the
lognormal  distribution  and  the  inverse-Gaussian  distribution.  See  Willmot  (1987)  and 
Guo  and  Trivedi  (2002).  In  these  cases  the  marginal  distribution  cannot  be  expressed 
in  a  closed  form,  as  it  is  the  gamma  that  is  conjugate  to  the  Poisson.  Of  course,  this 
does  not  mean  that  the  resulting  model  cannot  be  estimated  by  maximum  likelihood.  It 
means  simply  that  one  may  have  to  use  numerical  quadrature  or  simulated  maximum 
likelihood  to  estimate  the  model.  These  methods  are  entirely  feasible  with  currently 
available  computing  power.  If  one  is  prepared  to  use  the  simulation-based  estimation 
methods  discussed  in  chapter  12,  the  scope  for  using  mixed-Poisson  models  of  various 
types  becomes  very  extensive. 


20.4.2.  Simulated  Maximum  Likelihood 

Purely for purposes of illustration we now show how we might estimate the NB2
model by maximum simulated likelihood. The reader should understand that in practice
this is unnecessary because we already have an analytical expression for that
model. Suppose we pretend that we do not and tackle estimation by simulation.

Note that $h(y|\alpha,\mu)$ in (20.12) can be approximated by

$$\frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\mu\nu_s}(\mu\nu_s)^{y}}{y!},$$

where $\nu_s$ $(s = 1, \ldots, S)$ are pseudo-random draws from the distribution
$g(\nu|\alpha)$, and $S$ is the number of simulation replications used. Drawing from a gamma
distribution with mean 1 and variance $\alpha$ is straightforward. One draws from a uniform
distribution and then applies a transformation to it. Let $u_s$ denote the uniform random
variables and let $\nu_s = -\ln u_s/\alpha$, and then define the simulator

$$\tilde{f}(y|u_s,\alpha,\mu) = \frac{\exp(-\mu(-\ln u_s/\alpha))\,(\mu(-\ln u_s/\alpha))^{y}}{y!}.$$

Then the MSL estimator $\hat{\theta}_{MSL}$ maximizes

$$\hat{Q}_N(\theta) = \sum_{i=1}^{N} \ln\left(\frac{1}{S}\sum_{s=1}^{S} \tilde{f}(y_i|u_s,\alpha,\mu_i)\right), \tag{20.14}$$

where $\mu_i = \exp(x_i'\beta)$ and $\theta = (\alpha, \beta)$.

Of course, this method is computer intensive but otherwise straightforward. A fuller
discussion of the properties of MSL was given earlier in Section 12.4. Here we just
remind the reader that when $S, N \to \infty$ and $\sqrt{N}/S \to 0$, then
$\hat{\theta}_{MSL}$ and $\hat{\theta}_{ML}$ are asymptotically equivalent.
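
The simulated likelihood can be sketched in a few lines and compared with the closed-form NB2 log-likelihood. This is a minimal sketch under stated assumptions rather than the procedure described in the text. The draws $\nu_s$ come from the standard library's `random.gammavariate` with shape $1/\alpha$ and scale $\alpha$ (mean 1, variance $\alpha$), because the inverse transform $-\ln u_s/\alpha$ of a uniform variate yields an exponential variate with mean $1/\alpha$, which matches the required gamma only when $\alpha = 1$. The data, parameter values, and function names are hypothetical.

```python
import math
import random

def poisson_pmf(y, lam):
    """Poisson density f(y | lambda)."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def nb2_pmf(y, mu, alpha):
    """Closed-form NB2 density, used here only as a benchmark."""
    r = 1.0 / alpha
    return math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                    + r * math.log(r / (r + mu)) + y * math.log(mu / (mu + r)))

def gamma_draws(alpha, n_draws, seed):
    """Draws with mean 1 and variance alpha; a fixed seed keeps them common across calls."""
    rng = random.Random(seed)
    return [rng.gammavariate(1.0 / alpha, alpha) for _ in range(n_draws)]

def simulated_pmf(y, mu, draws):
    """(1/S) sum_s f(y | mu * nu_s): the simulator averaged over S draws."""
    return sum(poisson_pmf(y, mu * v) for v in draws) / len(draws)

def simulated_loglik(alpha, b0, b1, xs, ys, n_draws=20000, seed=7):
    """Simulated log-likelihood with mu_i = exp(b0 + b1 * x_i)."""
    draws = gamma_draws(alpha, n_draws, seed)
    return sum(math.log(simulated_pmf(y, math.exp(b0 + b1 * x), draws))
               for x, y in zip(xs, ys))

def exact_loglik(alpha, b0, b1, xs, ys):
    """Exact NB2 log-likelihood for comparison."""
    return sum(math.log(nb2_pmf(y, math.exp(b0 + b1 * x), alpha)) for x, y in zip(xs, ys))

# Hypothetical data and parameter values.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
ys = [0, 2, 1, 4, 3, 7]
sim_ll = simulated_loglik(0.5, 0.2, 0.8, xs, ys)
exact_ll = exact_loglik(0.5, 0.2, 0.8, xs, ys)
print("simulated:", sim_ll, "exact:", exact_ll)
```

In an actual MSL routine the function `simulated_loglik` would be passed to a numerical optimizer over $(\alpha, \beta)$. Reusing the same seed at every evaluation keeps the objective smooth in the parameters, which is the usual reason for holding the underlying random numbers fixed.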


20.4.3.  Finite  Mixture  Models 

The  mixture  model  in  the  previous  section  was  a  continuous  mixture  model,  because 
the mixing random variable $\nu$ was assumed to have a continuous distribution. An
alternative approach instead uses a discrete representation of unobserved heterogeneity,
which  generates  a  class  of  models  called  finite  mixture  models;  see  Section  18.5. 
This  class  of  models  is  a  particular  subclass  of  latent  class  models.  Some  variants  and 
special  cases  of  this  model  are  also  known  as  discrete  factor  models. 

In  empirical  work  the  more  commonly  used  alternative  to  the  continuous  mixture  is 
found  in  the  class  of  modified  count  models  discussed  in  the  next  section.  However,  it 
is  more  natural  to  follow  up  the  preceding  section  with  a  discussion  of  finite  mixtures. 
Further,  the  subclass  of  modified  count  models  can  be  viewed  as  a  special  case  of  finite 
mixtures. 

We suppose that the density of $y$ is a linear combination of $m$ different densities,
where the $j$th density is $f_j(y|\theta_j)$, $j = 1, 2, \ldots, m$. Thus an $m$-component
finite mixture is

$$f(y|\theta,\pi) = \sum_{j=1}^{m} \pi_j f_j(y|\theta_j), \qquad 0 < \pi_j < 1, \qquad \sum_{j=1}^{m} \pi_j = 1. \tag{20.15}$$

In  the  given  formulation  the  components  of  the  mixture  are  assumed,  for  generality, 
to  differ  in  all  their  parameters.  More  restrictive  formulations  assume  that  only  some 
parameters differ across the components (e.g., the intercepts) and the remaining
parameters are all common to the mixture components. Assumptions at some intermediate
level  of  generality  may  also  be  made. 

For further insight consider this approach for the $m = 2$ case. Suppose that the
sampled population contains two “types” of cases, whose $y$-outcomes are characterized
by distributions $f_1(y|\theta_1)$ and $f_2(y|\theta_2)$, which we assume have different
moments. Suppose type-1 subpopulation has mean $\mu(\theta_1)$, and type-2 subpopulation
has mean $\mu(\theta_2)$, where $\mu(\theta_2) < \mu(\theta_1)$. For example, in a study of the
use of medical services, the first subpopulation corresponds to frequent users of the
service and the second to relatively infrequent users. Assume that the fractions of the two
types in the populations are $\pi_1$ and $\pi_2 (= 1 - \pi_1)$, respectively. Then a random
sample drawn from the population will contain proportions $\pi_1$ and $\pi_2$ of the two
types, although one cannot observe which case belongs to which subpopulation. That is,
the “types” are latent classes.

The goal of the researcher who uses this model is to estimate the unknown parameters
$\theta_j$, $j = 1, \ldots, m$. It is easy to develop regression models based on (20.15). For
example, if NB2 models are used then $f_j(y|\theta_j)$ is the NB2 density (20.12) with
parameters $\mu_j = \exp(x'\beta_j)$ and $\alpha_j$, so $\theta_j = (\beta_j, \alpha_j)$. If the
number of components, $m$, is given, then under some regularity conditions maximum
likelihood estimation of the parameters $(\pi_j, \theta_j)$, $j = 1, \ldots, m$, is possible.
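
The mixture density (20.15) and the latent-class interpretation can be illustrated with a small example. The following is a minimal sketch that, for brevity, assumes two Poisson components without regressors rather than the NB2 components mentioned above; the mixing proportions and component means are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Component density f_j(y | theta_j), here Poisson with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def mixture_pmf(y, pis, mus):
    """Finite mixture density (20.15): sum_j pi_j f_j(y | theta_j)."""
    return sum(p * poisson_pmf(y, m) for p, m in zip(pis, mus))

def posterior_class_probs(y, pis, mus):
    """Probability that an observation with count y belongs to each latent class."""
    joint = [p * poisson_pmf(y, m) for p, m in zip(pis, mus)]
    total = sum(joint)
    return [j / total for j in joint]

def mixture_loglik(ys, pis, mus):
    """Log-likelihood of a sample under the mixture; ML maximizes this over (pi, theta)."""
    return sum(math.log(mixture_pmf(y, pis, mus)) for y in ys)

# Hypothetical two-class population: frequent users (class 1) and infrequent users (class 2).
pis = [0.3, 0.7]
mus = [6.0, 1.0]

support = range(0, 101)
mix_mean = sum(y * mixture_pmf(y, pis, mus) for y in support)
mix_var = sum((y - mix_mean) ** 2 * mixture_pmf(y, pis, mus) for y in support)
print("mean:", mix_mean, "variance:", mix_var)
print("Pr[class | y = 0]: ", posterior_class_probs(0, pis, mus))
print("Pr[class | y = 10]:", posterior_class_probs(10, pis, mus))
print("log-likelihood of a small sample:", mixture_loglik([0, 1, 0, 7, 2, 5], pis, mus))
```

Even though each component is equidispersed, the mixture has variance above its mean, so a finite mixture is another route to overdispersion. The posterior class probabilities show how an observed count is informative about the latent type.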
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The  pros  and  cons  of  the  finite  mixture  representation  have  also  been  given  earlier 
and  will  only  be  briefly  mentioned  here.  Further  discussion  in  the  context  of  dura¬ 
tion  models  is  in  Section  18.5.  First,  a  finite  mixture  is  a  flexible  and  parsimonious 
method  of  modeling  the  data.  Each  mixture  component  provides  a  local  approxima¬ 
tion  to  some  part  of  the  true  distribution.  Second,  the  finite  mixture  approach  is  in  a 
sense  semiparametric  because  it  does  not  require  any  distributional  assumptions  for 
the  mixing  variable.  Finally,  in  many  cases  the  results  are  easy  to  interpret.  The  finite 
mixture  representation  is  attractive  if  the  investigator  is  especially  interested  in  the 
behavior  of  a  subpopulation  from  the  viewpoint  of  public  policy.  If  latent  classes  are 
ignored,  so  m  =  1 ,  then  the  estimated  parameters  will  be  weighted  sums  of  the  latent 
class  parameters. 
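
The finite mixture density discussed above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the estimation procedure of the studies cited: Poisson components stand in for the NB2 densities to keep the code short, covariates are omitted, and the function names and parameter values are hypothetical.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def mixture_pmf(y, pis, mus):
    """Finite mixture density: sum over j of pi_j * f_j(y | mu_j)."""
    return sum(p * poisson_pmf(y, m) for p, m in zip(pis, mus))


def mixture_loglik(ys, pis, mus):
    """Log-likelihood of an independent sample under the mixture."""
    return sum(math.log(mixture_pmf(y, pis, mus)) for y in ys)


# Hypothetical two-class example: 30% frequent users (mean 6)
# and 70% infrequent users (mean 1).
pis, mus = [0.3, 0.7], [6.0, 1.0]
total = sum(mixture_pmf(y, pis, mus) for y in range(200))
mix_mean = sum(y * mixture_pmf(y, pis, mus) for y in range(200))
```

In practice the log-likelihood would be maximized over the mixing proportions and component parameters for a given number of components, for example with a numerical optimizer.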

There  are  several  potential  difficulties  also.  First,  we  may  have  very  little  theoretical 
guidance  on  specifying  the  number  of  components,  and  we  may  not  be  able  to  reliably 
distinguish  among  some  of  the  components  if  they  are  not  sufficiently  different.  The 
usual  practice  is  to  start  with  a  few  components  and  then  add  additional  components 
if  the  fit  of  the  model  is  significantly  improved  by  doing  so.  In  some  cases  only  the 
intercepts  may  be  allowed  to  differ  and  the  slopes  may  be  constrained  to  equality  across 
components.  Caution  is  necessary  in  this  process  because  the  sampling  properties  of 
the  maximum  likelihood  estimator  are  not  fully  understood  for  the  case  in  which  m  is 
unknown. 

There  are  several  studies  that  indicate  that  finite  mixture  models  work  quite  well  for 
count  data  models  of  medical  care  (Deb  and  Trivedi,  1997;  2002).  One  possible  reason 
for  this  is  that  the  population  might  be  split  by  the  latent  health  status  of  individuals. 
Those  who  are  healthy,  perhaps  the  majority,  might  account  for  low  average  demand, 
whereas those who are ill may account for high average demand. When health status is imperfectly observed, the finite mixture model may do a good job of separating the subpopulations.


20.4.4.  Truncation  and  Censoring 

In  some  studies,  inclusion  in  the  sample  requires  that  sampled  individuals  have  been 
engaged  in  the  activity  of  interest.  Then  the  count  data  are  truncated,  as  the  data  are 
observed  only  over  part  of  the  range  of  the  response  variable.  Examples  of  truncated 
counts  include  the  number  of  bus  trips  made  per  week  in  surveys  taken  on  buses, 
the  number  of  shopping  trips  made  by  individuals  sampled  at  a  mall,  and  the  number 
of  unemployment  spells  among  a  pool  of  unemployed.  In  all  these  cases  we  do  not 
observe  zero  counts,  so  the  data  are  said  to  be  zero-truncated,  or  more  generally 
left-truncated.  Right-truncation  results  from  loss  of  observations  greater  than  some 
specified  value. 

A  general  treatment  of  truncated  and  censored  models,  using  ML  estimation,  is 
given  in  Section  16.2.  Here  we  specialize  to  count  data. 

Truncation leads to inconsistent parameter estimates unless the likelihood function is suitably modified. Consider the case of zero truncation. Let $f(y|\theta)$ denote the density function and $F(y|\theta) = \Pr[Y \le y]$ denote the cumulative distribution function of the discrete random variable, where $\theta$ is a parameter vector. If realizations of $y$ less than the positive integer 1 are omitted, the ensuing zero-truncated density is given by

$$f(y|\theta, y \ge 1) = \frac{f(y|\theta)}{1 - F(0|\theta)}, \quad y = 1, 2, \ldots . \tag{20.16}$$

This specializes in the zero-truncated Poisson case, for example, to $f(y|\mu, y \ge 1) = e^{-\mu}\mu^{y} / [y!(1 - \exp(-\mu))]$. It is straightforward to construct a log-likelihood based on this density and to obtain maximum likelihood estimates.
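
A minimal Python sketch of the zero-truncated Poisson density and its log-likelihood follows. It assumes a single mean parameter with no covariates; the function names and the value of `mu` are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def zt_poisson_pmf(y, mu):
    """Zero-truncated Poisson: f(y | mu) / (1 - F(0 | mu)) for y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, mu) / (1.0 - math.exp(-mu))


def zt_poisson_loglik(ys, mu):
    """Log-likelihood of a sample of strictly positive counts."""
    return sum(math.log(zt_poisson_pmf(y, mu)) for y in ys)


mu = 1.5
total = sum(zt_poisson_pmf(y, mu) for y in range(1, 200))
trunc_mean = sum(y * zt_poisson_pmf(y, mu) for y in range(1, 200))
expected_mean = mu / (1.0 - math.exp(-mu))  # mean of the truncated density
```

The truncated mean exceeds `mu`, which is one way to see why ignoring truncation distorts the estimates.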

Censored  counts  most  commonly  arise  from  aggregation  of  counts  greater  than 
some  value.  This  is  often  done  in  survey  design  when  the  total  probability  mass  over 
the  aggregated  values  is  relatively  small.  An  important  difference  between  truncation 
and  censoring  is  that  in  the  case  of  the  latter,  covariates  corresponding  to  the  cen¬ 
sored  counts  are  observed;  in  the  truncation  case  neither  the  counted  outcomes  nor 
the  covariates  are  observed.  Censoring,  like  truncation,  leads  to  inconsistent  parameter 
estimates  if  the  uncensored  likelihood  is  mistakenly  used.  See  also  Section  16.2. 

For example, counts equal to or greater than some known value $c$ might be aggregated into a single category. Then some values of $y$ are incompletely observed; the precise value is unknown but it is known to equal or exceed $c$. The observed data have density

$$g(y|\theta) = \begin{cases} f(y|\theta) & \text{if } y < c, \\ 1 - F(c - 1|\theta) & \text{if } y \ge c, \end{cases} \tag{20.17}$$


where  c  is  known. 
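
The censored density (20.17) can be sketched in Python as follows. The sketch assumes a Poisson count with no covariates and a hypothetical censoring point `c = 4`; observations at or above `c` are all treated as "c or more".

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def poisson_cdf(y, mu):
    """Pr[Y <= y] for the Poisson distribution."""
    return sum(poisson_pmf(k, mu) for k in range(y + 1))


def censored_pmf(y, mu, c):
    """Exact probability below c; pooled tail probability at or above c."""
    if y < c:
        return poisson_pmf(y, mu)
    return 1.0 - poisson_cdf(c - 1, mu)


def censored_loglik(ys, mu, c):
    """Log-likelihood for counts recorded as min(y, c)."""
    return sum(math.log(censored_pmf(y, mu, c)) for y in ys)


mu, c = 2.0, 4
total = sum(censored_pmf(y, mu, c) for y in range(c + 1))
```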

A related complication is that of sample selection (Terza, 1998). Then the count $y$ is observed only when another random variable, potentially correlated with $y$, crosses a threshold. For example, to see a medical specialist one may first need to see a general practitioner.


20.4.5.  Modified  Count  Models 

The  leading  motivation  for  the  modified  count  models  of  this  section  is  to  solve  the  so- 
called  problem  of  excess  zeros,  the  presence  of  more  zeros  in  the  data  than  predicted 
by  count  models  such  as  the  Poisson,  or  even  NB2. 


Hurdle  or  Two-Part  Models 


The hurdle model or two-part model (see Section 16.4) relaxes the assumption that the zeros and the positives come from the same data-generating process. The zeros are determined by the density $f_1(\cdot)$, so that $\Pr[y = 0] = f_1(0)$. The positive counts come from the truncated density $f_2(y|y > 0) = f_2(y)/(1 - f_2(0))$, which is multiplied by $\Pr[y > 0] = 1 - f_1(0)$ to ensure that probabilities sum to unity. Thus

$$g(y) = \begin{cases} f_1(0) & \text{if } y = 0, \\ \dfrac{1 - f_1(0)}{1 - f_2(0)}\, f_2(y) & \text{if } y \ge 1. \end{cases} \tag{20.18}$$


This reduces to the standard model only if $f_1(\cdot) = f_2(\cdot)$. Thus in the modified model the two processes generating the zeros and the positives are not constrained to be the same. Although the motivation for this model is to handle excess zeros, it is also capable of modeling too few zeros.

Maximum  likelihood  estimation  of  the  hurdle  model  involves  separate  maximiza¬ 
tion  of  the  two  terms  in  the  likelihood,  one  corresponding  to  the  zeros  and  the  other  to 
the  positives.  This  is  straightforward. 

A  hurdle  model  has  the  interpretation  that  it  reflects  a  two-stage  decision-making 
process.  For  example,  a  patient  may  initiate  the  first  visit  to  a  doctor,  but  the  second 
and  subsequent  visits  may  be  determined  by  a  different  mechanism  (Pohlmeier  and 
Ulrich,  1995). 

Regression applications use hurdle versions of the Poisson or negative binomial, obtained by specifying $f_1(\cdot)$ and $f_2(\cdot)$ to be the Poisson or negative binomial densities given earlier. In application the covariates in the hurdle part that models the zero/one outcome need not be the same as those that appear in the truncated part, although in practice they are often the same. The hurdle model is widely used, and the hurdle negative binomial model is quite flexible. Drawbacks are that the model is not very parsimonious, since the number of parameters is typically doubled, and that parameter interpretation is not as easy as in the same model without a hurdle.

The  choice  of  the  distribution  in  the  hurdle  specification  is  important.  Using  a  more 
flexible  distribution  gives  the  negative  binomial  obvious  advantages  over  the  Poisson. 
The  conditional  mean  in  the  hurdle  model  is  the  product  of  the  probability  of  positives 
and  the  conditional  mean  of  the  zero-truncated  density.  Therefore,  using  a  Poisson  re¬ 
gression  when  the  hurdle  model  is  the  correct  specification  implies  a  misspecification, 
which  will  lead  to  inconsistent  estimates.  Because  of  the  form  of  the  conditional  mean 
specification,  the  calculation  of  marginal  effects  is  more  complicated,  with  similarities 
to  the  two-part  model  used  in  Section  16.4. 
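
The hurdle density (20.18) and the conditional mean described above can be sketched in Python. The sketch assumes that both parts are Poisson, with hypothetical means `mu1` (governing the zeros) and `mu2` (governing the positives), and it omits covariates.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def hurdle_pmf(y, mu1, mu2):
    """Hurdle density: f1 = Poisson(mu1) sets Pr[y = 0]; f2 = Poisson(mu2),
    truncated at zero, governs the positive counts."""
    f1_zero = math.exp(-mu1)
    if y == 0:
        return f1_zero
    f2_zero = math.exp(-mu2)
    return (1.0 - f1_zero) / (1.0 - f2_zero) * poisson_pmf(y, mu2)


def hurdle_mean(mu1, mu2):
    """Pr[y > 0] times the mean of the zero-truncated f2."""
    return (1.0 - math.exp(-mu1)) * mu2 / (1.0 - math.exp(-mu2))


mu1, mu2 = 0.4, 2.5  # more zeros than a single Poisson(2.5) would give
total = sum(hurdle_pmf(y, mu1, mu2) for y in range(200))
numeric_mean = sum(y * hurdle_pmf(y, mu1, mu2) for y in range(200))
```

Setting `mu1 == mu2` recovers the ordinary Poisson probabilities, in line with the remark that the model reduces to the standard one when the two densities coincide.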


With-Zeros  or  Zero-Inflated  Model 

A second modified count model is the with-zeros model or zero-inflated model. This supplements a count density $f_2(\cdot)$ with a binary process with density $f_1(\cdot)$. If the binary process takes value 0, with probability $f_1(0)$, then $y = 0$. If the binary process takes value 1, with probability $f_1(1)$, then $y$ takes count values $0, 1, 2, \ldots$ from the count density $f_2(\cdot)$. This lets zero counts occur in two ways: as a realization of the binary process and as a realization of the count process when the binary random variable takes value 1. The density is

$$g(y) = \begin{cases} f_1(0) + (1 - f_1(0))\, f_2(0) & \text{if } y = 0, \\ (1 - f_1(0))\, f_2(y) & \text{if } y \ge 1. \end{cases} \tag{20.19}$$

Regression models let $f_1(\cdot)$ be a logit model and $f_2(\cdot)$ be a Poisson or negative binomial density. This model is used much less than the hurdle model. It is capable of modeling too few zeros.
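
A minimal Python sketch of the zero-inflated density (20.19) follows. It assumes a Poisson count density and treats the probability that the binary process equals 0 as a fixed number `p_zero`; in a regression setting that probability would come from a logit model. All names and values are hypothetical.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def zero_inflated_pmf(y, p_zero, mu):
    """Zero-inflated Poisson: p_zero is the binary-process probability of
    a structural zero; the count density is Poisson(mu)."""
    from_count = (1.0 - p_zero) * poisson_pmf(y, mu)
    return p_zero + from_count if y == 0 else from_count


p_zero, mu = 0.3, 2.0
total = sum(zero_inflated_pmf(y, p_zero, mu) for y in range(200))
```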

The zero-inflated count model is used less frequently in econometrics than in other statistical disciplines.


20.4.6. Discrete Choice Models

Count  data  can  be  modeled  by  discrete  choice  model  methods,  possibly  after  some 
grouping  of  counts  to  limit  the  number  of  categories.  For  example,  the  categories  may 
be  0,  1 ,  2,  3,  and  4  or  more  if  few  observations  exceed  four.  Unordered  models  such 
as  multinomial  logit,  discussed  in  Section  15.4,  are  not  parsimonious  and  more  impor¬ 
tantly  are  inappropriate.  Instead,  a  sequential  model  that  recognizes  the  ordering  of  the 
data  should  be  used. 

One such model is an ordered model. This defines an unobserved latent variable, $y^* = x'\beta + u$, with values of $y = 0, 1, 2, \ldots$ being observed as $y^*$ crosses progressively higher thresholds, which are also parameters to be estimated. An ordered logit (or probit) model arises when $u$ is logistic (or standard normal) distributed. Ordered models (see Section 15.9) are particularly useful when the count can also take negative values as may occur when modeling a net change, such as the net change in the number of firms in an industry.
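
The ordered logit probabilities for grouped counts can be sketched in Python. The sketch assumes the categories 0, 1, 2, 3, and 4 or more mentioned above, a single index value `xb` standing for $x'\beta$, and hypothetical threshold values; none of these numbers come from the text.

```python
import math


def logistic_cdf(v):
    """Cumulative distribution function of the standard logistic."""
    return 1.0 / (1.0 + math.exp(-v))


def ordered_logit_probs(xb, cutoffs):
    """Category probabilities when y* = xb + u and u is logistic.

    cutoffs must be increasing; the result has len(cutoffs) + 1 entries,
    where entry j is F(cutoff_{j+1} - xb) - F(cutoff_j - xb), with the
    outermost cutoffs taken to be minus and plus infinity."""
    cdf = [0.0] + [logistic_cdf(a - xb) for a in cutoffs] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cdf) - 1)]


cutoffs = [-1.0, 0.0, 1.0, 2.0]           # four thresholds, five categories
probs = ordered_logit_probs(0.4, cutoffs)  # Pr[y = 0], ..., Pr[y >= 4]
probs_high = ordered_logit_probs(1.4, cutoffs)
```

Raising the index shifts probability mass toward the top category, which is the sense in which the model respects the ordering of the counts.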

Another possible sequential model, although less parsimonious, is obtained by specifying a sequence of binary models for $\Pr[y = 1|y > 0]$, $\Pr[y = 2|y > 1]$, and so on.

Finally,  in  some  cases  durations  may  be  available  in  addition  to  counts.  For  example, 
if  the  dates  of  doctor  visits  are  known,  one  can  model  a  count,  the  number  of  visits  in 
a  month,  say,  or  the  duration  of  time  between  visits.  In  general,  the  latter  approach 
is  more  efficient,  since  it  uses  more  detailed  data,  but  the  count  regression  can  still 
provide  useful  information  about  the  role  of  covariates  (Dean  and  Balshaw,  1997). 


20.5.  Partially  Parametric  Models 

By  partially  parametric  models  we  mean  that  we  focus  on  modeling  the  data  via  the 
conditional  mean  and  variance,  and  even  these  may  not  be  fully  specified.  In  Sec¬ 
tion  20.5.1  we  consider  models  based  on  specification  of  the  conditional  mean  and 
variance.  In  Section  20.5.2  we  consider  and  critique  the  use  of  least-squares  methods 
that  do  not  explicitly  model  the  heteroskedasticity  inherent  in  count  data.  In  Section 
20.5.3  we  consider  models  that  are  even  more  partially  parametric,  such  as  those  giving 
an  incomplete  specification  of  the  conditional  mean. 

The  approach  is  similar  in  flavor  to  NLS,  except  that  here  we  allow  for  het¬ 
eroskedasticity  that  is  well  modeled  as  a  function  of  the  conditional  mean. 


20.5.1.  Quasi-ML  Estimation 

As  discussed  in  Section  20.2.1,  when  using  PML  or  QML,  the  distribution  of  the  es¬ 
timator  is  obtained  under  weaker  assumptions  about  the  dgp  than  those  that  lead  to  a 
specific  likelihood  function. 

Let us reconsider (20.6). Given an assumption for the functional form for $\omega_i$, and a consistent estimate $\hat{\omega}_i$ of $\omega_i$, one can consistently estimate this covariance matrix. We could use the Poisson assumption, $\omega_i = \mu_i$, but as already noted the data are often overdispersed, with $\omega_i > \mu_i$. Common variance functions used are $\omega_i = (1 + \alpha\mu_i)\mu_i$, that of the NB2 model discussed in Section 20.4.2, and $\omega_i = (1 + \alpha)\mu_i$, that of the NB1 model. Note that in the latter case (20.6) simplifies to $V_{PML}[\hat{\beta}_P] = (1 + \alpha)\left(\sum_i \mu_i x_i x_i'\right)^{-1}$, so with overdispersion ($\alpha > 0$) the usual ML variance matrix given in (20.7) understates the true variance.

If $\omega_i = E[(y_i - \exp(x_i'\beta))^2|x_i]$ is instead unspecified, a consistent estimate of $V_{PML}[\hat{\beta}_P]$ can be obtained by adapting the Eicker-White robust sandwich variance estimate formula to this case. The middle sum in (20.6) needs to be estimated. If $\hat{\mu}_i \xrightarrow{p} \mu_i$ then $N^{-1}\sum_i (y_i - \hat{\mu}_i)^2 x_i x_i' \xrightarrow{p} \lim N^{-1}\sum_i \omega_i x_i x_i'$. Thus a consistent estimate of $V_{PML}[\hat{\beta}_P]$ is given by (20.6) with $\omega_i$ and $\mu_i$ replaced by $(y_i - \hat{\mu}_i)^2$ and $\hat{\mu}_i$.

When  doubt  exists  about  the  form  of  the  variance  function,  the  use  of  the  PML  es¬ 
timator  is  recommended.  Computationally  this  is  essentially  the  same  as  Poisson  ML, 
with  the  qualification  that  the  variance  matrix  must  be  recomputed.  The  calculation  of 
robust  variances  is  often  an  option  in  standard  packages. 
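
The recomputation of the variance can be sketched in Python for the simplest possible case. The sketch assumes a single regressor without an intercept, so that every matrix in the sandwich formula is a scalar, and it assumes that (20.6) has the usual sandwich form with an outer sum of $\mu_i x_i^2$ and a middle sum of $\omega_i x_i^2$. The data and function names are hypothetical.

```python
import math


def score(b, y, x):
    """Poisson first-order condition: sum_i (y_i - exp(b * x_i)) * x_i."""
    return sum((yi - math.exp(b * xi)) * xi for yi, xi in zip(y, x))


def poisson_pml(y, x, tol=1e-12, max_iter=500):
    """Solve the first-order condition for scalar b by Newton's method."""
    b = 0.0
    for _ in range(max_iter):
        hess = sum(math.exp(b * xi) * xi * xi for xi in x)
        step = score(b, y, x) / hess
        b += step
        if abs(step) < tol:
            break
    return b


def pml_variances(b, y, x):
    """Return (usual ML variance, robust sandwich variance) for scalar b."""
    mu = [math.exp(b * xi) for xi in x]
    outer = sum(mi * xi * xi for mi, xi in zip(mu, x))
    middle = sum((yi - mi) ** 2 * xi * xi for yi, mi, xi in zip(y, mu, x))
    return 1.0 / outer, middle / (outer * outer)


x = [0.0, 0.5, 1.0, 1.5, 2.0]  # made-up regressor values
y = [1, 0, 3, 4, 9]            # made-up counts
b_hat = poisson_pml(y, x)
v_ml, v_robust = pml_variances(b_hat, y, x)
```

The point estimate is the same under both variance formulas; only the reported standard error changes.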

These  results  for  Poisson  PML  estimation  are  qualitatively  similar  to  those  for  PML 
estimation  in  the  linear  model  under  normality.  They  extend  more  generally  to  PML 
estimation  based  on  densities  in  the  linear  exponential  family.  In  all  cases  consistency 
requires  only  correct  specification  of  the  conditional  mean  (Nelder  and  Wedderburn, 
1972;  Gourieroux  et  al.,  1984a).  This  has  led  to  a  vast  statistical  literature  on  gener¬ 
alized  linear  models  (see  McCullagh  and  Nelder,  1989).  These  permit  valid  inference 
providing  the  conditional  mean  is  correctly  specified  and  nest  many  types  of  data  as 
special  cases  -  continuous  (normal),  count  (Poisson),  discrete  (binomial),  and  positive 
(gamma)  as  detailed  in  Section  5.7.4.  Many  methods  for  complications,  such  as  time- 
series  and  panel  data  models,  are  presented  within  the  more  general  GLM  framework 
rather  than  specifically  for  count  data. 

Some econometricians find it more natural to use the GMM framework rather than GLM. Then the starting point is the conditional moment $E[y_i - \exp(x_i'\beta)|x_i] = 0$. If data are independent over $i$ and the conditional variance is a multiple of the mean it can be shown that the optimal choice of instrument is $x_i$, leading to the estimating equations (20.5); for more detail, see Cameron and Trivedi (1998, pp. 37-44). The GMM framework has been fruitful for panel data on counts (see Section 20.5.3) and for endogenous regressors. Fully specified parametric simultaneous equations models for counts are in their infancy, so instrumental variables methods are appealing. Given instruments $z_i$, $\dim(z) \ge \dim(x)$, satisfying $E[y_i - \exp(x_i'\beta)|z_i] = 0$, a consistent estimator of $\beta$ minimizes


$$G(\beta) = \left[\sum_{i=1}^{N} (y_i - \exp(x_i'\beta))\, z_i\right]' W \left[\sum_{i=1}^{N} (y_i - \exp(x_i'\beta))\, z_i\right], \tag{20.20}$$


where  W  is  a  symmetric  weighting  matrix. 
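
The objective function (20.20) can be sketched in Python. The sketch assumes a single regressor, two instruments, and an identity weighting matrix; the data are constructed so that the moment condition holds exactly at a hypothetical value `beta0 = 0.7`, which makes the minimum easy to see.

```python
import math


def gmm_objective(beta, y, x, z, w):
    """Quadratic form m' W m with m = sum_i (y_i - exp(x_i * beta)) * z_i.

    x[i] is a scalar regressor, z[i] a list of instruments, and w a
    symmetric weighting matrix stored as a list of rows."""
    q = len(w)
    m = [0.0] * q
    for yi, xi, zi in zip(y, x, z):
        resid = yi - math.exp(xi * beta)
        for k in range(q):
            m[k] += resid * zi[k]
    return sum(m[j] * w[j][k] * m[k] for j in range(q) for k in range(q))


beta0 = 0.7
x = [0.1, 0.5, 0.9, 1.3]
z = [[1.0, 0.2], [1.0, 0.4], [1.0, 1.0], [1.0, 1.1]]
y = [math.exp(xi * beta0) for xi in x]  # no error term, for illustration
w = [[1.0, 0.0], [0.0, 1.0]]

g_at_truth = gmm_objective(beta0, y, x, z, w)
g_below = gmm_objective(0.5, y, x, z, w)
g_above = gmm_objective(0.9, y, x, z, w)
```

With real data the objective would be minimized numerically, and the weighting matrix would normally be chosen to reflect the heteroskedasticity of the counts.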

The  pros  and  cons  of  this  approach  are  as  follows.  A  major  advantage  is  that  the 
approach  makes  fewer  distributional  assumptions  and  hence  avoids  a  possible  model 
misspecification. However, the discreteness in the outcome variable and its natural heteroskedasticity are ignored, leading to a possible loss of efficiency. A suitable choice of the $W$ matrix may mitigate the problem. Further, by emphasizing the first moment of the distribution, when potentially there may be significant additional information in the higher moments, the IV estimator may be sensitive to the presence of large counts
in  the  data.  Table  20.2  illustrates  features  of  some  types  of  data  that  are  awkward  to 
model  using  a  GMM-type  estimator. 


20.5.2.  Least-Squares  Estimation 

When  attention  is  focused  on  modeling  just  the  conditional  mean,  least-squares  meth¬ 
ods  are  inferior  to  the  approach  of  the  previous  section. 

Linear least-squares regression of $y$ on $x$ leads to consistent parameter estimates if the conditional mean is linear in $x$. However, for count data the specification $E[y|x] = x'\beta$ is inadequate as it permits negative values of $E[y|x]$. For similar reasons the linear probability model is inadequate for binary data.

Transformations of $y$ may be considered. In particular, the logarithmic transformation regresses $\ln y$ on $x$. This transformation is problematic if the data contain zeros, as is often the case. One standard solution is to add a constant term, such as 0.5, and to model $\ln(y + 0.5)$ by OLS. This ad hoc method introduces problems of retransformation if we are interested in $E[y|x]$ rather than $E[\ln y|x]$; see Mullahy (1998). However, conversion to a linear model has the advantage of convenience if, for example, there is an endogenous right-hand variable that needs to be “instrumented.”

It is instead better to use nonlinear least squares with the exponential mean specification; that is, estimate the nonlinear regression model $y = \exp(x'\beta) + u$. It is important that statistical inference for the NLS estimator be based on Eicker-White robust standard errors since the error term in this regression will be heteroskedastic.

For counts the NLS estimator is generally less efficient than the Poisson pseudo-MLE. The NLS first-order condition is $\sum_i (y_i - \exp(x_i'\beta))\exp(x_i'\beta)\,x_i = 0$. This weights the residuals differently than does the Poisson pseudo-MLE (see (20.5)). The NLS weights are optimal if $V[y_i|x_i]$ is constant (homoskedastic) whereas the Poisson pseudo-MLE weights are optimal if $V[y_i|x_i]$ is a multiple of $E[y_i|x_i]$. The latter is a much better model for handling the inherent heteroskedasticity of count data.
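
The comparison of weights can be made explicit with a short worked derivation. It assumes the standard estimating equation in which each residual is divided by a working variance $\omega_i$; the symbols $\sigma^2$ and $\phi$ are introduced here only for illustration and do not appear in the text.

```latex
% Assumed general form: estimating equations with working variance \omega_i
% and exponential mean \mu_i = \exp(x_i'\beta), so that
% \partial\mu_i / \partial\beta = \mu_i x_i.
\sum_{i=1}^{N} \frac{y_i - \mu_i}{\omega_i}\, \mu_i x_i = 0 .

% Case 1: \omega_i = \sigma^2 (constant variance). Dropping the constant
% gives the NLS first-order condition.
\sum_{i=1}^{N} (y_i - \mu_i)\, \mu_i x_i = 0 .

% Case 2: \omega_i = \phi\,\mu_i (variance proportional to the mean).
% The factor \mu_i cancels, giving the Poisson pseudo-MLE condition.
\sum_{i=1}^{N} (y_i - \mu_i)\, x_i = 0 .
```

Relative to the Poisson condition, NLS therefore gives extra weight to observations with large fitted means, which are also the observations with the largest variance when the data behave like counts.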


20.5.3.  Semiparametric  Models 

By  semiparametric  models  we  mean  partially  parametric  models  that  have  an  infinite¬ 
dimensional  component,  as  developed  in  Section  9.7.  The  curse  of  dimensionality  mo¬ 
tivates  us  to  put  some  structure  on  the  conditional  mean  function. 

One class of semiparametric models incompletely specifies the conditional mean. Leading examples are single-index models and partially linear models. Single-index models specify $\mu_i = g(x_i'\beta)$, where the functional form $g(\cdot)$ is left unspecified. Partially linear models specify $\mu_i = \exp(x_i'\beta + g(z_i))$, where the functional form $g(\cdot)$ is left unspecified. In both cases $\sqrt{N}$-consistent asymptotically normal estimators of $\beta$ can be obtained, without knowledge of $g(\cdot)$.

A second example is optimal estimation of the regression parameters $\beta$, when $\mu_i = \exp(x_i'\beta)$ is assumed but $V[y_i|x_i] = \omega_i$ is left unspecified. The infinite-dimensional component arises because as $N \to \infty$ there are infinitely many variance parameters $\omega_i$. An optimal estimator of $\beta$, called an adaptive estimator, is one that is as efficient as that when $\omega_i$ is known. Delgado and Kniesner (1997) extend results for the linear regression model to count data with exponential conditional mean function, using kernel regression methods to estimate weights to be used in a second-stage nonlinear least-squares regression. In their application the estimator shows little gain over specifying $\omega_i = \mu_i(1 + \alpha\mu_i)$, overdispersion of the NB2 form.


20.6.  Multivariate  Counts  and  Endogenous  Regressors 

In this section we very briefly present extensions from cross-section to other types of
count  data  (see  Cameron  and  Trivedi,  1998,  for  further  detail).  For  multivariate  count 
data  many  models  have  been  proposed  but  preferred  methods  have  not  yet  been  estab¬ 
lished.  For  related  panel  data  there  is  more  agreement  in  the  econometrics  literature  on 
which  methods  to  use,  though  a  wider  range  of  models  is  considered  in  the  statistics 
literature;  see  Section  23.7. 


20.6.1.  Multivariate  Data 

In  some  data  sets  more  than  one  count  is  observed.  For  example,  data  on  the  utiliza¬ 
tion  of  several  different  types  of  health  service,  such  as  doctor  visits  and  hospital  days, 
may  be  available.  Joint  modeling  will  improve  efficiency  and  provide  richer  models 
of  the  data  if  counts  are  correlated.  This  section  briefly  reviews  bivariate  count  mod¬ 
els  related  to  the  main  models  of  this  chapter.  The  reader  familiar  with  multiequation 
linear  models  with  correlated  errors,  e.g.  the  SUR  model  in  Section  6.9.3,  may  think 
of  a  generalization  to  multiequation  count  models  with  correlated  errors.  Assume  that 
we  observe  several  count  variables  for  the  same  individual  (e.g.,  number  of  visits  to 
a  doctor  and  number  of  prescribed  medications  taken).  The  source  of  correlation  may 
lie  in  unobserved  heterogeneity.  Joint  estimation  that  takes  account  of  correlated  er¬ 
rors  will  yield  more  efficient  estimates,  but  at  the  cost  of  additional  computational 
complexity. 


Semiparametric  Methods 

A  partially  parametric  approach  views  this  as  a  seemingly  unrelated  regressions  prob¬ 
lem,  adapting  methods  for  the  linear  regression  model  to  count  data  where  the  condi¬ 
tional  means  are  nonlinear  and  the  data  are  heteroskedastic;  see  Section  6.10.3. 

Gourieroux, Monfort, and Trognon (1984b) propose a moment-based approach to derive the bivariate Poisson-type model. They specify a model by defining the first two moments of $y_1$ and $y_2$ and estimate it by a quasi-generalized pseudo-maximum likelihood procedure. This model allows for overdispersion and is more general than the bivariate Poisson model, but it does not maintain the integer-valued property of the counts.

Delgado (1992) treats a multivariate count model as a multivariate nonlinear model and suggests a semiparametric generalized least-squares estimator. The covariance matrix of the residuals is estimated using the $k$-NN method. The approach differs from that of Gourieroux, Monfort, and Trognon (1984) in the choice of the estimator for the covariance matrix.

Most parametric studies have used the bivariate Poisson. One way this distribution is derived is to suppose that the two counts $y_1$ and $y_2$ are generated as $y_1 = z_1 + w$ and $y_2 = z_2 + w$, where all of $z_1$, $z_2$, and $w$ are independent and Poisson distributed, with positive parameters $\lambda_1$, $\lambda_2$, and $\lambda_{12}$, respectively, which may be parameterized as a function of exogenous covariates. This is called the method of trivariate reduction.

The marginal distribution of $y_j$ is Poisson[$\lambda_j + \lambda_{12}$] and, therefore, this model restricts the conditional mean to be equal to the conditional variance for each count variable, so

$$E[y_j|x_j] = V[y_j|x_j] \tag{20.21}$$

for $j = 1, 2$, where $x_j$ is a vector of explanatory variables. The correlation coefficient is given by

$$\mathrm{Cor}[y_1, y_2] = \frac{\lambda_{12}}{\sqrt{(\lambda_1 + \lambda_{12})(\lambda_2 + \lambda_{12})}}, \tag{20.22}$$

which is positive, because $\lambda_{12} > 0$.
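
The trivariate reduction construction can be checked numerically with a short Python sketch. It assumes fixed parameter values with no covariates, builds the joint probabilities by summing over the common component $w$, and compares the implied covariance with the parameter of the common component. The parameter values and function names are hypothetical.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def bivariate_poisson_pmf(y1, y2, lam1, lam2, lam12):
    """Pr[y1, y2] when y1 = z1 + w and y2 = z2 + w with independent
    Poisson z1, z2, w; the sum runs over the feasible values of w."""
    return sum(
        poisson_pmf(w, lam12)
        * poisson_pmf(y1 - w, lam1)
        * poisson_pmf(y2 - w, lam2)
        for w in range(min(y1, y2) + 1)
    )


def bivariate_poisson_corr(lam1, lam2, lam12):
    """Correlation implied by the trivariate reduction construction."""
    return lam12 / math.sqrt((lam1 + lam12) * (lam2 + lam12))


lam1, lam2, lam12 = 1.0, 2.0, 0.5
grid = range(41)  # tail mass beyond 40 is negligible for these means
p = {(a, b): bivariate_poisson_pmf(a, b, lam1, lam2, lam12)
     for a in grid for b in grid}
total = sum(p.values())
m1 = sum(a * v for (a, b), v in p.items())
m2 = sum(b * v for (a, b), v in p.items())
cov = sum(a * b * v for (a, b), v in p.items()) - m1 * m2
```

The covariance equals the mean of the shared component, so the construction cannot produce negative correlation.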


Fully  Parametric  Methods 


Several  recent  studies  develop  better  parametric  models  by  introducing  correlated 
unobserved  heterogeneity  for  each  count.  The  related  issues  were  discussed  in  Sec¬ 
tions  6.10.1  and  19.3. 

Marshall and Olkin (1990) consider a model with multiplicative unobserved heterogeneity in the marginal distributions of both counts in the following way. Let $y_j$ be $\mathcal{P}[\lambda_j \nu]$, $j = 1, 2$, where $\mathcal{P}$ denotes the Poisson distribution with mean $\lambda_j \nu$ and $\nu$ has a gamma distribution with density

$$g(\nu) = \frac{\nu^{\alpha - 1}\exp(-\nu)}{\Gamma(\alpha)}.$$

The random variable $\nu$ can be interpreted as common (shared) unobserved heterogeneity. The resulting model is a one-factor model. The bivariate negative binomial (BVNB) distribution of two counts is defined as


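$$\begin{aligned} f(y_1, y_2|x_1, x_2) &= \int_0^\infty f_1(y_1|x_1, \nu)\, f_2(y_2|x_2, \nu)\, g(\nu)\, d\nu \\ &= \int_0^\infty \left[\prod_{j=1}^{2} \frac{\exp(-\lambda_j \nu)(\lambda_j \nu)^{y_j}}{y_j!}\right] \frac{\nu^{\alpha - 1}\exp(-\nu)}{\Gamma(\alpha)}\, d\nu \\ &= \frac{\Gamma(y_1 + y_2 + \alpha)}{y_1!\, y_2!\, \Gamma(\alpha)} \left[\frac{\lambda_1}{\lambda_1 + \lambda_2 + 1}\right]^{y_1} \left[\frac{\lambda_2}{\lambda_1 + \lambda_2 + 1}\right]^{y_2} \left[\frac{1}{\lambda_1 + \lambda_2 + 1}\right]^{\alpha}. \end{aligned} \tag{20.23}$$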

This mixture has a closed-form solution, but the model restricts the unobserved heterogeneity to be the identical component for both count variables. The joint likelihood is built up with terms like (20.23). The marginal distributions are univariate negative binomial, and the correlation between the two count variables,

$$\mathrm{Cor}[y_1, y_2] = \frac{\lambda_1 \lambda_2}{\sqrt{(\lambda_1^2 + \alpha\lambda_1)(\lambda_2^2 + \alpha\lambda_2)}}, \tag{20.24}$$

must be positive.
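
The closed form of the mixture in (20.23) can be evaluated with a short Python sketch. It assumes that the heterogeneity term has the unit-scale gamma density with shape $\alpha$ given above, uses fixed values in place of covariate-dependent parameters, and works on the log scale for numerical stability. The parameter values and function name are hypothetical.

```python
import math


def bvnb_pmf(y1, y2, lam1, lam2, alpha):
    """Bivariate negative binomial probability obtained by integrating two
    Poisson densities with means lam_j * v over a gamma(alpha, 1) draw v."""
    s = lam1 + lam2 + 1.0
    log_p = (
        math.lgamma(y1 + y2 + alpha)
        - math.lgamma(y1 + 1)
        - math.lgamma(y2 + 1)
        - math.lgamma(alpha)
        + y1 * math.log(lam1 / s)
        + y2 * math.log(lam2 / s)
        - alpha * math.log(s)
    )
    return math.exp(log_p)


lam1, lam2, alpha = 0.5, 0.8, 2.0
grid = range(121)  # wide enough that the omitted tail is negligible
p = {(a, b): bvnb_pmf(a, b, lam1, lam2, alpha) for a in grid for b in grid}
total = sum(p.values())
m1 = sum(a * v for (a, b), v in p.items())
m2 = sum(b * v for (a, b), v in p.items())
cov = sum(a * b * v for (a, b), v in p.items()) - m1 * m2
```

Under the assumed gamma density the mean of the first count is `alpha * lam1`, and the covariance computed on the grid is positive, consistent with the shared heterogeneity term.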

Other  models  with  more  flexible  correlation  structures,  but  that  also  require 
computationally  advanced  methods,  have  been  proposed  by  Cameron  and  Johansson 
(1998),  Munkin  and  Trivedi  (1999),  and  Chib  and  Winkelmann  (2001). 

Munkin  and  Trivedi  (1999)  consider  a  generalization  of  the  BVNB  model  as 
follows: 


$$f(y_1, y_2|x_1, x_2) = \int\!\!\int f_1(y_1|x_1, \nu_1)\, f_2(y_2|x_2, \nu_2)\, g(\nu_1, \nu_2)\, d\nu_1\, d\nu_2, \tag{20.25}$$


where the joint distribution is built up from the two marginal models, each conditioned on a separate unobserved heterogeneity variable, $\nu_1$ and $\nu_2$, respectively, that are specified to have a bivariate normal distribution. Conditional on $(x_1, x_2, \nu_1, \nu_2)$ each marginal distribution is Poisson, with multiplicative unobserved normal heterogeneity. The model is therefore a bivariate Poisson-log-normal mixture. The likelihood function is the product over the sample of terms like (20.25). The authors interpret this as a “two-factor model.” This specification is more flexible as it does not restrict the sign or size of correlation between the two unobserved components. However, this additional flexibility introduces computational complexity because the bivariate integral in (20.25) does not have an analytical solution and hence must be handled using a simulation-based approach (discussed in Chapter 12 and in Munkin and Trivedi, 1999). If the dimension of the model, the number of $y$ variables, increases, then so does the order of numerical integration involved. This feature combined with a possibly large sample size can make the computational burden very significant. Chib
and  Winkelmann  (2001)  suggest  an  alternative  Bayesian  MCMC  approach,  which, 
while  retaining  the  flexibility  of  the  aforementioned  specification,  can  handle  a  high¬ 
dimensional  outcome  vector.  They  demonstrate  the  feasibility  of  their  approach  with  a 
six-dimensional  mixed  Poisson-log-normal  model. 
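
The simulation-based treatment of the integral in (20.25) can be sketched in Python. This is a plain Monte Carlo average and not the authors' estimator. It assumes that, given the heterogeneity terms, each count is Poisson with mean `mu_j * exp(v_j)`, and that `(v_1, v_2)` is bivariate normal with zero means, standard deviations `s1` and `s2`, and correlation `rho`. All names, the seed, and the number of draws are hypothetical.

```python
import math
import random


def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def simulated_joint_prob(y1, y2, mu1, mu2, s1, s2, rho,
                         draws=2000, seed=12345):
    """Monte Carlo estimate of the joint probability of (y1, y2): average
    the product of conditional Poisson densities over simulated draws of
    the correlated normal heterogeneity terms."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        v1 = s1 * e1
        v2 = s2 * (rho * e1 + math.sqrt(1.0 - rho * rho) * e2)
        total += (poisson_pmf(y1, mu1 * math.exp(v1))
                  * poisson_pmf(y2, mu2 * math.exp(v2)))
    return total / draws


# With no heterogeneity the estimate collapses to independent Poissons.
exact = poisson_pmf(1, 1.2) * poisson_pmf(2, 0.7)
p_no_het = simulated_joint_prob(1, 2, 1.2, 0.7, 0.0, 0.0, 0.0, draws=200)
p_het = simulated_joint_prob(1, 2, 1.2, 0.7, 0.5, 0.8, -0.4)
```

A simulated likelihood would sum the logarithm of such estimates over observations, reusing the same underlying draws at every parameter value so that the objective stays smooth in the parameters.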

Another  recently  developed  approach  to  modeling  correlated  counts  is  the  cop¬ 
ula  approach  described  in  Section  19.3.  Here  one  begins  with  the  specification  of 
marginal  distributions;  the  joint  distribution  is  obtained  by  combining  the  marginals 
using  a  copula.  Examples  for  dependent  durations  were  given  in  Section  19.3.  See  also 
Cameron,  Li,  Trivedi,  and  Zimmer  (2004). 


20.6.2.  Count  Models  with  Endogenous  Regressors 

Simultaneous  models  for  count  variables  arise  in  a  number  of  contexts.  For  example, 
in  Cameron  et  al.  (1988)  the  focus  is  on  a  count  variable  (medical  utilization),  but  one 
of  the  covariates,  the  health  insurance  status  of  the  subject,  is  an  endogenous  choice. 
Mullahy  (1997)  in  a  cross-section  context,  and  Crepon  and  Duguet  (1997b)  in  a  panel 
data context, apply the GMM approach to count models with endogenous regressors. A very well known example from health economics involves models of counts of health services, such as doctor visits, in which one of the regressors is the health insurance status of the individual. The assumption that the choice of health insurance and the
error  on  the  outcome  equation  are  uncorrelated  is  unrealistic,  and  hence  the  insurance 
regressor  is  likely  to  be  endogenous.  Chapter  22  provides  more  examples  and  details 
of  panel  count  models  with  endogenous  regressors. 

Currently  the  econometric  literature  provides  two  approaches  to  the  estimation  of 
models  with  endogenous  regressors:  one  based  on  the  GMM/IV  approach  and  the  other 
based  on  stronger  assumptions  of  maximum  likelihood.  We  consider  each  in  turn. 

The first approach (Mullahy, 1997) begins with a moment condition. Consider the exponential mean model with additive zero-mean error term,

$$y_i = E[y_i|x_i] + v_i = \exp(x_i'\beta) + v_i, \tag{20.26}$$

$$E[v_i|x_i] \ne 0. \tag{20.27}$$

Suppose that we have available instrumental variables $z_i$ that satisfy the moment conditions

$$E[v_i|z_i] = 0, \qquad E[y_i - \exp(x_i'\beta)|z_i] = 0. \tag{20.28}$$


Then  the  GMM  or  nonlinear  IV  estimation  is  feasible,  assuming  that  there  are  enough 
moment  conditions  available.  This  approach  has  already  been  discussed  in  Sec¬ 
tion  6.5.3.  The  reader  is  referred  to  this  section  for  details  and  related  discussion. 
However,  note  that  in  implementing  this  approach  the  count  nature  of  the  variable  is 
ignored  and  the  model  is  treated  like  any  other  nonlinear  model  with  an  exponential 
mean.  Also,  note  that  heteroskedasticity  is  highly  likely  with  counted  data  and  hence 
the  GMM/IV  procedure  should  accommodate  this  complication. 

Mullahy  has  pointed  out  that  a  multiplicative  error  term  specification  has  certain 
advantages.  This,  however,  leads  to  a  different  moment  condition.  Let 


$$\mathrm{E}[y_i \mid \mathbf{x}_i, v_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})\, v_i. \tag{20.29}$$


This  leads  to  the  moment  condition 


$$\mathrm{E}\left[\frac{y_i}{\exp(\mathbf{x}_i'\boldsymbol{\beta})} - 1 \,\Big|\, \mathbf{z}_i\right] = 0, \tag{20.30}$$


which is a special case of the nonlinear moment condition
$\mathrm{E}[r(y_i, \mathbf{x}_i, \boldsymbol{\beta}) \mid \mathbf{z}_i] = 0$ discussed in Section 6.5.
Provided suitable and sufficient moment conditions are available, the GMM approach
can be followed. Once again, however, for a count variable heteroskedasticity is likely,
and efficiency loss will occur because the count feature of the variable has been ignored.
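The two moment conditions can be compared numerically. The following is a minimal sketch, not the estimator of Section 6.5.3: it simulates data under an assumed multiplicative-error design (the coefficients, the lognormal error, and the single instrument are all illustrative choices, not taken from the text) and evaluates the sample analogues of (20.28) and (20.30). A GMM estimator would go on to minimize a quadratic form in these sample moments; that step is omitted here.

```python
import numpy as np

def moment_multiplicative(beta, y, X, Z):
    """Sample analogue of E[z_i (y_i / exp(x_i'beta) - 1)] = 0, as in (20.30)."""
    resid = y / np.exp(X @ beta) - 1.0
    return Z.T @ resid / len(y)

def moment_additive(beta, y, X, Z):
    """Sample analogue of E[z_i (y_i - exp(x_i'beta))] = 0, as in (20.28)."""
    resid = y - np.exp(X @ beta)
    return Z.T @ resid / len(y)

def simulate(n, beta, seed=0):
    """Hypothetical design: x is endogenous because it shares e with the error v."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                 # instrument, independent of e
    e = rng.normal(size=n)                 # unobservable driving both x and v
    x = 0.5 * z + 0.5 * e                  # endogenous regressor
    v = np.exp(0.5 * e - 0.125)            # multiplicative error with E[v | z] = 1
    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])
    y = rng.poisson(np.exp(X @ beta) * v)  # counts with E[y | x, v] = exp(x'beta) v
    return y, X, Z

beta_true = np.array([0.5, 0.5])
y, X, Z = simulate(200_000, beta_true)

g_true = moment_multiplicative(beta_true, y, X, Z)              # close to zero
g_wrong = moment_multiplicative(np.array([0.5, 0.0]), y, X, Z)  # clearly nonzero
print("multiplicative moments at true beta: ", g_true)
print("multiplicative moments at wrong beta:", g_wrong)
print("additive moments at true beta:       ", moment_additive(beta_true, y, X, Z))
```

In this design the instrument is valid for the multiplicative residual but not for the additive one, which illustrates why the two error specifications lead to different moment conditions.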

Alternative approaches that simultaneously handle the count feature of the dependent
variable and the problem of endogenous regressors are more parametric (Terza,
1998). Deb and Trivedi (2004) develop a joint model of counts ($Y$), with an insurance
plan variable ($D$) as a regressor, and a binary choice model for the insurance plan.
Endogeneity in their model arises from the presence of correlated unobserved
heterogeneity in the outcome (count) equation and the binary choice equation. Their
model has the following structure:


$$\Pr[Y_i = y_i \mid \mathbf{x}_i, D_i, l_i] = f(\mathbf{x}_i'\boldsymbol{\beta} + \gamma_1 D_i + \lambda l_i), \tag{20.31}$$

$$\Pr[D_i = 1 \mid \mathbf{z}_i, l_i] = g(\mathbf{z}_i'\boldsymbol{\alpha} + \delta l_i), \tag{20.32}$$

where the $l_i$ are latent factors reflecting unobserved heterogeneity and $\delta$ and $\lambda$ are the
associated factor loadings. The joint distribution of selection and outcome variables,
conditional on the common latent factors, can be written as

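$$\Pr[Y_i = y_i, D_i = 1 \mid \mathbf{x}_i, \mathbf{z}_i, l_i] = f(\mathbf{x}_i'\boldsymbol{\beta} + \gamma_1 D_i + \lambda l_i)\, g(\mathbf{z}_i'\boldsymbol{\alpha} + \delta l_i), \tag{20.33}$$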

because  (Y,  D)  are  assumed  to  be  conditionally  independent. 

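The problem in estimation arises because the $l_i$ are unknown. Assume, however,
that $h$, the distribution of $l_i$, is known, so that $l_i$ can be integrated out of the joint
density, that is,

$$\Pr[Y_i = y_i, D_i = 1 \mid \mathbf{x}_i, \mathbf{z}_i] = \int \left[ f(\mathbf{x}_i'\boldsymbol{\beta} + \gamma_1 D_i + \lambda l_i)\, g(\mathbf{z}_i'\boldsymbol{\alpha} + \delta l_i) \right] h(l_i)\, dl_i. \tag{20.34}$$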

Cast  in  this  form,  the  unknown  parameters  of  the  model  may  be  estimated  by  maximum 
likelihood. 

For simplicity we assume $h(l_i)$ has no unknown parameters. Then the maximum
likelihood estimator maximizes the joint likelihood function
$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid y_i, D_i, \mathbf{x}_i, \mathbf{z}_i)$,
where $\boldsymbol{\theta}_1 = (\boldsymbol{\beta}, \gamma_1, \lambda)$ and $\boldsymbol{\theta}_2 = (\boldsymbol{\alpha}, \delta)$ refer to parameters in the outcome and plan
choice equations, respectively, and $L$ refers to the joint likelihood whose $i$th component
is defined in (20.34). For identification additional normalization restrictions may
be needed.

The main practical problem of estimation, given suitable specifications for $f$, $g$,
and $h$, is that the integral does not, in general, have a closed-form solution. The
maximum simulated likelihood (MSL) estimator replaces the expectation by a simulated
sample analogue (average), that is,

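$$\widetilde{\Pr}[Y_i = y_i, D_i = 1 \mid \mathbf{x}_i, \mathbf{z}_i] = \frac{1}{S} \sum_{s=1}^{S} \left[ f(\mathbf{x}_i'\boldsymbol{\beta} + \gamma_1 D_i + \lambda \tilde{l}_{is})\, g(\mathbf{z}_i'\boldsymbol{\alpha} + \delta \tilde{l}_{is}) \right], \tag{20.35}$$

where $\tilde{l}_{is}$ is the $s$th draw (from a total of $S$ draws) of a pseudo-random number from
the density $h$ and $\widetilde{\Pr}$ denotes the simulated probability. A simulated likelihood function
for the data can then be defined. The MSL estimator maximizes the simulated
log-likelihood.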
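The simulated probability in (20.35) can be sketched in a few lines. The code below is illustrative only and rests on assumptions that are not in the text: $f$ is taken to be a Poisson probability with mean equal to the exponential of the index, $g$ is a logit probability, $h$ is standard normal, and for $D_i = 0$ the plan-choice contribution is taken to be $1 - g$. The function names and numerical values are hypothetical.

```python
import math
import numpy as np

def poisson_pmf(y, mu):
    """Poisson probability of the count y, evaluated at an array of means mu."""
    return np.exp(-mu + y * np.log(mu) - math.lgamma(float(y) + 1.0))

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def simulated_prob(y, d, xb, za, gamma1, lam, delta, draws):
    """Average of f(.) * g(.) over draws of the latent factor, as in (20.35).

    Assumed here (not from the text): Poisson f, logit g, and 1 - g when d = 0.
    """
    mu = np.exp(xb + gamma1 * d + lam * draws)
    p_plan = logistic(za + delta * draws)
    g = p_plan if d == 1 else 1.0 - p_plan
    return float(np.mean(poisson_pmf(y, mu) * g))

def simulated_loglik(y, d, xb, za, gamma1, lam, delta, draws):
    """Sum over observations of the log simulated probability.

    draws has shape (N, S); the same draws should be reused at every
    parameter value that an optimizer evaluates.
    """
    return sum(
        math.log(simulated_prob(y[i], d[i], xb[i], za[i], gamma1, lam, delta, draws[i]))
        for i in range(len(y))
    )

rng = np.random.default_rng(0)
draws = rng.standard_normal((3, 500))   # h assumed standard normal, S = 500
y = np.array([0, 2, 5])
d = np.array([1, 0, 1])
xb = np.array([0.1, 0.4, 0.9])          # x_i'beta for three hypothetical observations
za = np.array([-0.2, 0.3, 0.5])         # z_i'alpha for the same observations
ll = simulated_loglik(y, d, xb, za, gamma1=0.3, lam=0.5, delta=0.8, draws=draws)
print(ll)
```

An MSL estimator would pass the negative of `simulated_loglik` to a numerical optimizer over $(\boldsymbol{\beta}, \gamma_1, \lambda, \boldsymbol{\alpha}, \delta)$, holding the draws fixed across iterations.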

This approach, developed for an endogenous dummy regressor in a count regression
model, can be extended to multiple dummies, and multiple outcomes, whether
discrete or continuous. The limitation comes from the burden of estimation, which is
very heavy compared with an IV-type estimator. Further, as in any simultaneous
equation model, identifiability is an issue. Applied work typically includes some nontrivial
explanatory variables in the z vector that are excluded from the x vector.


20.7.  Count  Example:  Further  Analysis

We  now  reconsider  the  earlier  analysis  based  on  the  Poisson  regression  by  using  more 
flexible  parametric  models  beginning  with  the  NB2  model. 

The results for the NB2 model are given in the last columns of Table 20.5, presented
in Section 20.3. Here too we report the robust standard errors and t-ratios. First note
that the overdispersion coefficient $\alpha$ is highly significant. The Wald test statistic is
8.926, leading to a decisive rejection of the null of equidispersion ($\alpha = 0$). Consistent
with this is the large increase in the log-likelihood, from −60,087 to −42,777. Clearly,
the improvement in the fit of the model is considerable. Because the models are nested
it is unnecessary to report AIC and BIC.

Row  3  in  Table  20.6  shows  the  predicted  frequencies  from  the  NB2  model.  These 
are  very  close  to  the  observed  frequencies  and  confirm  the  improvement  in  the  fit  of 
the  model  as  a  result  of  overdispersion  being  accounted  for. 

The coefficients themselves, however, seem fairly stable among alternative estimation
methods, and all effects are measured with precision, reflecting the impact of
the large sample. These features of the results are encouraging, suggesting that the
NB2 model is reasonable. As predicted by basic economic theory, utilization and the
coinsurance rate (LC) are negatively correlated. The estimated impact does not seem
sensitive to the treatment of overdispersion.

Additional  modeling  refinements  are  possible.  For  example,  Deb  and  Trivedi  (2002) 
compare  the  performance  of  the  two-part  (hurdle)  model  with  a  two-component  finite 
mixture  model  and  find  the  latter  to  fit  better.  However,  even  the  hurdle  model  fits  better 
than  the  NB2  model.  Although  such  refinements  provide  additional  information,  none 
of  the  results  given  here  can  be  regarded  as  misleading  on  the  essential  question  of 
price  sensitivity  of  utilization. 

The  NB2  model  works  well  for  doctor  visits.  For  other  count  outcomes,  however, 
even  more  flexible  models  than  NB2  may  be  necessary. 


20.8.  Practical  Considerations 

Those with experience of nonlinear least squares will find it easy to use packaged
software for Poisson regression, which is a widely available option in popular
econometrics and statistics packages. Care is needed to ensure that robust standard errors
are obtained. Many econometrics packages also include negative binomial regression
and the basic panel data models. Popular statistics packages include count regression
in a generalized linear models module. Standard packages also produce some
goodness-of-fit statistics, such as the pseudo-$R^2$ measures; for the Poisson model
see Section 8.7.1.

More recently developed models, such as finite mixture models, most time-series
models, and dynamic panel data models, require developing one's own programs. A
promising route is to use matrix programming languages in conjunction with software
for implementing estimation based on user-defined objective functions. For
simple models many computer programs make it possible to implement maximum
likelihood estimation and (highly desirable) robust variance estimation for
user-defined functions.

In  addition  to  reporting  parameter  estimates  it  is  useful  to  have  an  indication  of 
the  magnitude  of  the  estimated  effects,  as  discussed  in  Section  20.2.3.  As  noted  in 
Section  20.2.4,  care  should  be  taken  to  ensure  that  reported  standard  errors  and  t- 
statistics  for  the  Poisson  regression  model  are  based  on  variance  estimates  robust  to 
overdispersion. 

In  addition  to  estimation  it  is  strongly  recommended  that  specification  tests  be  used 
to  assess  the  adequacy  of  the  estimated  model.  For  Poisson  cross-section  regression 
overdispersion  tests  are  easy  to  implement.  For  any  parametric  model  one  can  compare 
the  actual  and  fitted  frequency  distribution  of  counts,  although  it  is  not  always  easy  to 
understand  the  respect  in  which  a  model  fails  when  the  distribution  of  observed  counts 
is  highly  dispersed.  Formal  statistical  specification  and  goodness-of-fit  tests  based  on 
actual  and  fitted  frequencies  are  available. 
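As an illustration of how little code an overdispersion test requires, the following sketch implements the auxiliary regression described in Exercise 20-5: after Poisson estimation, regress $[(y_i - \hat{\mu}_i)^2 - y_i]/\hat{\mu}_i$ on $\hat{\mu}_i$ without an intercept and test for a zero coefficient. The Newton-Raphson routine, the heteroskedasticity-robust standard error, and the simulated data (Poisson counts with mean-one gamma heterogeneity) are illustrative assumptions rather than the book's implementation.

```python
import numpy as np

def poisson_mle(X, y, max_iter=100, tol=1e-10):
    """Poisson MLE for the mean exp(x'beta) by Newton-Raphson.

    Assumes the first column of X is a constant.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start at the intercept-only fit
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])   # negative of the log-likelihood Hessian
        step = np.linalg.solve(hessian, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def overdispersion_test(X, y):
    """Regress ((y - mu)^2 - y) / mu on mu with no intercept.

    Returns the slope and a heteroskedasticity-robust t-statistic.
    """
    mu = np.exp(X @ poisson_mle(X, y))
    w = ((y - mu) ** 2 - y) / mu
    slope = (mu @ w) / (mu @ mu)
    resid = w - slope * mu
    se = np.sqrt(np.sum((mu * resid) ** 2)) / (mu @ mu)
    return slope, slope / se

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.5, 0.3]))
nu = rng.gamma(shape=1.0, scale=1.0, size=n)   # mean-one heterogeneity with variance 1
y_over = rng.poisson(mu_true * nu)             # overdispersed counts
slope, t_stat = overdispersion_test(X, y_over)
print(slope, t_stat)
```

A large positive t-statistic points to overdispersion of the form $\mu + \alpha\mu^2$; with equidispersed data the statistic should be close to zero.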

In most practical situations one is likely to face the problem of model selection.
For likelihood-based models that are nonnested one can use selection criteria, such
as the Akaike information criterion, that are based on the fitted log-likelihood but
include a degrees-of-freedom penalty for models with many parameters.


20.9.  Bibliographic  Notes 

20.2  All the topics dealt with in this chapter are treated at greater length and depth
by Cameron and Trivedi (1998), who also provide a comprehensive bibliography.
Winkelmann (1997) also provides a treatment of the econometric literature on counts.
The statistics literature generally analyzes counts in the context of GLM. The standard
reference is McCullagh and Nelder (1989). The econometrics literature generally
underemphasizes the contributions of the GLM literature. Fahrmeir and Tutz
(1994) provide a recent and more econometric exposition of GLMs. The material in
Section 20.2 is standard and appears in many places.

20.3  Deb  and  Trivedi  (2002)  give  a  detailed  analysis  of  these  RHIE  data. 

20.4  Cameron  and  Trivedi  (1986)  provide  an  early  presentation  and  application  of  the 
negative  binomial.  Hausman  et  al.  (1984)  applied  the  model  and  its  variants  to  panel 
data.  For  the  finite  mixture  approach  of  Section  20.4.3  see  Deb  and  Trivedi  (1997). 
Applications  of  the  hurdle  model  in  Section  20.4.5  include  those  by  Mullahy  (1986), 
who  first  proposed  the  model,  Pohlmeier  and  Ulrich  (1995),  and  Gurmu  and  Trivedi 
(1996). 

20.5  The quasi-MLE of Section 20.5.1 is presented in detail by Gourieroux et al. (1984a, b)
and by Cameron and Trivedi (1986).

20.6  Regression models for the types of data discussed in Section 20.6 are in their infancy.
The notable exception is that (static) panel data count models are well established,
with the standard reference being Hausman et al. (1984). See also Brannas and
Johansson (1996). Developing adequate regression models for multivariate count data
and models with endogenous regressors is currently an active area; see Terza (1998)
and Deb and Trivedi (2004).
Exercises


20-1  Suppose that $Y$ is Poisson distributed with mean $\mu$.

(a)  Verify that the first four moments (the mean, followed by the second, third,
and fourth central moments) are, respectively, $\mu$, $\mu$, $\mu$, and $3\mu^2 + \mu$.

(b)  Show that there is a linear relationship between $\Pr[Y = j]$ and
$\Pr[Y = j - 1]$, $j = 1, 2, \ldots$.

(c)  Consider the Poisson MLE in the regression case with $\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})$.
Possible estimates of the variance of the Poisson MLE include
$\widehat{\mathrm{V}}[\hat{\boldsymbol{\beta}}] = \left[\sum_i \hat{\mu}_i \mathbf{x}_i \mathbf{x}_i'\right]^{-1}$ and
$\widehat{\mathrm{V}}[\hat{\boldsymbol{\beta}}] = \left[\sum_i (y_i - \hat{\mu}_i)^2 \mathbf{x}_i \mathbf{x}_i'\right]^{-1}$.
Show that they are asymptotically equivalent (upon scaling by $N$) if the data
density is correctly specified.


20-2  Now consider overdispersion in the Poisson model.

(a)  Suppose $Y \mid \varepsilon \sim \mathcal{P}[\mu]$, where $\mu = \exp(\beta_0 + \beta_1 x)$, $\beta_0 = \kappa_0 + \varepsilon$, and $\varepsilon$ is
an unobserved random variable with $\mathrm{E}[\varepsilon] = 0$, $\mathrm{V}[\varepsilon] = \sigma^2 > 0$. Show that
$\mathrm{V}[Y] > \mathrm{E}[Y]$.

(b)  Consider the NB2 model with the variance function $\mu + \alpha\mu^2$ and the probability
mass function given in (20.12). Using graphs for four different values of
$\alpha \in [0, 3]$, describe the behavior of the probability mass for different realized
values of $Y$; in your answer concentrate on the behavior of the function near
the origin and in the right tail.

(c)  For the NB2 density given in (20.12) in Section 20.4.1, show that as $\alpha \to 0$
the density goes to the Poisson. [This could be tricky.]


20-3  Consider the Poisson regression model with conditional mean $\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$.
Treat the estimation problem as an unweighted nonlinear least-squares problem in
which $y = \mathrm{E}[y \mid \mathbf{x}] + \varepsilon$, where $\mathrm{E}[y \mid \mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$ and $\varepsilon \sim \mathrm{iid}[0, \sigma^2]$.


(a)  Derive the nonlinear least-squares equations for $(\boldsymbol{\beta}, \sigma^2)$. Compare the
least-squares and the maximum likelihood equations for $\boldsymbol{\beta}$ and explain the
difference between them.

(b)  Derive the weighted nonlinear least-squares equations for $\boldsymbol{\beta}$. Explain your
choice of weights. [Weights are used to handle heteroskedasticity.]

(c)  Compare the weighted nonlinear least-squares and the maximum likelihood
equations and explain the similarities, if any.

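20-4  Consider a finite mixture density $f(y \mid \boldsymbol{\theta}) = \sum_{j=1}^{C} \pi_j f_j(y \mid \boldsymbol{\theta}_j)$, an additive mixture
of $C$ distinct latent classes, or subpopulations, with unknown mixing proportions
$\pi_1, \ldots, \pi_C$, where $\sum_{j=1}^{C} \pi_j = 1$, $\pi_j > 0$. Here $y$ is a count variable, and the $j$th
component density $f_j(y \mid \boldsymbol{\theta}_j)$ for the $i$th observation is expressed as

$$f_j(y_i) = \frac{\Gamma(y_i + \psi_{ji})}{\Gamma(\psi_{ji})\,\Gamma(y_i + 1)} \left( \frac{\psi_{ji}}{\lambda_{ji} + \psi_{ji}} \right)^{\psi_{ji}} \left( \frac{\lambda_{ji}}{\lambda_{ji} + \psi_{ji}} \right)^{y_i},$$

where $\lambda_{ji} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_j)$, $\psi_{ji} = \lambda_{ji}^{k}/\alpha_j$, $\alpha_j > 0$, and $\boldsymbol{\theta}_j = (\boldsymbol{\beta}_j, \alpha_j)$. Here $k$ is either 0
or 1. This model is the finite mixture negative binomial with $C$ components and
specializes to the finite mixture Poisson if $\alpha_j = 0$.

(a)  Show that $\mathrm{E}[y_i \mid \mathbf{x}_i] = \bar{\lambda}_i = \sum_{j=1}^{C} \pi_j \lambda_{ji}$ and
$\mathrm{V}[y_i \mid \mathbf{x}_i] = \sum_{j=1}^{C} \pi_j \lambda_{ji}^2 \left[ 1 + \alpha_j \lambda_{ji}^{-k} \right] + \bar{\lambda}_i - \bar{\lambda}_i^2$.

(b)  Show that any mixture model based on the first moment alone is not
identified.

(c)  Show that the $C$-component Poisson mixture based on the first two moments
is identified.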

20-5  (Adapted from Baltagi and Li, 1999) A simple test of overdispersion in a Poisson
model given in Section 20.2.4 tests the null hypothesis of zero coefficient
in the regression of $[(y_i - \hat{\mu}_i)^2 - y_i]/\hat{\mu}_i$ on $\hat{\mu}_i$. An alternative test proposed in
the literature (Baltagi and Li, 1999) involves the same test but is based on the
regression of $(y_i - \hat{\mu}_i)^2$ on $\hat{\mu}_i$. The latter can be motivated by the idea of tests
based on the Gauss-Newton regression (see Section 10.3.9). Analyze the differences
between the tests and the implications of the differences for the manner
of implementing the second test.

20-6  For  this  problem  use  a  50%  subsample  of  the  data  used  in  this  chapter. 

(a)  Estimate Poisson and negative binomial regressions with MDU as the dependent
variable and the following explanatory variables: LC, IDP, LINC,
FEMALE, EDUDEC, XAGE, BLACK, HLTHG, HLTHF, and HLTHP. Carry out
a likelihood ratio test of the null hypothesis that the variables LC and IDP
have no effect on MDU.

(b)  Test for overdispersion in the Poisson regression using the variance formulations
(20.9) with $g(\mu) = \mu$ and (20.10) with $g(\mu) = \mu^2$ in this chapter. Which
version of the variance formulation gets more support from the data? What
do you conclude from this exercise?

(c)  Estimate  the  negative  binomial  model  (NB2).  Compare  the  estimate  of  the 
overdispersion  parameter  with  that  in  part  (b).  Explain  the  similarities  and 
differences. 

(d)  Using  the  results  from  the  negative  binomial  estimation,  compare  the 
estimated  marginal  effect  of  a  change  in  LC  for  an  average  individual 
in  excellent  health  (baseline)  and  an  average  individual  in  poor  health 
(HLTHP  =  1). 

(e)  For  this  Poisson  specification  estimate  the  “hurdle  version”  consisting  of  a 
zero  part  (logit  or  probit)  and  a  positive  part  (truncated-at-zero  Poisson). 
Compare  these  results  with  those  from  a  regular  Poisson  model.  Analyze 
the  similarities  and  differences  between  the  implications  of  the  two  models. 
Based  on  your  analysis,  which  model  do  you  regard  as  a  better  explanation 
of  the  data? 
Cross-section models have certain inherent limitations. They are predominantly
equilibrium models that generally do not shed light on intertemporal dependence of events.
They also cannot satisfactorily resolve fundamental issues about the sources of
persistence in behavior. Such persistence may be behavioral, i.e., arising from true state
dependence, or it may be spurious, being an artifact of the inability to control for
heterogeneous behavior in the population. Because panel data, also called longitudinal
data, contain periodically repeated observations of the same subjects, they have a large
potential for resolving issues that cross-section models cannot satisfactorily handle.
Chapters 21 through 23 present methods for panel data. We progress systematically
from linear models for continuous data in Chapter 21 to nonlinear panel data models
for limited dependent variables in Chapter 23. Both fixed effects and random effects
models are considered. A persistent theme through these three chapters is the importance
of using panel-robust methods of inference.

Chapter 21, which reviews the key general results for linear panel data regression
models, can be read easily by those with a good grasp of linear regression; it does not
require the material covered in Parts 2 to 4. We recommend that even those who are
interested in more advanced material first read quickly through this chapter to gain
familiarity with key concepts and definitions.

Chapter  22  covers  important  extensions  of  Chapter  21,  especially  to  dynamic  panels 
which  allow  for  Markovian  dependence  structure  of  current  variables.  The  analysis  is 
in  the  GMM  framework  that  is  currently  favored  by  many  practitioners  in  this  area. 
The  analysis  here  is  at  times  intricate,  involving  many  issues  of  detail.  A  strong  grasp 
of  GMM  will  be  helpful  in  absorbing  the  main  results  of  this  chapter. 

The results of Chapters 21 and 22 do not extend to nonlinear panel models of
Chapter 23 in a general and unified fashion. There are relatively fewer general results for
limited dependent variable panel models. Despite this, in Chapter 23 we begin by
presenting an analysis of some general issues and approaches. Later sections of this
chapter present panel data extensions of the counterpart cross-section models studied
in Part 4. These sections analyze four categories of models for binary, count, censored,
and duration data, respectively, and should be accessible to a suitably prepared reader
familiar with the parallel cross-section models.
CHAPTER  21


Linear  Panel  Models:  Basics 


21.1.  Introduction 

Panel data are repeated observations on the same cross section, typically of individuals
or firms in microeconomics applications, observed for several time periods. Other
terms  used  for  such  data  include  longitudinal  data  and  repeated  measures.  The  focus 
is  on  data  from  a  short  panel,  meaning  a  large  cross  section  of  individuals  observed  for 
a  few  time  periods,  rather  than  a  long  panel  such  as  a  small  cross  section  of  countries 
observed  for  many  time  periods. 

A major advantage of panel data is increased precision in estimation. This is the
result of an increase in the number of observations owing to combining or pooling
several time periods of data for each individual. However, for valid statistical inference
one needs to control for likely correlation of regression model errors over time
for a given individual. In particular, the usual formula for OLS standard errors in a
pooled OLS regression typically overstates the precision gains, leading to underestimated
standard errors and t-statistics that can be greatly inflated.

A second attraction of panel data is the possibility of consistent estimation of the
fixed effects model, which allows for unobserved individual heterogeneity that may
be correlated with regressors. Such unobserved heterogeneity leads to omitted variables
bias that could in principle be corrected by instrumental variables methods using
only a single cross section, but in practice it can be difficult to obtain a valid instrument.
Data from a short panel, with as few as two periods, offer an alternative way
to proceed if the unobserved individual-specific effects are assumed to be additive and
time-invariant.

Most disciplines in applied statistics other than microeconometrics treat any unobserved
individual heterogeneity as being distributed independently of the regressors.
Then the effects are called random effects, though a better term is purely random effects.
Compared to fixed effects models this stronger assumption has the advantage
of permitting consistent estimation of all parameters, including coefficients of
time-invariant regressors. However, random effects and pooled estimators are inconsistent
if the true model is one with fixed effects. Economists often view the assumptions for
the random effects model as being unsupported by the data.

A third attraction of panel data is the possibility of learning more about the dynamics
of individual behavior than is possible from a single cross section. Thus a cross
section may yield a poverty rate of 20%, but we need panel data to determine whether
the same 20% are in poverty each year. As a related example, panel data may determine
whether high serial correlation of individual earnings or unemployment spell length is
due to an individual-specific tendency to have high earnings or a long unemployment
spell, or whether it is a consequence of having past high earnings or unemployment.
This topic is deferred to Chapter 22.

The  linear  panel  data  models  and  associated  estimators  are  conceptually  simple, 
aside  from  the  fundamental  issue  of  whether  or  not  fixed  effects  are  necessary.  The 
considerable  algebra  used  to  derive  the  properties  of  panel  data  estimators  should  not 
distract  one  from  an  understanding  of  the  basics:  The  statistical  properties  of  panel 
data  estimators  vary  with  the  assumed  model  and  its  treatment  of  unobserved  effects. 
Furthermore,  much  of  the  algebra  does  not  generalize  to  nonlinear  panel  models. 

The current chapter presents the basic estimators for various linear panel data models.
A lengthy introduction in Sections 21.2 and 21.3 provides, respectively, the commonly
used models and estimators and an application to the relationship between annual
hours worked and wages. The important distinction between fixed and random
effects models is studied in Section 21.4. Sections 21.5-21.7 present additional detail
on estimation for, respectively, pooled models, individual-specific fixed effects models,
and individual-specific random effects models. Section 21.8 considers other basic
aspects such as inference and prediction in linear panel data models.

21.2.  Overview  of  Models  and  Estimators 

Panel  data  provide  information  on  individual  behavior  both  across  time  and  across 
individuals. 

Even  for  linear  regression,  standard  panel  data  analysis  uses  a  much  wider  range  of 
models  and  estimators  than  is  the  case  with  cross-section  data.  Several  standard  models 
are  presented  in  Section  21.2.1,  followed  by  several  estimators  presented  in  Section 
21.2.2.  Table  21.1  gives  a  summary  that  also  indicates  that  several  of  the  estimators 
are  inconsistent  if  the  dgp  is  the  individual-specific  fixed  effects  model. 

Obtaining  correct  standard  errors  of  estimators  is  also  more  complicated  than  in 
the  cross-section  case.  One  needs  to  control  for  correlation  over  time  in  errors  for  a 
given  individual,  in  addition  to  possible  heteroskedasticity.  This  topic  is  covered  in 
Section  21.2.3. 


21.2.1.  Panel  Data  Models 

A  very  general  linear  model  for  panel  data  permits  the  intercept  and  slope  coefficients 
to  vary  over  both  individual  and  time,  with 

$$y_{it} = \alpha_{it} + \mathbf{x}_{it}'\boldsymbol{\beta}_{it} + u_{it}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T,$$

where $y_{it}$ is a scalar dependent variable, $\mathbf{x}_{it}$ is a $K \times 1$ vector of independent variables,
$u_{it}$ is a scalar disturbance term, $i$ indexes individual (or firm or country) in a cross
section, and $t$ indexes time.

This model is too general and is not estimable as there are more parameters to
estimate than observations. Further restrictions need to be placed on the extent to which
$\alpha_{it}$ and $\boldsymbol{\beta}_{it}$ vary with $i$ and $t$, and on the behavior of the error $u_{it}$.


Table 21.1.  Linear Panel Model: Common Estimators and Models (a)

                                                      Assumed Model
Estimator of β                      Pooled        Random Effects       Fixed Effects
                                    (21.1)        (21.3) and (21.5)    (21.3) Only

Pooled OLS (21.1)                   Consistent    Consistent           Inconsistent
Between (21.7)                      Consistent    Consistent           Inconsistent
Within (or Fixed Effects) (21.8)    Consistent    Consistent           Consistent
First Differences (21.9)            Consistent    Consistent           Consistent
Random Effects (21.10)              Consistent    Consistent           Inconsistent

(a) This table considers only consistency of estimators of β. For correct computation
of standard errors see Section 21.2.3.


Pooled  Model 

The  most  restrictive  model  is  a  pooled  model  that  specifies  constant  coefficients,  the 
usual  assumption  for  cross-section  analysis,  so  that 

$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}. \tag{21.1}$$

If this model is correctly specified and regressors are uncorrelated with the error then
it can be consistently estimated using pooled OLS. The error term is likely to be correlated
over time for a given individual, however, in which case the usual reported
standard errors should not be used as they can be greatly downward biased. Furthermore,
the pooled OLS estimator is inconsistent if the fixed effects model, defined in
the following, is appropriate.
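The downward bias in the usual standard errors can be seen in a small simulation. The sketch below is illustrative only: it computes pooled OLS together with default standard errors and standard errors that allow arbitrary error correlation within each individual (clustered on $i$), using a simulated design chosen for this example in which both the regressor and the error contain an individual-specific component. No finite-sample correction is applied to the clustered variance, and the function name is hypothetical.

```python
import numpy as np

def pooled_ols_se(y, X, ids):
    """Pooled OLS with default and cluster-robust (by individual) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    se_default = np.sqrt(np.diag(XtX_inv) * (u @ u) / (n - k))
    meat = np.zeros((k, k))
    for i in np.unique(ids):
        rows = ids == i
        s = X[rows].T @ u[rows]      # sum over t of x_it * u_it for individual i
        meat += np.outer(s, s)
    se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_default, se_cluster

rng = np.random.default_rng(0)
N, T = 1000, 5
ids = np.repeat(np.arange(N), T)
x = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)   # x correlated over t
u = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)   # error correlated over t
y = 1.0 + 0.5 * x + u                 # error independent of x, so pooled OLS is consistent
X = np.column_stack([np.ones(N * T), x])

beta_hat, se_default, se_cluster = pooled_ols_se(y, X, ids)
print(beta_hat, se_default, se_cluster)
```

In this design the slope estimate is close to its true value, but the default standard error is noticeably smaller than the clustered one, which is the inflation of precision described above.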


Individual  and  Time  Dummies 

A simple variant of the model (21.1) permits intercepts to vary across individuals and
over time while slope parameters do not. Then $y_{it} = \alpha_i + \gamma_t + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}$, or

$$y_{it} = \sum_{j=1}^{N} \alpha_j d_{j,it} + \sum_{s=2}^{T} \gamma_s d_{s,it} + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \tag{21.2}$$

where the $N$ individual dummies $d_{j,it}$ equal one if $i = j$ and equal zero otherwise,
the $(T - 1)$ time dummies $d_{s,it}$ equal one if $t = s$ and equal zero otherwise, and it is
assumed that $\mathbf{x}_{it}$ does not include an intercept. (If an intercept is included then one of
the $N$ individual dummies must be dropped.)
This model has $N + (T - 1) + \dim[\mathbf{x}]$ parameters that can be consistently estimated
if both $N \to \infty$ and $T \to \infty$. We focus on short panels where $N \to \infty$ but $T$
does not. Then the $\gamma_t$ can be consistently estimated, so the $(T - 1)$ time dummies are
simply incorporated into the regressors $\mathbf{x}_{it}$. The challenge then lies in estimating the
parameters $\boldsymbol{\beta}$ controlling for the $N$ individual intercepts $\alpha_i$. One possibility is to instead
have dummies for groups of observations, such as grouping by region, in which
case the clustering methods of Chapter 24 are relevant. Here instead we specify a full
set of $N$ individual intercepts, which causes problems as $N \to \infty$.


Fixed  Effects  and  Random  Effects  Models 

The individual-specific effects model allows each cross-sectional unit to have a different
intercept term though all slopes are the same, so that

$$y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}, \tag{21.3}$$

where $\varepsilon_{it}$ is iid over $i$ and $t$. This is a more parsimonious way to express model (21.2),
with any time dummies included in the regressors $\mathbf{x}_{it}$. The $\alpha_i$ are random variables that
capture unobserved heterogeneity, already studied in Sections 18.2-18.5 and 20.4.

Throughout this chapter we make the assumption of strong exogeneity or strict
exogeneity

$$\mathrm{E}[\varepsilon_{it} \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0, \qquad t = 1, \ldots, T, \tag{21.4}$$

so that the error term is assumed to have mean zero conditional on past, current, and
future values of the regressors. Chamberlain (1980) gives a detailed discussion of exogeneity
assumptions and tests for exogeneity for panel data. Strong exogeneity rules
out models with lagged dependent variables or with endogenous variables as regressors;
these models are deferred to Chapter 22.

One variant of the model (21.3) treats $\alpha_i$ as an unobserved random variable that is
potentially correlated with the observed regressors $\mathbf{x}_{it}$. This variant is called the fixed
effects (FE) model as early treatments modeled these effects as parameters $\alpha_1, \ldots, \alpha_N$
to be estimated. If fixed effects are present and correlated with $\mathbf{x}_{it}$ then many estimators
such as pooled OLS are inconsistent. Instead, alternative estimation methods that
eliminate the $\alpha_i$ are needed to ensure consistent estimation of $\boldsymbol{\beta}$ in a short panel.
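A small simulation makes the inconsistency concrete. The sketch below is an illustration under assumed values, not an example from the text: the individual effect is built into the regressor so that the two are correlated, and the pooled OLS slope is compared with the slope obtained after subtracting individual means, one way of eliminating the individual effects (the within estimator listed in Table 21.1). The function names and the design are hypothetical.

```python
import numpy as np

def pooled_slope(y, x):
    """Slope from pooled OLS of y on an intercept and a scalar regressor x."""
    xc = (x - x.mean()).ravel()
    yc = (y - y.mean()).ravel()
    return float(xc @ yc / (xc @ xc))

def within_slope(y, x):
    """Slope from OLS on deviations from individual means (rows are individuals)."""
    xd = (x - x.mean(axis=1, keepdims=True)).ravel()
    yd = (y - y.mean(axis=1, keepdims=True)).ravel()
    return float(xd @ yd / (xd @ xd))

rng = np.random.default_rng(0)
N, T, beta = 2000, 5, 1.0
alpha = rng.normal(size=(N, 1))                  # individual effects
x = alpha + rng.normal(size=(N, T))              # regressor correlated with alpha_i
y = alpha + beta * x + rng.normal(size=(N, T))   # model (21.3) with a scalar regressor

b_pooled = pooled_slope(y, x)
b_within = within_slope(y, x)
print(b_pooled, b_within)
```

With these choices the pooled slope converges to 1.5 rather than the true value of 1, because the composite error contains the individual effect, whereas demeaning removes that effect and the within slope is close to 1.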

The other variant of the model (21.3) assumes that the unobservable individual effects
$\alpha_i$ are random variables that are distributed independently of the regressors. This
model is called the random effects (RE) model, which usually makes the additional
assumptions that

$$\alpha_i \sim [\alpha, \sigma_\alpha^2], \qquad \varepsilon_{it} \sim [0, \sigma_\varepsilon^2], \tag{21.5}$$

so that both the random effects and the error term in (21.3) are assumed to be iid. Note
that no specific distributions have been specified in (21.5). A more precise term for this
model is the one-way individual-specific random effects model, or more simply the
random intercept model, to distinguish the model from more general random effects
models such as the mixed linear models presented in Section 22.8. Yet another name
is the random components model.

The term fixed effect is potentially misleading and the term random effect is more precisely a purely random effect. To avoid such confusion, M-J. Lee (2002) calls a fixed effect a "related effect" and a random effect an "unrelated effect." We use the traditional notation and terminology, but it should be clear that $\alpha_i$ is a random variable in both fixed and random effects models.


Equicorrelated  Model 


The RE model can be viewed as a specialization of the pooled model, as the $\alpha_i$ can be subsumed into the error term. Then (21.3) can be viewed as regression of $y_{it}$ on $\mathbf{x}_{it}$ with composite error term $u_{it} = \alpha_i + \varepsilon_{it}$, and (21.5) implies that

$$\mathrm{Cov}[(\alpha_i + \varepsilon_{it}), (\alpha_i + \varepsilon_{is})] = \begin{cases} \sigma_\alpha^2, & t \neq s, \\ \sigma_\alpha^2 + \sigma_\varepsilon^2, & t = s. \end{cases} \tag{21.6}$$

The RE model therefore imposes the constraint that the composite error $u_{it}$ is equicorrelated, since $\mathrm{Cor}[u_{it}, u_{is}] = \sigma_\alpha^2 / [\sigma_\alpha^2 + \sigma_\varepsilon^2]$ for $t \neq s$ does not vary with the time difference $t - s$. Clearly, pooled OLS will be consistent but inefficient under the RE model. The random effects model is also called the equicorrelated model or exchangeable errors model.
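
The equicorrelation property can be checked numerically. The following is a minimal sketch, assuming normally distributed $\alpha_i$ and $\varepsilon_{it}$ (the RE model itself specifies no distribution) and illustrative standard deviations that are not taken from the text. It simulates the composite error $u_{it} = \alpha_i + \varepsilon_{it}$ and compares every off-diagonal sample correlation with $\sigma_\alpha^2/[\sigma_\alpha^2 + \sigma_\varepsilon^2]$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20000, 4
sigma_alpha, sigma_eps = 1.0, 2.0        # illustrative values (assumption)

alpha = rng.normal(0.0, sigma_alpha, size=(N, 1))   # individual effect alpha_i
eps = rng.normal(0.0, sigma_eps, size=(N, T))       # idiosyncratic error eps_it
u = alpha + eps                                     # composite error u_it

# T x T correlation matrix of (u_i1, ..., u_iT) across individuals
corr = np.corrcoef(u, rowvar=False)
off_diag = corr[~np.eye(T, dtype=bool)]

rho_theory = sigma_alpha**2 / (sigma_alpha**2 + sigma_eps**2)
print(rho_theory, off_diag.min(), off_diag.max())
```

All off-diagonal correlations should be close to the same value regardless of how far apart the two periods are, which is what distinguishes equicorrelation from, say, an autoregressive error whose correlation decays with $|t - s|$.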


Fixed  versus  Random  Effects  Models 

The  fundamental  distinction  is  between  models  with  and  without  fixed  effects.  The 
modern  econometrics  literature  emphasizes  fixed  effects,  but  we  also  provide  details 
for  the  random  effects  model. 

Some authors, including Chamberlain (1980, 1984) and Wooldridge (2002), use the notation

$$y_{it} = c_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$$

in (21.3) to make it very clear that the individual effect is a random variable in both fixed and random effects models. Both models assume that

$$\mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] = c_i + \mathbf{x}_{it}'\boldsymbol{\beta}.$$

The individual-specific effect $c_i$ is unknown and in short panels cannot be consistently estimated, so we cannot estimate $\mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}]$. Instead, we can eliminate $c_i$ by taking the expectation with respect to $c_i$, leading to

$$\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \mathrm{E}[c_i \mid \mathbf{x}_{it}] + \mathbf{x}_{it}'\boldsymbol{\beta}.$$

For the RE model it is assumed that $\mathrm{E}[c_i \mid \mathbf{x}_{it}] = \alpha$, so $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta}$ and hence it is possible to identify $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$. In the FE model, however, $\mathrm{E}[c_i \mid \mathbf{x}_{it}]$ varies with $\mathbf{x}_{it}$ and it is not known how it varies, so we cannot identify $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$. It is nonetheless possible to consistently estimate $\boldsymbol{\beta}$ in the FE model with short panels (as will be discussed in the following). Thus it is possible in the FE model to identify the marginal effect

$$\boldsymbol{\beta} = \partial \mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] / \partial \mathbf{x}_{it},$$

even though the conditional mean is not identified. For example, it is possible to identify the effect on earnings of an additional year of schooling, controlling for individual effects, even though the individual effects and the conditional mean are not identified.

In short panels the FE model permits only identification of the marginal effect $\partial \mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] / \partial \mathbf{x}_{it}$, and even then only for time-varying regressors, so the marginal effect of race or gender, for example, is not identified. The RE model permits identification of all components of $\boldsymbol{\beta}$ and of $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$, but the key RE assumption that $\mathrm{E}[c_i \mid \mathbf{x}_{it}]$ is constant is viewed as untenable in many microeconometrics applications.


21.2.2.  Panel  Data  Estimators 

We now introduce several commonly used panel data estimators of $\boldsymbol{\beta}$, with further detail provided in Sections 21.5-21.7. The estimators differ in the extent to which cross-section and time-series variation in the data are used, and their properties vary according to whether or not the fixed effects model is the appropriate model.

A regressor $x_{it}$ may be either time-invariant, with $x_{it} = x_i$ for $t = 1, \ldots, T$, or time-varying. For some estimators, notably the within and first differences estimators defined in the following, only the coefficients of time-varying regressors are identified.


Pooled  OLS 

The pooled OLS estimator is obtained by stacking the data over $i$ and $t$ into one long regression with $NT$ observations, and estimating by OLS

$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T.$$

If $\mathrm{Cov}[u_{it}, \mathbf{x}_{it}] = 0$ then either $N \to \infty$ or $T \to \infty$ is sufficient for consistency.

The pooled OLS estimator is clearly consistent if the pooled model (21.1) is appropriate and regressors are uncorrelated with the error term. The usual OLS variance matrix based on iid errors, however, is not appropriate here as the errors for a given individual $i$ are almost certainly positively correlated over $t$. The $NT$ correlated observations have less information than $NT$ independent observations.

To understand this correlation, note that for a given individual we expect considerable correlation in $y$ over time, so that $\mathrm{Cor}[y_{it}, y_{is}]$ is high. Even after inclusion of regressors $\mathrm{Cor}[u_{it}, u_{is}]$ may remain nonzero, and it often can still be quite high. For example, if a model overpredicts individual earnings in one year it may also overpredict earnings for the same individual in other years. The RE model accommodates this correlation, with $\mathrm{Cor}[u_{it}, u_{is}] = \sigma_\alpha^2 / [\sigma_\alpha^2 + \sigma_\varepsilon^2]$ for $t \neq s$ from (21.6).

The usual OLS output treats each of the $T$ years as independent pieces of information, but the information content is less than this given the positive error correlation. This leads to overstatement of estimator precision that can be very large, as illustrated in Section 21.3.2 and formally demonstrated in Section 21.5.4. One therefore needs to use panel-corrected standard errors (see Section 21.2.3) whenever OLS is applied in a panel setting. Many corrections are possible, depending on the correlation and heteroskedasticity structure assumed for the errors and whether the panel is short or long (see Section 21.5).

The pooled OLS estimator is inconsistent if the true model is the fixed effects model. To see this, rewrite the model (21.3) as

$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + (\alpha_i - \alpha + \varepsilon_{it}).$$

Then pooled OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ and an intercept leads to an inconsistent estimator of $\boldsymbol{\beta}$ if the individual effect $\alpha_i$ is correlated with the regressors $\mathbf{x}_{it}$, since such correlation implies that the combined error term $(\alpha_i - \alpha + \varepsilon_{it})$ is correlated with the regressors.

In summary, pooled OLS is appropriate if the constant-coefficients or random effects models are appropriate, but panel-corrected standard errors and $t$-statistics must be used for statistical inference. Pooled OLS is inconsistent if the fixed effects model is appropriate.


Between  Estimator 

The pooled OLS estimator uses variation over both time and cross-sectional units to estimate $\boldsymbol{\beta}$.

The between estimator in short panels instead uses just the cross-sectional variation. Begin with the individual-specific effects model (21.3). Averaging over all years yields $\bar{y}_i = \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i$, which can be rewritten as the between model

$$\bar{y}_i = \alpha + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + (\alpha_i - \alpha + \bar{\varepsilon}_i), \quad i = 1, \ldots, N, \tag{21.7}$$

where $\bar{y}_i = T^{-1}\sum_t y_{it}$, $\bar{\varepsilon}_i = T^{-1}\sum_t \varepsilon_{it}$, and $\bar{\mathbf{x}}_i = T^{-1}\sum_t \mathbf{x}_{it}$.

The between estimator is the OLS estimator from regression of $\bar{y}_i$ on an intercept and $\bar{\mathbf{x}}_i$. It uses variation between different individuals and is the analogue of cross-section regression, which is the special case $T = 1$.

The between estimator is consistent if the regressors $\bar{\mathbf{x}}_i$ are independent of the composite error $(\alpha_i - \alpha + \bar{\varepsilon}_i)$ in (21.7). This will be the case for the constant-coefficients model and the random effects model. In contrast, for the fixed effects model the between estimator is inconsistent as $\alpha_i$ is then assumed to be correlated with $\mathbf{x}_{it}$ and hence $\bar{\mathbf{x}}_i$.


Within  Estimator  or  Fixed  Effects  Estimator 

The within estimator is an estimator that, unlike the pooled OLS or between estimators, exploits the special features of panel data. In a short panel it measures the association between individual-specific deviations of regressors from their time-averaged values and individual-specific deviations of the dependent variable from its time-averaged value. This is done using the variation in the data over time.

Specifically, begin with the individual-specific effects model (21.3), which nests (21.1) as the special case $\alpha_i = \alpha$. Then taking the average over time yields $\bar{y}_i = \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i$. Subtracting this from $y_{it}$ in (21.3) yields the within model

$$y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta} + (\varepsilon_{it} - \bar{\varepsilon}_i), \quad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{21.8}$$

as the $\alpha_i$ terms cancel.

The within estimator is the OLS estimator in (21.8). A special feature of this estimator is that it yields consistent estimates of $\boldsymbol{\beta}$ in the fixed effects model, whereas the pooled OLS and between estimators do not.
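
The contrast between the three estimators can be illustrated by simulation. The sketch below is only an illustration under assumed values: a single regressor, a true slope of 0.5, and an individual effect that enters the regressor directly so that the fixed effects model holds. The helper name `ols_slope` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, beta = 2000, 5, 0.5                 # illustrative design (assumption)

alpha = rng.normal(size=(N, 1))           # individual effect alpha_i
x = alpha + rng.normal(size=(N, T))       # regressor correlated with alpha_i
y = alpha + beta * x + rng.normal(size=(N, T))

def ols_slope(x, y):
    """Slope from OLS of y on an intercept and a single regressor x."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc**2).sum())

# Pooled OLS: stack all NT observations
b_pooled = ols_slope(x.ravel(), y.ravel())

# Between: regress individual means on individual means
b_between = ols_slope(x.mean(axis=1), y.mean(axis=1))

# Within: OLS on deviations from individual means, as in (21.8)
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
b_within = float((x_w * y_w).sum() / (x_w**2).sum())

print(b_pooled, b_between, b_within)
```

In this design only the within estimate should be near 0.5; the pooled and between estimates absorb the correlation between $\alpha_i$ and the regressor and are biased upward.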

From Section 21.6 the within estimator has several interpretations. It is called the fixed effects estimator as it is the efficient estimator of $\boldsymbol{\beta}$ in the model (21.3) if $\alpha_i$ are fixed effects and the error $\varepsilon_{it}$ is iid. This chapter focuses on a literature that treats fixed effects as nuisance parameters that can be ignored since interest lies solely in estimation of $\boldsymbol{\beta}$. If instead the fixed effects are of interest they can also be estimated. In short panels these estimates of the individual $\alpha_i$ are inconsistent, though their distribution or their variation with a key variable may be informative. If $N$ is not too large an alternative and simpler way to compute the within estimator is by least-squares dummy variable estimation. This directly estimates (21.2) by OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ and the $N$ individual dummy variables and yields the within estimator for $\boldsymbol{\beta}$, along with estimates of the $N$ fixed effects (see Section 21.6.4). Yet another interpretation of the within estimator is the covariance estimator. Finally, taking deviations from individual-specific averages is equivalent to taking residuals from auxiliary regression of $y_{it}$ and $\mathbf{x}_{it}$ on individual dummies and then working with the residuals.

A major limitation of within estimation is that the coefficients of time-invariant regressors are not identified in the within model, since if $x_{it} = x_i$ then $\bar{x}_i = x_i$ so $(x_{it} - \bar{x}_i) = 0$. Many studies seek to estimate the effect of time-invariant regressors. For example, in panel wage regressions we may be interested in the effect of gender or race. For this reason many practitioners prefer not to use the within estimator. Pooled OLS or random effects estimators permit estimation of coefficients of time-invariant regressors, but these estimators are inconsistent if the fixed effects model is the correct model.


First-Differences  Estimator 

The first-differences estimator also exploits the special features of panel data. In a short panel it measures the association between individual-specific one-period changes in regressors and individual-specific one-period changes in the dependent variable.

Specifically, begin with the individual-specific effects model (21.3). Then lagging one period yields $y_{i,t-1} = \alpha_i + \mathbf{x}_{i,t-1}'\boldsymbol{\beta} + \varepsilon_{i,t-1}$. Subtracting this from $y_{it}$ in (21.3) yields the first-differences model

$$y_{it} - y_{i,t-1} = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta} + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad i = 1, \ldots, N, \quad t = 2, \ldots, T, \tag{21.9}$$

as the $\alpha_i$ terms cancel.

The first-differences estimator is the OLS estimator in (21.9). Like the within estimator, this estimator yields consistent estimates of $\boldsymbol{\beta}$ in the fixed effects model, though the coefficients of time-invariant regressors are not identified. The first-differences estimator is less efficient than the within estimator for $T > 2$ if $\varepsilon_{it}$ is iid.
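
The qualification $T > 2$ matters because with exactly two periods the within and first-differences estimators coincide numerically: each within deviation is then plus or minus half the first difference. The sketch below checks this on simulated data, assuming a single regressor and an illustrative true slope of 0.5; the function names are hypothetical, and both regressions omit an intercept, as in (21.8) and (21.9).

```python
import numpy as np

def within_slope(x, y):
    """OLS slope on deviations from individual means; x, y are (N, T)."""
    xw = x - x.mean(axis=1, keepdims=True)
    yw = y - y.mean(axis=1, keepdims=True)
    return float((xw * yw).sum() / (xw**2).sum())

def fd_slope(x, y):
    """OLS slope on one-period differences; x, y are (N, T)."""
    dx = np.diff(x, axis=1)
    dy = np.diff(y, axis=1)
    return float((dx * dy).sum() / (dx**2).sum())

rng = np.random.default_rng(4)
beta = 0.5                                # illustrative true slope (assumption)

def simulate(N, T):
    alpha = rng.normal(size=(N, 1))       # individual effect, correlated with x
    x = alpha + rng.normal(size=(N, T))
    y = alpha + beta * x + rng.normal(size=(N, T))
    return x, y

x2, y2 = simulate(1000, 2)                # T = 2: the two estimators coincide
x5, y5 = simulate(1000, 5)                # T = 5: they differ in finite samples

print(within_slope(x2, y2), fd_slope(x2, y2))
print(within_slope(x5, y5), fd_slope(x5, y5))
```

For $T = 5$ both estimates should be close to 0.5 but not identical to each other.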


Random  Effects  Estimator 


The random effects estimator is an estimator that also exploits the special features of panel data.

Begin with the individual-specific effects model (21.3), but assume a random effects model where $\alpha_i$ and $\varepsilon_{it}$ are iid as in (21.5). Pooled OLS is consistent but pooled GLS will be more efficient. The feasible GLS estimator (see Section 4.5.1) of the RE model, called the random effects estimator, can be calculated from OLS estimation of the transformed model

$$y_{it} - \hat{\lambda}\bar{y}_i = (1 - \hat{\lambda})\mu + (\mathbf{x}_{it} - \hat{\lambda}\bar{\mathbf{x}}_i)'\boldsymbol{\beta} + \nu_{it}, \tag{21.10}$$

where $\nu_{it} = (1 - \hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)$ is asymptotically iid, and $\hat{\lambda}$ is consistent for

$$\lambda = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T\sigma_\alpha^2}}. \tag{21.11}$$

Section 21.7 provides a derivation of (21.10) and ways to estimate $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$ and hence to estimate $\lambda$. Note that $\hat{\lambda} = 0$ corresponds to pooled OLS, $\hat{\lambda} = 1$ corresponds to within estimation, and $\hat{\lambda} \to 1$ as $T \to \infty$. This is a two-step estimator of $\boldsymbol{\beta}$.
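
As a quick numerical check on the limiting cases just described, the sketch below evaluates the weight $\lambda$, assuming (21.11) takes the standard form $\lambda = 1 - \sigma_\varepsilon / (\sigma_\varepsilon^2 + T\sigma_\alpha^2)^{1/2}$. The function name `re_lambda` is hypothetical. The inputs 0.161 and 0.233 are the rounded RE-GLS estimates of $\sigma_\alpha$ and $\sigma_\varepsilon$ reported in Table 21.2 for the hours and wages example, where $T = 10$.

```python
import math

def re_lambda(sigma_alpha, sigma_eps, T):
    """Random effects weight: 1 - sigma_eps / sqrt(sigma_eps^2 + T * sigma_alpha^2)."""
    return 1.0 - sigma_eps / math.sqrt(sigma_eps**2 + T * sigma_alpha**2)

# No individual effect variance: the transformation reduces to pooled OLS.
lam_pooled = re_lambda(0.0, 1.0, 10)

# Very long panel: the weight approaches 1, the within transformation.
lam_long = re_lambda(1.0, 1.0, 10**8)

# Rounded RE-GLS variance component estimates from Table 21.2, with T = 10.
lam_example = re_lambda(0.161, 0.233, 10)

print(lam_pooled, lam_long, lam_example)
```

Because the inputs are rounded to three decimals, `lam_example` agrees with the tabulated value of .585 only approximately.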

The  RE  estimator  is  fully  efficient  under  the  RE  model,  though  the  efficiency  gain 
compared  to  pooled  OLS  need  not  be  great.  It  is  inconsistent,  however,  if  the  fixed 
effects  model  is  the  correct  model. 


21.2.3.  Panel-Robust  Statistical  Inference 

The various panel models include error terms denoted $u_{it}$, $\varepsilon_{it}$, and $\alpha_i$. In many microeconometrics applications it is reasonable to assume independence over $i$. However, the errors are potentially (1) serially correlated (i.e., correlated over $t$ for given $i$) and/or (2) heteroskedastic. Valid statistical inference requires controlling for both of these factors.

The White heteroskedastic consistent estimator of Section 4.4.5 is easily extended to short panels since for the $i$th observation the error variance matrix is of finite dimension $T \times T$ while $N \to \infty$. Thus panel-robust standard errors can be obtained without assuming specific functional forms for either within-individual error correlation or heteroskedasticity. More efficient estimators using GMM are deferred to Section 22.2.3.

It is crucial to note that the panel commands in many computer packages frequently calculate default standard errors assuming iid model errors, leading to erroneous inference. In particular, for pooled OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ without any control for individual effects it is very likely that $\mathrm{Cov}[u_{it}, u_{is}] > 0$ for $t \neq s$. Ignoring this serial correlation can lead to greatly underestimated standard errors and overestimated $t$-statistics, as demonstrated in the Section 21.3 data example and shown algebraically in Section 21.5.4. Once fixed or random individual-specific effects are included the serial correlation in errors can be greatly reduced, but it may not be completely eliminated. Additionally, one may need to control for potential heteroskedasticity as is routinely done for cross-section data.


Panel-Robust  Sandwich  Standard  Errors 


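The panel estimators of Section 21.2.2 can be obtained by OLS estimation of $\boldsymbol{\theta}$ in the pooled regression

$$\tilde{y}_{it} = \tilde{\mathbf{w}}_{it}'\boldsymbol{\theta} + \tilde{u}_{it}, \tag{21.12}$$

where different panel estimators correspond to different transformations $\tilde{y}_{it}$, $\tilde{\mathbf{w}}_{it}$, and $\tilde{u}_{it}$ of $y_{it}$, $\mathbf{w}_{it} = [1 \;\; \mathbf{x}_{it}']'$, and $u_{it}$. The key is that $\tilde{y}_{it}$ is a known function of only $y_{i1}, \ldots, y_{iT}$, and similarly for $\tilde{\mathbf{w}}_{it}$ and $\tilde{u}_{it}$.

In the simplest case of pooled OLS, no transformation is necessary and $\boldsymbol{\theta} = [\alpha \;\; \boldsymbol{\beta}']'$. For the within estimator $\tilde{y}_{it} = y_{it} - \bar{y}_i$ and $\tilde{\mathbf{w}}_{it} = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$, where only time-varying regressors appear, and $\boldsymbol{\theta}$ equals the coefficients of the time-varying regressors. For first-differences estimation $\tilde{y}_{it} = y_{it} - y_{i,t-1}$ and $\tilde{\mathbf{w}}_{it} = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})$, and again only coefficients of time-varying regressors are identified. For random effects $\tilde{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_i$ and $\tilde{\mathbf{w}}_{it} = (\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)$, and $\boldsymbol{\theta} = [\alpha \;\; \boldsymbol{\beta}']'$. Such transformations can induce serial correlation even if underlying errors are uncorrelated.

It is convenient to stack observations over time periods for a given individual, leading to

$$\tilde{\mathbf{y}}_i = \tilde{\mathbf{W}}_i\boldsymbol{\theta} + \tilde{\mathbf{u}}_i,$$

where $\tilde{\mathbf{y}}_i$ is a $T \times 1$ vector in the preceding examples, except for the first-differences model where it is $(T - 1) \times 1$, and $\tilde{\mathbf{W}}_i$ is a $T \times q$ matrix or, for the first-differences model, a $(T - 1) \times q$ matrix. Further stacking over the $N$ individuals yields

$$\tilde{\mathbf{y}} = \tilde{\mathbf{W}}\boldsymbol{\theta} + \tilde{\mathbf{u}}.$$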


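Three representations of the OLS estimator are therefore

$$\hat{\boldsymbol{\theta}}_{\mathrm{OLS}} = [\tilde{\mathbf{W}}'\tilde{\mathbf{W}}]^{-1}\tilde{\mathbf{W}}'\tilde{\mathbf{y}} = \left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1}\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{y}}_i = \left[\sum_{i=1}^N\sum_{t=1}^T \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\right]^{-1}\sum_{i=1}^N\sum_{t=1}^T \tilde{\mathbf{w}}_{it}\tilde{y}_{it},$$

where in the third expression the sum is from $t = 2$ to $T$ in the case of the first-differences estimator. The most convenient representation to use varies with the context.

To consider consistency, note that if the model is correctly specified then the usual algebra yields $\hat{\boldsymbol{\theta}}_{\mathrm{OLS}} = \boldsymbol{\theta} + [\tilde{\mathbf{W}}'\tilde{\mathbf{W}}]^{-1}\tilde{\mathbf{W}}'\tilde{\mathbf{u}}$ or

$$\hat{\boldsymbol{\theta}}_{\mathrm{OLS}} = \boldsymbol{\theta} + \left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1}\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{u}}_i.$$

Given independence over $i$ the essential condition for consistency is $\mathrm{E}[\tilde{\mathbf{W}}_i'\tilde{\mathbf{u}}_i] = \mathbf{0}$. This generally requires a stronger assumption than $\mathrm{E}[\tilde{u}_{it} \mid \tilde{\mathbf{w}}_{it}] = 0$. A sufficient assumption is that of strong exogeneity given in (21.4). See Chapter 22 for estimation under assumptions weaker than strong exogeneity that permit, for example, lagged dependent variables as regressors.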

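The asymptotic variance of $\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}$ is then

$$\mathrm{V}[\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}] = \left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1}\sum_{i=1}^N \tilde{\mathbf{W}}_i'\,\mathrm{E}[\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i' \mid \tilde{\mathbf{W}}_i]\,\tilde{\mathbf{W}}_i\left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1},$$

given independence of errors over $i$. Consistent estimation of $\mathrm{V}[\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}]$ in this panel setting is analogous to the cross-section problem of obtaining a consistent estimate of $\mathrm{V}[\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}]$ that is robust to heteroskedasticity of unknown form. The only complication is the appearance of a vector $\tilde{\mathbf{u}}_i$ rather than a scalar, which poses no problem if the panel is short as then the dimension of $\tilde{\mathbf{u}}_i$ is finite.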

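This leads to a panel-robust estimate of the asymptotic variance matrix of the pooled OLS estimator, one that controls for both serial correlation and heteroskedasticity, given by

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}] = \left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1}\sum_{i=1}^N \tilde{\mathbf{W}}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\tilde{\mathbf{W}}_i\left[\sum_{i=1}^N \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\right]^{-1}, \tag{21.13}$$

where $\hat{\mathbf{u}}_i = \tilde{\mathbf{y}}_i - \tilde{\mathbf{W}}_i\hat{\boldsymbol{\theta}}$. The estimator in (21.13) assumes independence over $i$ and $N \to \infty$, the case for short panels, but otherwise permits $\mathrm{V}[u_{it}]$ and $\mathrm{Cov}[u_{it}, u_{is}]$ to vary with $i$, $t$, and $s$. An equivalent expression is

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}_{\mathrm{OLS}}] = \left[\sum_{i=1}^N\sum_{t=1}^T \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\right]^{-1}\sum_{i=1}^N\sum_{t=1}^T\sum_{s=1}^T \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{is}'\hat{u}_{it}\hat{u}_{is}\left[\sum_{i=1}^N\sum_{t=1}^T \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\right]^{-1},$$

where $\hat{u}_{it} = \tilde{y}_{it} - \tilde{\mathbf{w}}_{it}'\hat{\boldsymbol{\theta}}$. This estimator was proposed by Arellano (1987) for the fixed effects estimator.

Panel-robust standard errors based on (21.13) can be computed by use of a regular OLS command, if the command has a cluster-robust standard error option (see Section 24.5.2). Since the clustering here is on the individual one selects the identifier for individual $i$ as the cluster variable. This method was used to obtain the panel-robust standard errors given in Table 24.1.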
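
For readers without a cluster-robust option to hand, the sandwich in (21.13) is short to code directly. The following is a minimal sketch for pooled OLS, assuming a balanced simulated panel with equicorrelated errors and illustrative parameter values; the function name `ols_panel_robust` is hypothetical and no finite-sample degrees-of-freedom correction is applied, so packaged routines may report slightly different numbers.

```python
import numpy as np

def ols_panel_robust(y, W, ids):
    """Pooled OLS with panel-robust and default (iid) variance estimates.

    y: (n,) outcomes; W: (n, q) regressors including the intercept;
    ids: (n,) identifier of the individual each row belongs to.
    """
    WtW_inv = np.linalg.inv(W.T @ W)
    theta = WtW_inv @ (W.T @ y)
    u = y - W @ theta                          # OLS residuals
    q = W.shape[1]
    meat = np.zeros((q, q))
    for g in np.unique(ids):                   # sum over individuals i
        rows = ids == g
        s = W[rows].T @ u[rows]                # W_i' u_i, a q-vector
        meat += np.outer(s, s)                 # W_i' u_i u_i' W_i
    V_robust = WtW_inv @ meat @ WtW_inv        # sandwich form, as in (21.13)
    V_default = WtW_inv * (u @ u) / (len(y) - q)   # assumes iid errors
    return theta, V_robust, V_default

rng = np.random.default_rng(2)
N, T = 500, 10                                 # illustrative sizes (assumption)
ids = np.repeat(np.arange(N), T)
x = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)
u = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)  # alpha_i + eps_it
y = 1.0 + 0.5 * x + u
W = np.column_stack([np.ones(N * T), x])

theta, V_robust, V_default = ols_panel_robust(y, W, ids)
se_robust = np.sqrt(np.diag(V_robust))
se_default = np.sqrt(np.diag(V_default))
print(theta[1], se_robust[1], se_default[1])
```

In this design both the regressor and the error have a persistent individual component, so the default standard error of the slope should be noticeably smaller than the panel-robust one.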

The term "robust" standard error can cause confusion. A common error made in pooled regression is to estimate the OLS regression (21.12) using the standard robust standard error option (see Section 4.4.5). However, this only adjusts for heteroskedasticity, and in practice in a panel setting it is much more important to correct for the correlation in individual errors. Another common error, though one that has smaller impact, is to use cluster-robust standard errors that assume homoskedasticity so that $\mathrm{E}[\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i']$ is constant over $i$.
Panel Bootstrap Standard Errors

The bootstrap method provides an alternative way to obtain panel-robust standard errors. The key assumption is that observations are independent over $i$, so one does a bootstrap pairs procedure that resamples with replacement over $i$ and uses all observed time periods for a given individual. For data $\{(\mathbf{y}_i, \mathbf{X}_i), i = 1, \ldots, N\}$ this yields $B$ pseudo-samples and for each pseudo-sample one performs OLS regression of $\tilde{y}_{it}$ on $\tilde{\mathbf{w}}_{it}$, yielding $B$ estimates $\hat{\boldsymbol{\theta}}_b$, $b = 1, \ldots, B$.

The panel bootstrap estimate of the variance matrix is then

$$\hat{\mathrm{V}}_{\mathrm{Boot}}[\hat{\boldsymbol{\theta}}] = \frac{1}{B-1}\sum_{b=1}^B (\hat{\boldsymbol{\theta}}_b - \bar{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}}_b - \bar{\boldsymbol{\theta}})', \tag{21.14}$$

where $\bar{\boldsymbol{\theta}} = B^{-1}\sum_b \hat{\boldsymbol{\theta}}_b$. This bootstrap provides no asymptotic refinement (see Section 11.2.2). Given independence over $i$ the estimate is consistent as $N \to \infty$. It is asymptotically equivalent to the estimate (21.13), just as in the cross-section case bootstrap pairs are asymptotically equivalent to White's heteroskedastic consistent estimate. A bootstrap with asymptotic refinement is also possible (see Section 11.6.2).

This bootstrap method can be applied to any panel estimator that relies on independence over $i$ and $N \to \infty$, including the pooled feasible GLS estimators of Section 21.5.2 for short panels. The key is to resample over $i$ only, and not over both $i$ and $t$.
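
The resampling scheme can be sketched as follows for the slope of a pooled OLS regression with one regressor. This is an illustration under assumed values (a balanced simulated panel, $B = 200$ replications, equicorrelated errors); the function name `panel_bootstrap_se` is hypothetical. The essential step is that whole rows, that is, all $T$ periods of a drawn individual, enter each pseudo-sample together.

```python
import numpy as np

def panel_bootstrap_se(y, x, B=200, seed=0):
    """Panel bootstrap standard error of the pooled OLS slope.

    y, x: (N, T) arrays, one row per individual. Individuals are resampled
    with replacement; all T periods of a drawn individual are kept.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    slopes = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)       # resample over i only
        xb, yb = x[idx].ravel(), y[idx].ravel()
        xc = xb - xb.mean()
        slopes[b] = (xc * (yb - yb.mean())).sum() / (xc**2).sum()
    # Square root of the variance formula with divisor B - 1
    return float(slopes.std(ddof=1))

rng = np.random.default_rng(3)
N, T = 500, 10                                 # illustrative sizes (assumption)
x = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
y = 1.0 + 0.5 * x + rng.normal(size=(N, 1)) + rng.normal(size=(N, T))

se_boot = panel_bootstrap_se(y, x, B=200)
print(se_boot)
```

Resampling individual $(i, t)$ observations instead would break the within-individual dependence and reproduce the understated default standard errors.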


Discussion 

The importance of correcting standard errors for serial correlation in errors at the individual level cannot be overemphasized. Computer packages currently do not automatically do this. Bertrand, Duflo, and Mullainathan (2004) illustrate the resulting downward bias in standard error computation, in the context of difference-in-differences estimation (see Section 22.6). They find that the panel-robust and panel bootstrap methods work well, even though in their application with state-year data $N$ (the number of states) is relatively small whereas the asymptotic theory uses $N \to \infty$.

The following example (see Table 21.2) also shows the importance of correcting standard errors for serial correlation in the errors.

21.3.  Linear  Panel  Example:  Hours  and  Wages 

An important issue in labor economics is the responsiveness of labor supply to wages. The standard textbook model of labor supply suggests that for people already working the effect of a wage increase on labor supply is ambiguous, with an income effect pushing in the direction of less work offsetting a substitution effect in the direction of more work.

Cross-section analysis for adult males finds a relatively small positive response of hours worked to wages. However, it is possible that this association is spurious, merely reflecting a greater unobserved desire to work being positively associated with higher wages.


Panel data analysis can control for this, under the assumption that the unobserved desire to work is time-invariant. For example, the within estimator does so by measuring the extent to which an individual works above-average (or below-average) hours in periods with above-average (or below-average) wages.

The data on 532 males for each of the 10 years from 1979 to 1988 come from Ziliak (1997). The variable of interest is lnhrs, the natural logarithm of annual hours worked. The single explanatory variable is lnwg, the natural logarithm of hourly wage. We consider the regression model

$$\mathrm{lnhrs}_{it} = \alpha_i + \beta\,\mathrm{lnwg}_{it} + \varepsilon_{it},$$

where the individual-specific effect $\alpha_i$ is simplified to $\alpha$ in some models and $\beta$ measures the wage elasticity of labor supply. The error term $\varepsilon_{it}$ is assumed to be independent over $i$, but it may be correlated over $t$ for given $i$. As noted we expect $\beta$, the labor supply elasticity, to be small and positive.

Ziliak (1997) additionally included a quadratic in age, number of children, and an indicator variable for bad health. These regressors and year dummies make relatively small difference to the estimate of $\beta$ and its standard error, and for simplicity they are omitted here. In Chapter 22 we consider more general models that permit lnwg to be endogenous and permit lags of lnhrs to appear as a regressor.


21.3.1.  Data  Summary 


For the 5,320 observations, the sample means of lnhrs and lnwg are respectively 7.66 and 2.61, implying geometric means of 2,120 hours and $13.60 per hour. The sample standard deviations are respectively 0.29 and 0.43, indicating considerably greater variability in percentage terms in wages than in hours.

For panel data it is useful to know whether variability is mostly across individuals or across time. The total variation of a series $x_{it}$ around its grand mean $\bar{x}$ can be decomposed as

$$\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x})^2 = \sum_{i=1}^N\sum_{t=1}^T [(x_{it} - \bar{x}_i) + (\bar{x}_i - \bar{x})]^2 = \sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)^2 + \sum_{i=1}^N\sum_{t=1}^T (\bar{x}_i - \bar{x})^2,$$

as the cross-product term sums to zero. In words, the total sum of squares equals the within sum of squares plus the between sum of squares. This leads to within standard deviation $s_W$ and between standard deviation $s_B$, where

$$s_W^2 = \frac{1}{NT - N}\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)^2 \quad \text{and} \quad s_B^2 = \frac{1}{N - 1}\sum_{i=1}^N (\bar{x}_i - \bar{x})^2.$$
Table 21.2. Hours and Wages: Standard Linear Panel Model Estimators^a

              POLS      Between   Within    First Diff   RE-GLS    RE-MLE
alpha         7.442     7.483     7.220     .001         7.346     7.346
beta          .083      .067      .168      .109         .119      .120
Robust se^b   (.030)    (.024)    (.085)    (.084)       (.051)    (.052)
Boot se       [.030]    [.019]    [.084]    [.083]       [.056]    [.058]
Default se    {.009}    {.020}    {.019}    {.021}       {.014}    {.014}
R2            .015      .021      .016      .008         .014      .014
RMSE          .283      .177      .233      .296         .233      .233
RSS           427.225   0.363     259.398   417.944      288.860   288.612
TSS           433.831   17.015    263.677   420.223      293.023   292.773
sigma_alpha   .000                .181                   .161      .162
sigma_eps     .283                .232                   .233      .233
lambda        0.000     -         1.000     -            .585      .586
N             5320      532       5320      4788         5320      5320

a Shown are pooled OLS (POLS), between, within, first-differences, random effects (RE) GLS and MLE linear panel regression of lnhrs on lnwg. Standard errors for the slope coefficients are panel robust in parentheses, panel bootstrap in square brackets, and default estimates that assume iid errors in curly braces. The R2, root mean square error (RMSE), residual sum of squares (RSS), total sum of squares (TSS), and sample size come from the appropriate regression given in Section 21.2. The parameter lambda is defined after (21.11).
b se, standard error.


The within and between sample standard deviations are, respectively, 0.22 and 0.18 for lnhrs and 0.19 and 0.39 for lnwg. The larger total variation in wages compared to hours is therefore due to between-individual variation being much higher for wages. Within individuals the variation is actually somewhat smaller for wages than it is for hours.
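
The decomposition is simple to compute for any balanced panel stored as an $N \times T$ array. The sketch below is an illustration on simulated data rather than the Ziliak (1997) data: the component standard deviations 0.39 and 0.19 are chosen only to resemble the lnwg figures quoted above, and the function name `within_between_sd` is hypothetical.

```python
import numpy as np

def within_between_sd(x):
    """Within/between decomposition for a balanced (N, T) panel.

    Returns (s_W, s_B, SS_total, SS_within, SS_between), where
    s_W^2 = SS_within / (NT - N) and s_B^2 = sum_i (xbar_i - xbar)^2 / (N - 1).
    """
    N, T = x.shape
    xbar_i = x.mean(axis=1, keepdims=True)     # individual means
    xbar = x.mean()                            # grand mean
    ss_within = ((x - xbar_i) ** 2).sum()
    dev_between = ((xbar_i - xbar) ** 2).sum()
    ss_between = T * dev_between               # each individual mean counted T times
    ss_total = ((x - xbar) ** 2).sum()
    s_w = np.sqrt(ss_within / (N * T - N))
    s_b = np.sqrt(dev_between / (N - 1))
    return s_w, s_b, ss_total, ss_within, ss_between

rng = np.random.default_rng(5)
N, T = 532, 10
# Simulated series: persistent individual component plus transitory noise
x = 2.61 + rng.normal(0, 0.39, size=(N, 1)) + rng.normal(0, 0.19, size=(N, T))

s_w, s_b, ss_total, ss_within, ss_between = within_between_sd(x)
print(s_w, s_b, ss_total, ss_within + ss_between)
```

The last two printed numbers agree, confirming that the cross-product term vanishes. Note that $s_B$ slightly exceeds the standard deviation of the individual component because each individual mean also carries averaged transitory noise.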


21.3.2.  Comparison  of  Panel  Data  Estimators 

Table 21.2 summarizes results from application of the standard panel estimators defined in Section 21.2.2 to these data, along with three different estimates of the standard errors. As detailed in the following, statistical inference should use either the panel-robust standard error or the panel bootstrap standard error.

Slope  Parameter  Estimates 

The estimate of the slope parameter differs across the different estimation methods. The between estimate that uses only cross-section variation is less than the pooled OLS estimate. The within or fixed effects estimate of 0.168 is much higher than the pooled OLS estimate of 0.083 and is borderline statistically significant using a two-tailed test at 5% and standard error estimate of 0.084 or 0.085. The first-differences estimate of 0.109 is also higher than that of pooled OLS but is considerably less than the within estimate, which also uses only time-series variation. The RE estimates of 0.119 or 0.120 lie between the between and within estimates. This is expected, as RE estimates can be shown to be a weighted average of between and within estimates. The two RE estimates are very close to each other as here the estimates of the variances $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$ are similar, leading to similar values $\hat{\lambda} = 0.585$ and $\hat{\lambda} = 0.586$ in the regression (21.10). The RE estimates are surprisingly less efficient than the pooled OLS estimates, a sign that the RE model fails to model the error correlation well.

Which  estimates  are  preferred?  The  within  and  first-difference  estimators  are  con¬ 
sistent  under  all  models  (pooled,  RE,  and  FE)  whereas  the  other  estimators  are  in¬ 
consistent  under  the  fixed  effects  model.  The  most  robust  estimates  are  therefore  the 
within  or  first-differences  estimates  of  0.168  or  0.109. 

There is, however, an efficiency loss in using these more robust estimators, with
standard errors of 0.083 to 0.085 that are much larger than those from pooled OLS and
RE estimates. A formal Hausman test (see Section 21.4.3 for details and discussion)
can be used to test whether or not the individual effects are fixed. Given the relative
imprecision of estimation in this example, the Hausman test does not reject the null
hypothesis of random effects, despite the large difference between FE and RE
estimates. So the more efficient random effects estimates could be used here. Another
advantage of random effects estimation is that it permits estimation of the coefficients
of time-invariant regressors.


Standard  Error  Estimation 

We now turn to comparison of the standard error estimates. From Section 21.2.3,
inference should be based on panel-robust standard errors that permit errors to be
correlated over time for a given individual and to have variances and covariances that differ
across individuals. Also, as detailed in later sections, the standard errors for estimators
based on deviations from means, such as (21.8) and (21.10), need to account for loss
of N + K rather than K degrees of freedom.

The first standard error estimate is computed by the panel-robust method given in
(21.13), and the second is computed by the panel bootstrap given in (21.14) with 500
replications. For brevity these estimates are called panel robust, though they are
additionally robust to heteroskedasticity. The two estimates are very close, aside from the
random effects models where the panel-robust standard errors are underestimated
because they are computed for the regression (21.10), which ignores estimation error in
$\hat\lambda$.

The  third  standard  error  estimate  is  the  standard  default  computer  output  that  is 
based  on  the  assumption  of  iid  errors.  In  this  example  the  correctly  estimated  standard 
errors  are  a  remarkable  three  to  four  times  as  large  as  the  default  standard  errors.  The 
one  exception  is  the  between  estimator,  an  estimator  with  standard  errors  that  need 
only  correction  for  heteroskedasticity  since  it  uses  only  cross-section  variation. 

For example, for the pooled OLS estimator of $\beta$ the default standard error is 0.009,
leading to an incorrect t-statistic of 9.07. The panel-robust standard error is a much
larger 0.030, leading to a correct t-statistic of a much smaller 2.83. Default standard
errors assume independence of model errors over t for given i when in practice they
are likely to be positively correlated. This erroneous assumption overestimates the
benefit of additional time periods, leading to downward bias in standard errors (see
Section 21.5.4). Additionally, ignoring heteroskedasticity in errors also leads to bias,
though this bias could be in either direction. For these data a failure to control for
heteroskedasticity also imparts a large downward bias: The standard error of
$\hat\beta_{POLS}$ controlling for heteroskedasticity, but not for correlation over t for
given i, is 0.020. For other data, correction for heteroskedasticity is usually much less
important than the correction for panel correlation.

For the within and between estimators inclusion of the term $\alpha_i$ should control for
some of the correlation in the error across time for a given individual. For these data,
however, the differences between panel-robust and nonrobust standard errors remain
large, in part because of failure to additionally control for heteroskedasticity.

Clearly  panel-robust  standard  errors  should  be  used. 
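
The downward bias of default standard errors can be reproduced on simulated data. The following is a minimal sketch rather than the computation behind Table 21.2: it assumes a balanced panel with a single regressor, invented data in which both the regressor and the error contain a persistent individual component, and a hypothetical helper named `pooled_ols_se`. The cluster-robust formula used is the usual one-regressor sandwich without any small-sample correction, so it may differ in detail from (21.13).

```python
import random
from math import sqrt

def pooled_ols_se(y, x):
    """Pooled OLS slope with default (iid) and cluster-robust standard errors.

    y, x: lists of N lists, each of length T (balanced panel, one regressor).
    """
    N, T = len(y), len(y[0])
    n = N * T
    xbar = sum(map(sum, x)) / n
    ybar = sum(map(sum, y)) / n
    sxx = sum((x[i][t] - xbar) ** 2 for i in range(N) for t in range(T))
    sxy = sum((x[i][t] - xbar) * (y[i][t] - ybar)
              for i in range(N) for t in range(T))
    b = sxy / sxx
    a = ybar - b * xbar
    u = [[y[i][t] - a - b * x[i][t] for t in range(T)] for i in range(N)]
    # Default standard error: treats all NT errors as iid.
    s2 = sum(e * e for row in u for e in row) / (n - 2)
    se_default = sqrt(s2 / sxx)
    # Cluster-robust: sum the scores within each individual before squaring,
    # which allows arbitrary correlation over t for given i.
    meat = sum(sum((x[i][t] - xbar) * u[i][t] for t in range(T)) ** 2
               for i in range(N))
    se_cluster = sqrt(meat) / sxx
    return b, se_default, se_cluster

# Invented data: true slope 1, persistent regressor, equicorrelated error.
random.seed(42)
N, T = 500, 10
x, y = [], []
for _ in range(N):
    c = random.gauss(0, 1)   # persistent component of the regressor
    a = random.gauss(0, 1)   # individual effect, independent of the regressor
    xi = [c + 0.3 * random.gauss(0, 1) for _ in range(T)]
    yi = [1.0 * v + a + random.gauss(0, 1) for v in xi]
    x.append(xi)
    y.append(yi)

b, se_default, se_cluster = pooled_ols_se(y, x)
print(round(b, 3), round(se_default, 4), round(se_cluster, 4))
```

In this design the cluster-robust standard error should come out well above the default one, because the default formula counts the ten observations on each individual as if they were independent.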


21.3.3.  Graphical  Analysis 

It  is  insightful  to  perform  a  graphical  comparison  of  overall,  between,  and  fixed  effects 
(within  or  first-differences)  regressions.  Such  plots  are  rarely  performed  in  panel  data 
regression,  but  they  are  easily  applied  here  as  there  is  only  one  regressor. 

All  plots  include  a  nonparametric  regression  curve  using  the  Lowess  smoother  (see 
Section  9.6.2)  and  a  linear  regression  curve  that  corresponds  to  the  estimates  given  in 
Table  21.2. 

Figure 21.1 plots lnhrs against lnwg for all individuals in all years (5,320 observations).
The plot suggests a positive relationship, roughly linear except at the extreme ends,
and from Table 21.2 the line has slope 0.083 with a low $R^2 = 0.015$.

The between estimator (21.7) regresses $\bar y_i$ on $\bar x_i$. The corresponding plot for the
lnhrs-lnwg data is given in Figure 21.2 and again shows a positive relationship.

The within or fixed effects estimator (21.8) regresses $(y_{it} - \bar y_i)$ on $(x_{it} - \bar x_i)$.
Figure 21.3 gives the related plot of $(y_{it} - \bar y_i + \bar y)$ on $(x_{it} - \bar x_i + \bar x)$,
where $\bar y = N^{-1}\sum_i \bar y_i$ and $\bar x = N^{-1}\sum_i \bar x_i$ are the grand means of y and x.
Comparison with Figure 21.1 shows that differencing the individual mean leads to a
considerable decrease in the range of variability in lnwg, with less of a decrease in the
variability of lnhrs.


Figure 21.1: Hours and wages: pooled (overall) regression. Natural logarithm of annual
hours worked plotted against natural logarithm of hourly wage. Data for 532 U.S. males for
each of the ten years 1979-88.

Figure 21.2: Hours and wages: between regression. Ten-year average of log hours plotted
against ten-year average of log wage for 532 men. Same sample as Figure 21.1.

The slope does appear steeper than that for pooled OLS, and from Table 21.2 the slope
increased from 0.083 to 0.168.

The first-differences estimator (21.9) regresses $(y_{it} - y_{i,t-1})$ on $(x_{it} - x_{i,t-1})$. The
corresponding plot for the lnhrs-lnwg data is given in Figure 21.4. The figure is
qualitatively similar to Figure 21.3.

The  conclusion  of  the  preceding  analysis  is  that  there  is  greater  response  to  wage 
changes  using  time-series  variation  than  using  cross-section  variation. 
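
The four regressions compared in this section differ only in how the data are transformed before least squares is applied. The sketch below illustrates this for a single regressor in a balanced panel. The function names `slope` and `panel_slopes` and the toy data are invented for illustration and are not the lnhrs-lnwg data; the within and first-differences regressions are run without an intercept, which is an assumption of this sketch.

```python
def slope(xs, ys, const=True):
    """OLS slope of ys on xs; const=False omits the intercept."""
    n = len(xs)
    mx = sum(xs) / n if const else 0.0
    my = sum(ys) / n if const else 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def panel_slopes(y, x):
    """Pooled, between, within and first-differences slopes for a balanced
    panel stored as N lists of T observations (single regressor)."""
    N, T = len(y), len(y[0])
    xm = [sum(r) / T for r in x]   # individual means of x
    ym = [sum(r) / T for r in y]   # individual means of y
    it = [(i, t) for i in range(N) for t in range(T)]
    return {
        "pooled": slope([x[i][t] for i, t in it], [y[i][t] for i, t in it]),
        "between": slope(xm, ym),
        "within": slope([x[i][t] - xm[i] for i, t in it],
                        [y[i][t] - ym[i] for i, t in it], const=False),
        "fd": slope([x[i][t] - x[i][t - 1] for i, t in it if t > 0],
                    [y[i][t] - y[i][t - 1] for i, t in it if t > 0],
                    const=False),
    }

# Toy data: y_it = alpha_i + 0.5 * x_it, with alpha_i falling as the
# individual's average x rises, so cross-section variation is misleading.
x = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
alpha = [0.0, -2.0, -4.0]
y = [[alpha[i] + 0.5 * v for v in x[i]] for i in range(3)]
est = panel_slopes(y, x)
print(est)
```

In this deliberately extreme example the within and first-differences slopes recover 0.5, the between slope is -1.5, and the pooled slope of -0.5 lies between the between and within slopes.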

21.3.4.  Residual  Analysis 

It is instructive to consider the autocorrelation patterns of the data and of residuals. For
example, for residuals $\hat u_{it} = y_{it} - \hat y_{it}$, the autocorrelation between period s and period
t is calculated as $\hat\rho_{st} = \hat c_{st}/\sqrt{\hat c_{ss}\hat c_{tt}}$, $s, t = 1, \ldots, T$, where the covariance estimate
$\hat c_{st} = (N - 1)^{-1}\sum_i (\hat u_{it} - \bar u_t)(\hat u_{is} - \bar u_s)$ and $\bar u_t = N^{-1}\sum_i \hat u_{it}$.
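
The autocorrelation formula translates directly into code. The following sketch assumes the residuals are held as N lists of length T; the function name `residual_autocorr` and the three-individual example are invented for illustration.

```python
from math import sqrt

def residual_autocorr(u):
    """Return the T x T matrix of residual autocorrelations rho_st.

    u: N lists of T residuals (balanced panel). Means and covariances are
    taken across individuals i for each pair of periods (s, t).
    """
    N, T = len(u), len(u[0])
    ubar = [sum(u[i][t] for i in range(N)) / N for t in range(T)]
    c = [[sum((u[i][s] - ubar[s]) * (u[i][t] - ubar[t]) for i in range(N)) / (N - 1)
          for t in range(T)] for s in range(T)]
    return [[c[s][t] / sqrt(c[s][s] * c[t][t]) for t in range(T)]
            for s in range(T)]

# Tiny check: period-2 residuals are exactly twice the period-1 residuals,
# so the autocorrelation between the two periods is 1.
u_demo = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
rho = residual_autocorr(u_demo)
print(rho)
```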


Figure 21.3: Hours and wages: within (fixed effects) regression. Deviation of log hours
from ten-year average plotted against deviation of log wage from ten-year average using
ten years of data for 532 men. Same sample as Figure 21.1.

Figure 21.4: Hours and wages: first differences regression. First difference of log hours
plotted against first difference of log wage using ten years of data for 532 men. Same
sample as Figure 21.1.


Table 21.3 gives the residual autocorrelations after pooled OLS regression of lnhrs
on lnwg. The autocorrelations generally lie between 0.2 and 0.4 for data two to nine
periods apart. The decay rate is very slow, and the autocorrelations appear closer to a
random effects model that assumes that $\mathrm{Cor}[u_{it}, u_{is}]$ is constant for $t \neq s$ than to an
AR(1) model that has exponential decay.

The autocorrelations for lnhrs before regression are very similar to those given in
Table 21.3, since $\hat u_{it} \simeq y_{it}$ as evident from the poor explanatory power of pooled OLS
with $R^2 = 0.015$. The autocorrelations for the regressor lnwg, not tabulated here, are
much higher, ranging from approximately 0.9 at one lag, to 0.7 at nine lags.

The correlations of the residuals from the within regression are given in Table 21.4.
If the original errors $\varepsilon_{it}$ in (21.3) are iid then it can be shown that the transformed
errors $\varepsilon_{it} - \bar\varepsilon_i$ have autocorrelations at all lags equal to $-1/(T - 1) = -0.11$. There
is some departure from this here, particularly for the first lag, which is always positive.


Table 21.3. Hours and Wages: Autocorrelations of Pooled OLS Residuals^a

             u79    u80    u81    u82    u83    u84    u85    u86    u87    u88
upols79     1.00
upols80      .33   1.00
upols81      .44    .40   1.00
upols82      .30    .31    .57   1.00
upols83      .21    .23    .37    .47   1.00
upols84      .20    .23    .32    .34    .64   1.00
upols85      .24    .32    .41    .35    .39    .58   1.00
upols86      .20    .19    .28    .25    .31    .35    .40   1.00
upols87      .20    .32    .33    .29    .31    .34    .39    .35   1.00
upols88      .16    .25    .30    .26    .21    .25    .34    .55    .53   1.00

^a Note: Autocorrelations of residuals are from pooled OLS regression of lnhrs on lnwg for 532 men in 10 years.
The autocorrelations die slowly.


Table 21.4. Hours and Wages: Autocorrelations of Within Regression Residuals^a

             u79    u80    u81    u82    u83    u84    u85    u86    u87    u88
ufe79       1.00
ufe80        .10   1.00
ufe81        .21    .08   1.00
ufe82        .00   -.04    .26   1.00
ufe83       -.26   -.27   -.21    .01   1.00
ufe84       -.26   -.27   -.30   -.20    .32   1.00
ufe85       -.18   -.10   -.11   -.17   -.16    .17   1.00
ufe86       -.19   -.25   -.26   -.27   -.17   -.14   -.08   1.00
ufe87       -.15   -.05   -.16   -.20   -.24   -.21   -.09   -.09   1.00
ufe88       -.17   -.11   -.14   -.18   -.38   -.31    .13    .24    .24   1.00

^a Autocorrelations of residuals are from within (fixed effects) regression of lnhrs on lnwg for 532 men in 10
years.


The correlations of the residuals from random effects regression are quite similar
to those for fixed effects given in Table 21.4. The correlations of residuals from
first-differences regression are qualitatively similar to the theoretical result that if the
original errors $\varepsilon_{it}$ in (21.3) are iid then the transformed errors $\varepsilon_{it} - \varepsilon_{i,t-1}$ have
autocorrelations of $-0.5$ at lag one and 0 at other lags.
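
Both theoretical benchmarks for iid errors, $-1/(T-1)$ after mean-differencing and $-0.5$ at lag one after first-differencing, can be checked by simulation. The sketch below uses invented standard normal errors with T = 10; the helper `corr` and all variable names are hypothetical, and the simulated values only approximate the theoretical ones.

```python
import random
from math import sqrt

def corr(a, b):
    """Sample correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    caa = sum((p - ma) ** 2 for p in a)
    cbb = sum((q - mb) ** 2 for q in b)
    return cab / sqrt(caa * cbb)

random.seed(12345)
N, T = 20000, 10
eps = [[random.gauss(0.0, 1.0) for _ in range(T)] for _ in range(N)]

# Within transformation: deviations from the individual mean.
dem = []
for row in eps:
    m = sum(row) / T
    dem.append([e - m for e in row])

# First-differences transformation.
fd = [[row[t] - row[t - 1] for t in range(1, T)] for row in eps]

within_lag1 = corr([r[0] for r in dem], [r[1] for r in dem])
within_lag5 = corr([r[0] for r in dem], [r[5] for r in dem])
fd_lag1 = corr([r[0] for r in fd], [r[1] for r in fd])
fd_lag2 = corr([r[0] for r in fd], [r[2] for r in fd])

print("within:", round(within_lag1, 3), round(within_lag5, 3),
      "theory:", round(-1 / (T - 1), 3))
print("first differences:", round(fd_lag1, 3), round(fd_lag2, 3),
      "theory: -0.5 and 0")
```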


21.4.  Fixed  Effects  versus  Random  Effects  Models 

The  fixed  effects  model  has  the  attraction  of  allowing  one  to  use  panel  data  to  establish 
causation  under  weaker  assumptions  (presented  in  Section  21.4.1)  than  those  needed 
to  establish  causation  with  cross-section  data  or  with  panel  data  models  without  fixed 
effects,  such  as  pooled  models  and  random  effects  models. 

In some studies causation is clear, so random effects may be appropriate. For
example, in a controlled experiment such as crop yield from different amounts of fertilizers
applied to different fields the causation is clear. In other cases it may be sufficient to
use a random effects analysis to measure the extent of correlation, with determination
of causation left to further research taking other approaches. The effect of smoking on
lung cancer is an example. Economists are unusual in instead preferring a fixed effects
approach, however, because of a desire to measure causation in spite of reliance on
observational data.

The fixed effects model has several practical weaknesses. Estimation of the
coefficient of any time-invariant regressor, such as an indicator variable for gender, is not
possible as it is absorbed into the individual-specific effect. Coefficients of
time-varying regressors are estimable, but these estimates may be very imprecise if most
of the variation in a regressor is cross sectional rather than over time. Prediction of the
conditional mean is not possible. Instead, only changes in the conditional mean caused
by changes in time-varying regressors can be predicted. Even coefficients of
time-varying regressors may be difficult or theoretically impossible to identify in nonlinear
models with fixed effects (see Chapter 23). For these reasons economists also use
random effects models, even if causal interpretation may then be unwarranted.


21.4.1.  Fixed  Effects  Example 

Consider the effect of computer use on wage. Several cross-sectional studies, most
notably those by Krueger (1993) and DiNardo and Pischke (1997), find that computer use
in a job is associated with substantially higher wages, even after controlling for many
determinants of the wage such as education, age, gender, industry, and occupation. As
emphasized by DiNardo and Pischke (1997) this does not necessarily imply
causation, if regressors are correlated with the error term owing to endogeneity or omitted
variables.

Specifically, we suppose that in the cross section

    $y_i = x_i'\beta + \alpha_i + \varepsilon_i$,

where y is the natural logarithm of wage, x is a vector of individual characteristics
including an indicator variable for computer use at work, and $\varepsilon$ is an error that is
assumed to be independent of x. The complication is the addition of the unobserved
variable $\alpha$, which is assumed to be correlated with computer use at work, and hence
with the observed regressors x, even though the components of x other than computer
use, such as occupation and education, may partly control for computer use at work.
Regression of y on x leads to omitted variables bias leading to inconsistent estimates
of $\beta$ as the combined error $(\alpha + \varepsilon)$ is correlated with x.

Panel data offer a way around this problem, if we assume that the unobserved
variable $\alpha_i$ is time-invariant. Then

    $y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}$,

where again $\varepsilon$ is uncorrelated with x and $\alpha$ is correlated with x. Differencing eliminates
$\alpha_i$ (see Section 21.2.2), permitting consistent estimation of $\beta$. For the computer use
example, the causative effect of computer use on wages is then measured by the
association between individual changes in wages and individual movements to or from a job
with a computer. Haisken-DeNew and Schmidt (1999) found no effect using German
panel data.

This fixed effects panel approach permits determination of causation under weaker
assumptions than those of cross-section analysis, but it still requires assumptions. The
key assumption is that the unobservables $\alpha_i$ are time-invariant, rather than being of
more general form $\alpha_{it}$. In the computer use example it is being assumed that an
individual's propensity to have a job with a computer may be endogenous, but the
unobserved component of the effect of this propensity $\alpha_i$ on wage is constant over time
once we control for observables $x_{it}$. Essentially the particular time periods in which an
individual's job does or does not involve use of a computer are assumed to be purely
random, once we control for time-invariant unobservable $\alpha_i$ and observable $x_{it}$.

A random effects or pooled panel approach does not have similar properties. It
instead assumes away the original concern that $\alpha$ is correlated with x, since it
assumes that $\alpha$ is iid $[0, \sigma_\alpha^2]$ and hence is uncorrelated with x. This leads to inconsistent
parameter estimates if in fact $\alpha$ is correlated with x, whereas the fixed effects estimator
is consistent if $\alpha$ is correlated with x, provided $\alpha$ is time-invariant.


21.4.2.  Conditional  versus  Marginal  Analysis 

Fixed effects estimation is a conditional analysis, measuring the effect of $x_{it}$ on $y_{it}$
controlling for the individual effect $\alpha_i$. Prediction is possible only for individuals in
the particular sample being used, and even then it is only possible if the panel is long
enough so that $\alpha_i$ can be consistently estimated. Random effects estimation is instead
an example of marginal analysis or population-averaged analysis, as the individual
effects are integrated out as iid random variables. The random effects estimators can
be applied outside the sample.

If  the  true  model  is  a  random  effects  model,  then  whether  to  perform  a  conditional 
or  marginal  analysis  will  vary  with  the  application.  If  analysis  is  for  a  random  sample 
of  countries  then  one  uses  random  effects,  but  if  one  is  intrinsically  interested  in  the 
particular  countries  in  the  sample  then  one  does  fixed  effects  estimation  even  though 
this  can  entail  a  loss  of  efficiency. 

If the true model instead has individual-specific effects correlated with regressors,
however, then a random effects analysis is no longer meaningful as the random effects
estimator is inconsistent. Instead, alternative estimators such as the fixed effects and
first-differences estimators are necessary. Because of the desire to determine causation,
microeconomic applications emphasize these latter estimators.


21.4.3.  Hausman  Test 

If individual effects are fixed the within estimator $\hat\beta_W$ is consistent whereas the random
effects estimator $\tilde\beta_{RE}$ is inconsistent. Here $\beta$ refers to the vector of coefficients of just
the time-varying regressors. One can therefore test whether fixed effects are present by
using a Hausman test of whether there is a statistically significant difference between
these estimators. Alternatively, any other pair of estimators with similar properties,
such as first differences versus pooled OLS, can be used.

A large value of the Hausman test statistic leads to rejection of the null hypothesis
that the individual-specific effects are uncorrelated with regressors and to the
conclusion that fixed effects are present. It may still be possible to avoid using a fixed effects
model. If regressors are correlated with individual-specific effects caused by omitted
variables, then one can add further regressors, either time varying or time-invariant,
and again perform a Hausman test in this larger model to see whether fixed effects are
still necessary. Even if such correlation persists it may be possible to estimate a random
effects model using instrumental variables methods (see Sections 22.4.3-22.4.4).


Computation  When  RE  Is  Fully  Efficient 

We begin by assuming that the true model is the random effects model (21.3) with $\alpha_i$
iid $[0, \sigma_\alpha^2]$ uncorrelated with regressors and error $\varepsilon_{it}$ iid $[0, \sigma_\varepsilon^2]$.
Then the estimator $\tilde\beta_{RE}$ is fully efficient, so from Section 8.3 the Hausman test
statistic simplifies to

    $H = (\tilde\beta_{1,RE} - \hat\beta_{1,W})' \left[ \hat V[\hat\beta_{1,W}] - \hat V[\tilde\beta_{1,RE}] \right]^{-1} (\tilde\beta_{1,RE} - \hat\beta_{1,W})$,

where $\beta_1$ denotes the subcomponent of $\beta$ corresponding to time-varying regressors
since only that component can be estimated by the within estimator. This test statistic
is asymptotically $\chi^2(\dim[\beta_1])$ distributed under the null hypothesis.

Hausman (1978) showed that an asymptotically equivalent version of this test is to
perform a Wald test of $\gamma = 0$ in the auxiliary OLS regression,

    $y_{it} - \hat\lambda\bar y_i = (1 - \hat\lambda)\mu + (x_{1it} - \hat\lambda\bar x_{1i})'\beta_1 + (x_{1it} - \bar x_{1i})'\gamma + v_{it}$,  (21.15)

where $x_{1it}$ denotes the time-varying regressors and $\hat\lambda$ is defined in (21.11) and only
the time-varying regressors are used. This algebraic result can be interpreted as
follows. The individual-specific effects model (21.10) implies that
$v_{it} = (1 - \hat\lambda)\alpha_i + (\varepsilon_{it} - \hat\lambda\bar\varepsilon_i)$. The random effects estimator is actually obtained by OLS estimation of
(21.15) with $\gamma = 0$ (see (21.10)). If instead the fixed effects specification is valid then
the error $v_{it}$ will be correlated with the regressors, via correlation of $\alpha_i$ with
regressors. This correlation leads to additional functions of the regressors, such as $(x_{it} - \bar x_i)$,
being statistically significant variables in (21.15).


Computation  When  RE  Is  Not  Fully  Efficient 

The simple form of the Hausman test is invalid if $\alpha_i$ or $\varepsilon_{it}$ are not iid, which is
more than likely given heteroskedasticity inherent in much microeconometrics data.
Then the RE estimator is not fully efficient under the null hypothesis so the
expression $\hat V[\hat\beta_W] - \hat V[\tilde\beta_{RE}]$ in the formula for H needs to be replaced by the more general
$\hat V[\tilde\beta_{RE} - \hat\beta_W]$ (see Section 8.3).

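For short panels this variance matrix can be consistently estimated by bootstrap
resampling over i (see Section 21.2.3). Then a panel-robust Hausman test statistic is

    $H_{Robust} = (\tilde\beta_{1,RE} - \hat\beta_{1,W})' \left[ \hat V_{Boot}[\tilde\beta_{1,RE} - \hat\beta_{1,W}] \right]^{-1} (\tilde\beta_{1,RE} - \hat\beta_{1,W})$,  (21.16)

where

    $\hat V_{Boot}[\tilde\beta_{1,RE} - \hat\beta_{1,W}] = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat\delta_b - \bar{\hat\delta})(\hat\delta_b - \bar{\hat\delta})'$,

b denotes the bth of B bootstrap replications (see Section 21.2.3), and $\hat\delta = \tilde\beta_{1,RE} - \hat\beta_{1,W}$.
This test statistic can be applied to subcomponents of $\beta_1$, and can use alternative
estimators such as $\hat\beta_{1,POLS}$ in place of $\tilde\beta_{1,RE}$ and $\hat\beta_{1,FD}$ in place of $\hat\beta_{1,W}$.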
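
The bootstrap version of the test is straightforward to code when there is a single time-varying regressor, since the statistic reduces to a squared difference divided by a bootstrap variance. The sketch below is illustrative only: it uses pooled OLS in place of the RE estimator, which the text allows as an alternative pair, and it relies on invented data in which the individual effect is correlated with the regressor. The function names and the choice B = 200 are assumptions of this sketch.

```python
import random

def pols_slope(y, x):
    """Pooled OLS slope (with intercept), single regressor."""
    xs = [v for r in x for v in r]
    ys = [v for r in y for v in r]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))

def within_slope(y, x):
    """Within (fixed effects) slope: OLS on deviations from individual means."""
    sxy = sxx = 0.0
    for yi, xi in zip(y, x):
        mx, my = sum(xi) / len(xi), sum(yi) / len(yi)
        sxy += sum((a - mx) * (b - my) for a, b in zip(xi, yi))
        sxx += sum((a - mx) ** 2 for a in xi)
    return sxy / sxx

def robust_hausman(y, x, B=200, seed=0):
    """Scalar panel-robust Hausman statistic comparing POLS and within."""
    rng = random.Random(seed)
    N = len(y)
    delta_hat = pols_slope(y, x) - within_slope(y, x)
    deltas = []
    for _ in range(B):
        # Resample individuals with replacement, keeping all T periods together.
        idx = [rng.randrange(N) for _ in range(N)]
        yb, xb = [y[i] for i in idx], [x[i] for i in idx]
        deltas.append(pols_slope(yb, xb) - within_slope(yb, xb))
    dbar = sum(deltas) / B
    v_boot = sum((d - dbar) ** 2 for d in deltas) / (B - 1)
    return delta_hat ** 2 / v_boot

# Invented data: the individual effect 2*c_i is correlated with x_it.
rng = random.Random(1)
N, T = 300, 5
c = [rng.gauss(0, 1) for _ in range(N)]
x = [[c[i] + rng.gauss(0, 1) for _ in range(T)] for i in range(N)]
y = [[2.0 * c[i] + 1.0 * x[i][t] + rng.gauss(0, 1) for t in range(T)]
     for i in range(N)]

H_fe = robust_hausman(y, x)
print("H =", round(H_fe, 1), "; 5% critical value of chi-square(1) is 3.84")
```

With effects this strongly correlated with the regressor the statistic should far exceed 3.84, so the null hypothesis of uncorrelated effects is rejected.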

Alternatively, Wooldridge (2002) suggests estimating the auxiliary OLS regression
(21.15) and testing $\gamma = 0$ using panel-robust standard errors. If the effects are random,
though not necessarily such that $\alpha_i$ and $\varepsilon_{it}$ are iid, then
$v_{it} = (1 - \hat\lambda)\alpha_i + (\varepsilon_{it} - \hat\lambda\bar\varepsilon_i)$ is
still uncorrelated with regressors though $v_{it}$ is no longer asymptotically iid, so
cluster-robust standard errors need to be used. If the effects are fixed then the error $v_{it}$ is
correlated with the regressors, leading to significance of additional functions of the
regressors such as $(x_{it} - \bar x_i)$. This robust version of the auxiliary regression for the
Hausman test is preferred to one that assumes $v_{it}$ is asymptotically iid, on the usual
grounds of minimizing distributional assumptions. However, it is not clear whether
this test actually coincides with the Hausman test when RE is inefficient.


Hausman  Test  Example 

For the lnhrs-lnwg example estimates given in Table 21.2, a comparison of FE
and RE estimates using the default standard errors yields
$H \simeq (0.168 - 0.119)^2/(0.019^2 - 0.014^2)$. This leads to $H = 14 > \chi^2_{.05}(1) = 3.84$, so the random effects
model is rejected.

This test is not appropriate, however. The statistic H is inflated because the usual
standard errors in this example are greatly downward biased (see Section 21.3.2).
Furthermore, this bias is a signal that the RE estimator is not fully efficient under $H_0$, so
that the more general form of the Hausman test needs to be used.

The auxiliary regression (21.15) yields a panel-robust t-statistic for $\gamma$ of 1.28 and
hence $H^* = 1.28^2 = 1.65$, leading to nonrejection of the random effects model at 5%.
Even though the wage elasticity estimates differ by 0.049, the estimates are sufficiently
imprecise that the difference is not statistically significant. Note that if the nonrobust
t-statistic for $\gamma$ is used instead, then $t^2 = 13.69$, close to the previous incorrect Hausman
test statistic.
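
The arithmetic of this example can be reproduced in a few lines. The sketch below uses only the figures quoted above; the helper `chi2_1_pvalue` is a hypothetical name and relies on the fact that a chi-square(1) variate is a squared standard normal. Because the inputs are rounded, the results differ slightly from the reported values: the default statistic comes out near 14.6 rather than 14, and $1.28^2$ is about 1.64 rather than 1.65.

```python
from math import erfc, sqrt

def chi2_1_pvalue(h):
    """Upper-tail p-value for chi-square(1): P[Z^2 > h] = erfc(sqrt(h / 2))."""
    return erfc(sqrt(h / 2.0))

# Estimates and default standard errors quoted in the text (Table 21.2).
b_fe, b_re = 0.168, 0.119
se_fe, se_re = 0.019, 0.014

# Simple Hausman statistic, valid only if RE is fully efficient.
H_default = (b_fe - b_re) ** 2 / (se_fe ** 2 - se_re ** 2)

# Robust version: square of the panel-robust t-statistic on gamma.
H_robust = 1.28 ** 2

crit = 3.84  # 5% critical value of chi-square(1)
print(round(H_default, 2), H_default > crit, chi2_1_pvalue(H_default))
print(round(H_robust, 2), H_robust > crit, chi2_1_pvalue(H_robust))
```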


21.4.4.  Richer  Models  for  Random  Effects 

The random effects model specifies that the random effect $\alpha_i$ is distributed
independently of regressors. Richer models, closer in spirit to fixed effects models, relax this
assumption.

Mundlak (1978) allowed individual effects in the panel model (21.3) to be
determined by time averages of the regressors, so that $\alpha_i = \bar x_i'\pi + w_i$, where $w_i$ is iid.
Then efficient GLS estimation of $\beta$ and $\pi$ in this expanded model leads to an
estimator of $\beta$ that equals the fixed effects estimator in model (21.3). By contrast the usual
random effects estimator of $\beta$ in model (21.3) that erroneously specifies iid random
effects will be inconsistent.

Chamberlain (1982, 1984) considered an even richer model for the random effects,
with $\alpha_i = x_{i1}'\pi_1 + \cdots + x_{iT}'\pi_T + w_i$, a weighted sum of the regressors. He proposed
estimation by minimum distance methods (see Section 22.2.7 for details), leading to
an estimator of $\beta$ that equals the fixed effects estimator.

More  generally,  mixed  linear  models  and  hierarchical  linear  models  of  Section  24.6 
permit  quite  general  models  for  random  intercepts  and  also  random  slope  parameters. 
Bayesian  analysis  of  panel  data  also  uses  this  framework.  See  Section  22.8  for  details. 

In linear models the fixed effects approach is used if the unobserved individual
effect is correlated with regressors. In more complicated models, such as nonlinear
models, fixed effects models are not always estimable and richer random effects
models provide an alternative approach.


21.5.  Pooled  Models

The pooled cross-section time-series model or constant-coefficients model is

    $y_{it} = \alpha + x_{it}'\beta + u_{it}$.  (21.17)

In the statistics literature the model is called a population-averaged model, as there
is no explicit model of $y_{it}$ conditional on individual effects. Instead, any individual
effects have implicitly been averaged out. The random effects model is a special case
where the error is equicorrelated over t for given i (see Section 21.2.1).

The main complication for statistical inference, assuming no fixed effects, is that
the distribution of least-squares estimators of this model varies with the assumed
distribution of $u_{it}$. In short panels, panel-robust standard errors can be obtained using
(21.13).

Here  we  instead  focus  on  GLS  estimation  using  many  of  the  different  specifications, 
including  equicorrelation,  for  the  covariance  structure  of  uit  over  time  and  individuals 
that  have  been  proposed  in  the  literature. 

Although we focus on pooled GLS estimation of (21.17), a model without
individual-specific fixed effects, the methods of this section can be applied more
generally to pooled GLS estimation of the transformed model (21.12) of Section 21.2.3.


21.5.1.  Pooled  OLS,  FGLS,  and  WLS  Estimators 

It is convenient to use matrix notation. Combining observations over time for a given
individual, define

    $y_i = W_i\delta + u_i$,  (21.18)

where $\delta = [\alpha \;\; \beta']'$ is a $(K + 1) \times 1$ parameter vector, $y_i$ and $u_i$ are $T \times 1$ vectors
with tth entries $y_{it}$ and $u_{it}$, respectively, and $W_i$ is a $T \times (K + 1)$ matrix with tth row
$w_{it}' = [1 \;\; x_{it}']$. Stacking all individuals yields

    $y = W\delta + u$,  (21.19)

where y and u are $NT \times 1$ vectors, for example $y = [y_1' \ldots y_N']'$, and W is an
$NT \times (K + 1)$ regressor matrix whose first column is a vector of ones. We assume
that $E[u|W] = 0$, so errors are strictly exogenous, and define $\Omega = E[uu'|W]$.

There  are  several  possible  least-squares  estimators  of  this  model,  summarized  in 
Table  21.5. 

First, pooled OLS is consistent and asymptotically normal. However, in a panel
setting it is unlikely that $\Omega = \sigma^2 I_{NT}$, so OLS is inefficient except in some special
cases such as when all regressors are time-invariant. More importantly, the usual OLS
variance estimate of $\sigma^2(W'W)^{-1}$ should not be used and a panel-robust estimate such
as that in (21.13) needs to be used.

Second, pooled feasible GLS (PFGLS) is consistent and fully efficient if $\Omega$ is
correctly specified and $\hat\Omega$ is consistent for $\Omega$. Some of the very large range of structures
on $u_{it}$ and hence $\Omega$ that have been proposed in the panel literature and incorporated
into regression packages are given in Sections 21.5.2 and 21.5.3 for, respectively, short
and long panels.


Table 21.5. Pooled Least-Squares Estimators and Their Asymptotic Variances

Estimator                            Formula^a                                              Variance Matrix^b
Pooled OLS: $\hat\delta_{POLS}$      $(W'W)^{-1}W'y$                                        $(W'W)^{-1}W'\hat\Omega W(W'W)^{-1}$
Pooled FGLS: $\hat\delta_{PFGLS}$    $(W'\hat\Omega^{-1}W)^{-1}W'\hat\Omega^{-1}y$          $(W'\hat\Omega^{-1}W)^{-1}$
Pooled WLS: $\hat\delta_{PWLS}$      $(W'\hat\Sigma^{-1}W)^{-1}W'\hat\Sigma^{-1}y$          $(W'\hat\Sigma^{-1}W)^{-1}W'\hat\Sigma^{-1}\hat\Omega\hat\Sigma^{-1}W(W'\hat\Sigma^{-1}W)^{-1}$

^a The formulas are for the model $y = W\delta + u$ defined in (21.19) and error matrix $\Omega$.
^b For computation of $\hat\Omega$ for the variance matrices of POLS and PWLS see the text; in those cases $\hat\Omega$
need not be consistent for $\Omega$. For pooled FGLS it is assumed that $\hat\Omega$ is consistent for $\Omega$.

Third, the pooled weighted LS (PWLS) estimator guards against misspecification
of $\Omega$. It posits a working matrix $\Sigma$ for the error variance matrix $\Omega$ but then
performs inference that is valid even if $\Sigma \neq \Omega$. Ordinary least squares is an example,
with $\Sigma = \sigma^2 I_{NT}$, but other choices of $\Sigma$ may improve efficiency.

Estimation of the variance matrix of the pooled OLS estimator requires an $\hat\Omega$ such
that $(NT)^{-1}W'\hat\Omega W$ consistently estimates $(NT)^{-1}W'\Omega W$.

For short panels this is possible by direct application of the results of Section 21.2.3.
Estimation of the variance matrix of the pooled WLS estimator requires an $\hat\Omega$ such that
$(NT)^{-1}W'\hat\Sigma^{-1}\hat\Omega\hat\Sigma^{-1}W$ consistently estimates $(NT)^{-1}W'\Sigma^{-1}\Omega\Sigma^{-1}W$. The
panel-robust estimate for OLS given in (21.13) can be adapted to pooled WLS by replacing
$W'\Sigma^{-1}\Omega\Sigma^{-1}W$, or equivalently $\sum_i W_i'\Sigma_i^{-1}E[u_iu_i'|W_i]\Sigma_i^{-1}W_i$ given independence
over i, by the quantity $\sum_i W_i'\hat\Sigma_i^{-1}\hat u_i\hat u_i'\hat\Sigma_i^{-1}W_i$, where $\hat u_i = y_i - W_i\hat\delta$. Alternatively,
a panel bootstrap can be used.


21.5.2.  Error  Variance  Matrix  for  Short  Panels 


In short panels there are few time periods but many individuals, usually people or firms. It is assumed that errors are independent over individuals, so that $\mathrm{Cov}[u_{it}, u_{js}] = 0$, $i \neq j$. In such cases it is convenient to revert to summation notation. For example, the PFGLS estimator given in Table 21.5 becomes

$$\hat{\delta}_{PFGLS} = \left[\sum_{i=1}^{N} W_i'\hat{\Omega}_i^{-1}W_i\right]^{-1}\sum_{i=1}^{N} W_i'\hat{\Omega}_i^{-1}y_i, \tag{21.20}$$


where $\hat{\Omega}_i$ is consistent for

$$\Omega_i = E[u_i u_i' | W_i], \tag{21.21}$$

and $\Omega_i$ is nondiagonal as errors for a given individual are likely to be correlated over time. Note that $\hat{\Omega}_i$ needs to come from estimation of a specified model for $\Omega_i$, and we cannot use $\hat{\Omega}_i = \hat{u}_i\hat{u}_i'$ (see the related discussion after equation (5.88)).

Equicorrelated Errors

The most commonly used error structure is the random effects model presented in Section 21.2.1. Then from (21.6) $\Omega_i$ has common diagonal entries $\sigma_\alpha^2 + \sigma_\varepsilon^2$ and common off-diagonal entries $\sigma_\alpha^2$. Equivalently, the errors are equicorrelated, with $\Omega_i$ having common diagonal entries $\sigma^2$ and common off-diagonal entries $\rho\sigma^2$. Implementation of FGLS requires only estimation of $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$, or of $\sigma^2$ and $\rho$ (see Sections 21.2.2 and 21.7).


ARMA  Errors 

An alternative error structure is to assume an ARMA error model. For example, an AR(1) error model specifies that $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it}$ are iid. Then $\mathrm{Cov}[u_{it}, u_{is}] = \rho^{|t-s|}\sigma^2$. In this case the covariance between errors falls as the number of time periods between the errors increases. The RE model and an AR(1) error model are compared in Section 21.5.4.

Baltagi and Li (1991) combine the two error models to consider a random effects model with AR(1) errors. This can be easily generalized to the AR($p$) case, and methods for moving average and ARMA errors (see Section 5.8.7) in random effects models have also been developed more recently. A summary is given in Baltagi (2001, Chapter 5).


Homoskedastic  Errors  with  Unstructured  Autocorrelation 

For FGLS estimation in short panels there is actually no need to impose as much structure as that imposed by an RE model or an AR(1) error model, if the assumption is made that the $T \times T$ matrix $\Omega_i$ is constant over $i$. Then there are “only” $T(T+1)/2$ covariance parameters to estimate. A consistent estimate of $\Omega_i$ is then $\hat{\Omega}$ with $(t, s)$th entry $\hat{\sigma}_{ts} = N^{-1}\sum_{i=1}^{N}\hat{u}_{it}\hat{u}_{is}$. The preceding models also assume homoskedasticity, but place additional structure on $\Omega_i$.
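The unstructured approach can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `pooled_fgls_unstructured`, the array layout (`y` of shape N x T, `W` of shape N x T x K), and the simulated equicorrelated data are all hypothetical and not taken from the text. The residuals come from a first-stage pooled OLS fit.

```python
import numpy as np

def pooled_fgls_unstructured(y, W):
    """Pooled FGLS with an unstructured T x T error covariance.

    y has shape (N, T); W has shape (N, T, K) and includes a constant.
    """
    N, T, K = W.shape
    # Step 1: pooled OLS residuals u_it.
    delta_ols = np.linalg.lstsq(W.reshape(N * T, K), y.reshape(N * T), rcond=None)[0]
    u = y - W @ delta_ols                       # shape (N, T)
    # Step 2: (t, s) entry is N^{-1} * sum_i u_it * u_is.
    omega = u.T @ u / N
    omega_inv = np.linalg.inv(omega)
    # Step 3: summation form of the FGLS estimator, as in (21.20).
    A = np.einsum("itk,ts,isl->kl", W, omega_inv, W)
    b = np.einsum("itk,ts,is->k", W, omega_inv, y)
    return np.linalg.solve(A, b), omega

# Simulated short panel with equicorrelated errors (illustrative only).
rng = np.random.default_rng(0)
N, T = 2000, 4
x = rng.normal(size=(N, T))
u = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
y = 1.0 + 0.5 * x + u
W = np.stack([np.ones((N, T)), x], axis=2)
delta_hat, omega_hat = pooled_fgls_unstructured(y, W)
print(delta_hat)
print(np.round(omega_hat, 2))
```

With this data-generating process the estimated covariance matrix should show roughly equal off-diagonal entries, which is what an RE structure would have imposed directly.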


Robust  Inference 

All of the preceding specifications assume that error covariances are the same across individuals, which rules out heteroskedasticity. Provided the panel is short one can nonetheless use the preceding restrictive error variance matrix models as the basis for pooled WLS estimation, but then obtain robust standard errors as discussed after Table 21.5. Alternatively, richer mixed models, presented in Chapter 22, can be estimated.

The  assumption  of  independence  over  i  is  maintained  throughout  Chapters  21-23, 
though  it  can  be  relaxed  even  for  small  T  provided  structure  can  be  placed  on  the 
correlation.  An  example  is  an  explicit  model  for  spatial  correlation  for  panel  data  on 
regions  such  as  states  or  countries,  with  correlations  declining  as  physical  distance 
between individual observations increases.

21.5.3. Error Variance Matrix for Long Panels

In long panels there are many time periods but relatively few individuals. Such data can arise in microeconometrics analysis if the individual observational unit is one of only a few regions, such as a state or country, or firms, but these are observed over enough time periods to base inference on the assumption that $T \to \infty$.

Correlation across time for a given individual can be introduced using an ARMA model for the errors, with the parameters of the ARMA model permitted to differ across individuals as now $N$ is fixed and $T \to \infty$. For example, consider an AR(1) error with $u_{it} = \rho_i u_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it} \sim [0, \sigma_i^2]$ is heteroskedastic and $\rho_i$ also differs across individuals. Separate regressions of $y_{it}$ on $w_{it}$ with AR(1) errors for each individual using $T$ time periods yield consistent estimates $\hat{\rho}_i$ and $\hat{\sigma}_i^2$, since $T \to \infty$. These can then be used for feasible GLS estimation of $\delta$ using all $NT$ observations. For details see Kmenta (1986). This model permits both heteroskedasticity across individuals and correlation over time for a given individual. Pesaran (2004) proposes a considerably richer model that is estimated by GLS.

For long panels it is possible to introduce correlation across individuals, so that $\mathrm{Cov}[u_{it}, u_{jt}] \neq 0$ for $i \neq j$, since $N$ is fixed and asymptotic results rely on $T \to \infty$. In particular, one can perform pooled GLS estimation as done earlier, with the assumption of independence across individuals, but then calculate standard errors using the method of Newey and West (1987b), mentioned briefly in Section 6.4.4, that permits arbitrary cross-sectional dependence and serial dependence, provided the serial dependence dies away sufficiently fast. For details see Arellano (2003, p. 19).

Time-series  considerations  for  panel  data  are  discussed  in  more  detail  in  Section 
22.5  for  models  with  lagged  dependent  variables  as  regressors. 


21.5.4.  The  Impact  of  Autocorrelated  Errors 

Panel  data  regression  models  have  errors  that  are  usually  autocorrelated  over  time 
for  a  given  individual.  If  fixed  effects  are  absent  then  pooled  OLS  regression  gives 
consistent  parameter  estimates.  However,  the  error  correlation  can  lead  to  large  bias 
in  standard  errors  for  pooled  OLS  if  autocorrelation  is  ignored  and  to  relatively  small 
efficiency  gains  as  the  length  of  a  panel  is  increased. 

The analysis is particularly simple for estimation of the mean of $y$ based on $T$ observations for one individual (so $N = 1$) with equicorrelation. Then $y_t = \beta + u_t$, and the OLS estimator is the sample mean, so $\hat{\beta} = \bar{y} = T^{-1}\sum_t y_t$. The OLS estimator has true variance $V[\hat{\beta}] = V[\bar{y}] = T^{-2}\sum_t\sum_s \mathrm{Cov}[u_t, u_s]$. Assuming equicorrelation the double sum has $T$ variances equal to $\sigma^2$ and $T(T-1)$ covariances all equal to $\rho\sigma^2$. Hence $V[\bar{y}] = T^{-1}\sigma^2\{1 + (T-1)\rho\}$. Thus the iid result that $V[\bar{y}] = T^{-1}\sigma^2$ needs to be modified by inflation by a multiple $(1 + \rho(T-1))$. In particular $V[\bar{y}]$ approaches $\sigma^2$ as $\rho \to 1$.
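The inflation formula is easy to check numerically. The short sketch below evaluates $T^{-1}\sigma^2\{1 + (T-1)\rho\}$ on a grid of $T$ and $\rho$ and cross-checks it against the double sum of covariances; the function names are hypothetical and introduced only for this illustration.

```python
import numpy as np

def var_mean_equicorr(T, rho, sigma2=1.0):
    """Closed-form variance of the sample mean under equicorrelation."""
    return sigma2 / T * (1.0 + (T - 1) * rho)

def var_mean_double_sum(T, rho, sigma2=1.0):
    """Same quantity from T^{-2} times the sum of all T*T covariances."""
    omega = sigma2 * ((1.0 - rho) * np.eye(T) + rho * np.ones((T, T)))
    return omega.sum() / T**2

rhos = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
for T in (1, 2, 5, 10):
    print(T, [round(var_mean_equicorr(T, r), 2) for r in rhos])
```

Each printed row gives the variance of the mean for one panel length, with $\sigma^2$ normalized to one.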

Table 21.6 presents the impact of correlation on the variance of $\bar{y}$ for different values of $T$ and $\rho$, where for simplicity we normalize $\sigma^2 = 1$. The precision of estimation falls considerably as $\rho$ increases, and the estimate of $V[\bar{y}]$ under the assumption of independence given in the first column (assuming $\sigma^2$ is known for simplicity) can

Table 21.6. Variances of Pooled OLS Estimator with Equicorrelated Errors (a)

T | $\rho = 0.0$ | $\rho = 0.2$ | $\rho = 0.4$ | $\rho = 0.6$ | $\rho = 0.8$ | $\rho = 1.0$
1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
2 | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 1.00
5 | 0.20 | 0.36 | 0.52 | 0.68 | 0.84 | 1.00
10 | 0.10 | 0.28 | 0.46 | 0.64 | 0.82 | 1.00

(a) Given are the variances of the pooled OLS estimator as the correlation $\rho$ of equicorrelated errors increases, for an intercept-only model with error variance normalized to one assuming errors are correlated though homoskedastic.


greatly understate the true variance. Furthermore, for $\rho > 0$ the gain in precision due to an increase in the number of time periods is much smaller than with independent data, where a doubling of the number of time periods will halve estimator variance. For example, if $\rho = 0.4$ then with five time periods the estimator variance is only 0.52 times that with one period, instead of the much lower multiple of 0.2 with independent data. Moreover, a doubling from 5 to 10 time periods leads to only a small reduction in estimator variance from 0.52 to 0.46.

This result holds more generally for balanced panel regression with equicorrelated errors and regressors that are time-invariant, where the true variance of the OLS estimator is $(1 + \rho(T-1))$ times that assuming independent errors (see Kloek, 1981). In practice time-varying regressors are also included and clear analytical results are more difficult to obtain. For regression with intercept and single time-varying regressor, Scott and Holt (1982) show that the variance of the slope coefficient is inflated by the multiple $(1 + \hat{\rho}_x\rho(T-1))$, where $\hat{\rho}_x$ can be viewed as an estimate of the individual-specific autocorrelation in $x$. For panel data $\rho_x$ is often high so that there is still considerable inflation. These results also apply to other forms of clustered data and are presented in more detail in Section 24.5.2.

The preceding analysis assumes equicorrelated errors, a property of the RE model. If instead errors are AR(1) there is greater benefit from increasing panel length. Then $\mathrm{Cov}[u_t, u_s] = \rho^{|t-s|}\sigma^2$, so $V[\bar{y}] = T^{-2}\sigma^2[T + 2\sum_{s=1}^{T-1}(T-s)\rho^s]$. For example, if $\rho = 0.8$ then $V[\bar{y}] = 0.72\sigma^2$ for $T = 5$ and $0.54\sigma^2$ for $T = 10$, lower than the corresponding values from Table 21.6 of $0.84\sigma^2$ and $0.82\sigma^2$ for equicorrelation with $\rho = 0.8$, but still much higher than values of $0.2\sigma^2$ and $0.1\sigma^2$ for $\rho = 0.0$.
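The AR(1) figures can be reproduced by summing the implied covariance matrix directly. The following is a minimal sketch; the function names are hypothetical, and the AR(1) covariance is taken to be $\rho^{|t-s|}\sigma^2$ as in the text.

```python
import numpy as np

def var_mean_ar1(T, rho, sigma2=1.0):
    """Variance of the sample mean when Cov[u_t, u_s] = rho**|t-s| * sigma2."""
    t = np.arange(T)
    omega = sigma2 * rho ** np.abs(t[:, None] - t[None, :])
    return omega.sum() / T**2

def var_mean_ar1_closed_form(T, rho, sigma2=1.0):
    """T^{-2} * sigma2 * [T + 2 * sum_{s=1}^{T-1} (T - s) * rho**s]."""
    s = np.arange(1, T)
    return sigma2 * (T + 2.0 * np.sum((T - s) * rho**s)) / T**2

for T in (5, 10):
    print(T, round(var_mean_ar1(T, 0.8), 2), round(var_mean_ar1_closed_form(T, 0.8), 2))
```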

Microeconometricians  gravitate  to  the  RE  model  or  equicorrelated  error  models  for 
short  panels  as  an  outgrowth  of  the  literature  on  clustered  data  presented  in  Chapter  24. 
For  example,  consider  data  on  different  siblings  in  a  family  for  many  families.  Then 
it  is  natural  to  assume  that  correlations  of  unobservables  across  siblings  in  the  same 
family are the same for different sibling pairs. For example, the correlation between
the  first  and  second  siblings  equals  that  between  the  first  and  third  siblings.  Those  using 
long  panel  data  instead  often  have  a  time-series  background  and  naturally  assume  that 
correlation  declines  over  time,  leading  to  models  such  as  an  AR(1)  error. 

Determining which model of time-series correlation is more reasonable really depends on the data. Many short panels used in microeconomics applications yield pooled OLS residual autocorrelations that are qualitatively similar to those given in Table 21.3. These are closer to an RE model than an AR(1) model, though an ARMA(1,1) model may do well. Better still may be an RE model with AR(1) error. In all cases error correlation leads to a loss of information and the usual OLS standard errors understate the true standard errors. For short panels one can base inference on panel robust standard errors (see Section 21.2.3) that do not require specifying a model for the error correlation.

21.5.5.  Hours  and  Wages  Pooled  GLS  Example 

A variety of pooled GLS estimates and associated default and robust standard errors of the model $y_{it} = \alpha + \beta x_{it} + u_{it}$ for the lnhrs on lnwg regression are given in Table 21.7. All assume the error $u_{it}$ is independent over $i$ and identically distributed over $i$, and then have different assumptions on correlation in $u_{it}$ over $t$.

The  first  column  of  Table  21.7,  for  the  pooled  OLS  estimator,  repeats  the  first  col¬ 
umn  of  Table  21.2. 

Pooled  GLS  estimates  assuming  equicorrelated  errors  are  given  in  the  second  col¬ 
umn  of  Table  21.7.  These  coincide  with  the  RE-GLS  column  in  Table  21.2,  since  the 
random  effects  model  implies  equicorrelated  errors  (see  (21.6)). 

Pooled GLS estimates assuming AR(1) errors, so that $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$ where $\varepsilon_{it}$ is iid, are given in the third column of Table 21.7. The slope coefficient estimate is close to the pooled OLS estimate.

Pooled GLS estimates with no structure placed on error correlation aside from homoskedasticity, so that $\mathrm{Cov}[u_{it}, u_{is}] = \sigma_{ts}$, are given in the fourth column of Table 21.7. Then $\sigma_{ts}$ is consistently estimated given small $T$ by $\hat{\sigma}_{ts} = N^{-1}\sum_{i=1}^{N}\hat{u}_{it}\hat{u}_{is}$ for all $t$ and $s$. These are again close to the pooled OLS estimate.

It is clear from Table 21.7 that panel-robust standard errors should be used rather than the default standard errors, which here assume homoskedasticity and a correctly specified model for serial correlation.


Table 21.7. Hours and Wages: Pooled OLS and GLS Estimates (a)

Estimator | POLS | PFGLS | PFGLS | PFGLS
Error correlation | None | Equi | AR1 | General
$\alpha$ | 7.442 | 7.346 | 7.440 | 7.426
$\beta$ | .083 | .120 | .084 | .091
Robust se | (.029) | (.052) | (.037) | (.050)
Boot se | [.032] | [.060] | [.050] | [-]
Default se | {.009} | {.014} | {.012} | {.014}

(a) Pooled OLS and GLS linear panel regression of lnhrs on lnwg for a short panel assuming independence and identical distribution over $i$ and no fixed effects. Pooled GLS estimators assume equicorrelated or random effects errors (equi), AR(1) errors (AR1), or no structure on the correlations (general). Standard errors for the slope coefficients are panel robust in parentheses, panel bootstrap in square brackets, and usual default estimates that assume iid errors in curly braces.

21.6. Fixed Effects Model

The fixed effects model specifies

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}, \tag{21.22}$$

where the individual-specific effects $\alpha_1, \ldots, \alpha_N$ measure unobserved heterogeneity that is possibly correlated with the regressors, $x_{it}$ and $\beta$ are $K \times 1$ vectors, and to begin with the errors $\varepsilon_{it}$ are iid $[0, \sigma^2]$.

The challenge for estimation is the presence of the $N$ individual-specific effects that increase in number as $N \to \infty$. For practical purposes we are most interested in the $K$ slope parameters $\beta$, which give the marginal effect of change in regressors since $\partial E[y_{it}]/\partial x_{it} = \beta$. The $N$ parameters $\alpha_1, \ldots, \alpha_N$ are nuisance parameters or incidental parameters that are not of intrinsic interest. Nevertheless, their presence potentially prevents estimation of the parameters $\beta$ that are of interest.

Remarkably, for the linear model there are several ways to consistently estimate $\beta$ despite the presence of these nuisance parameters. These include (1) OLS in the within model (21.8); (2) direct OLS estimation of the model (21.2) with indicator variables for each of the $N$ fixed effects; (3) GLS in the within model (21.8); (4) ML estimation conditional on the individual means $\bar{y}_i$, $i = 1, \ldots, N$; and (5) OLS in the first-differences model (21.9).

The first two methods always lead to the same estimator for $\beta$. So too does the third if additionally the $\varepsilon_{it}$ in (21.22) are iid and the fourth if $\varepsilon_{it} \sim \mathcal{N}[0, \sigma^2]$. The last estimator differs from the others for $T > 2$. Such equivalences generally do not hold in nonlinear models, which are considered in Chapter 23.

The essential results for the within estimator are given in the next section. The first-differences estimator, presented in Section 21.6.2, is extensively used in Chapter 22 when regressors are no longer strongly exogenous. The other estimators are presented in the remainder of Section 21.6, which some readers may wish to skip.


21.6.1.  Within  or  Fixed  Effects  Estimator 


The within model is obtained by subtraction of the time-averaged model $\bar{y}_i = \alpha_i + \bar{x}_i'\beta + \bar{\varepsilon}_i$ from the original model. Then

$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i), \tag{21.23}$$

so the fixed effect $\alpha_i$ is eliminated, along with time-invariant regressors since $x_{it} - \bar{x}_i = 0$ if $x_{it} = x_i$ for all $t$.

Using OLS estimation yields the within estimator or fixed effects estimator $\hat{\beta}_W$, where

$$\hat{\beta}_W = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)'\right]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(y_{it} - \bar{y}_i). \tag{21.24}$$

The individual fixed effects $\alpha_i$ can then be estimated by

$$\hat{\alpha}_i = \bar{y}_i - \bar{x}_i'\hat{\beta}_W, \quad i = 1, \ldots, N. \tag{21.25}$$

The estimate $\hat{\alpha}_i$ is unbiased for $\alpha_i$, and it is consistent provided $T \to \infty$ since $\hat{\alpha}_i$ averages $T$ observations. In short panels the estimates $\hat{\alpha}_i$ are inconsistent, but $\hat{\beta}_W$ is nonetheless consistent for $\beta$. The $\alpha_i$ are viewed as nuisance parameters or ancillary parameters that fortunately do not need to be consistently estimated to obtain consistent estimates of the more important slope parameters $\beta$. This remarkable result need not carry over to more complicated fixed effects models such as nonlinear models.
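A minimal NumPy sketch of (21.24) and (21.25) follows. The function name `within_estimator`, the array layout (`y` of shape N x T, `X` of shape N x T x K), and the simulated data are assumptions made for illustration only. The simulation deliberately correlates the fixed effects with the regressor, so pooled OLS is biased while the within estimator is not.

```python
import numpy as np

def within_estimator(y, X):
    """Within (fixed effects) estimator; y is (N, T), X is (N, T, K)."""
    K = X.shape[2]
    yd = y - y.mean(axis=1, keepdims=True)      # y_it - ybar_i
    Xd = X - X.mean(axis=1, keepdims=True)      # x_it - xbar_i
    beta = np.linalg.lstsq(Xd.reshape(-1, K), yd.reshape(-1), rcond=None)[0]
    alpha = y.mean(axis=1) - X.mean(axis=1) @ beta   # as in (21.25)
    return beta, alpha

# Simulated short panel in which alpha_i is correlated with x_it.
rng = np.random.default_rng(1)
N, T = 1000, 5
alpha = rng.normal(size=N)
X = (alpha[:, None] + rng.normal(size=(N, T)))[:, :, None]
y = alpha[:, None] + 2.0 * X[:, :, 0] + rng.normal(size=(N, T))

beta_w, alpha_hat = within_estimator(y, X)
Z = np.column_stack([np.ones(N * T), X.reshape(-1)])
beta_pooled = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)[0][1]
print(beta_w, beta_pooled)
```

In this design the pooled OLS slope is pushed well above the true value of 2, whereas the within estimate stays close to it.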


Consistency  of  the  Within  Estimator 

The within estimator of $\beta$ is consistent if $\mathrm{plim}\,(NT)^{-1}\sum_i\sum_t(x_{it} - \bar{x}_i)(\varepsilon_{it} - \bar{\varepsilon}_i) = 0$. This should happen if either $N \to \infty$ or $T \to \infty$ and

$$E[\varepsilon_{it} - \bar{\varepsilon}_i | x_{it} - \bar{x}_i] = 0. \tag{21.26}$$

Owing to the presence of the averages $\bar{x}_i = T^{-1}\sum_t x_{it}$ and $\bar{\varepsilon}_i$ this condition is stronger than $E[\varepsilon_{it} | x_{it}] = 0$. A sufficient condition for (21.26) is the strong exogeneity condition that $E[\varepsilon_{it} | x_{i1}, \ldots, x_{iT}] = 0$. This precludes within estimation with lagged endogenous variables as regressors (see Section 22.5).


Asymptotic  Distribution  of  the  Within  Estimator 

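The distribution of $\hat{\beta}_W$ appears potentially complicated because the error $(\varepsilon_{it} - \bar{\varepsilon}_i)$ in the within model (21.8) is correlated over $t$ for given $i$. It is shown in the following that the usual OLS results nonetheless apply. Under the strong assumption that $\varepsilon_{it}$ is iid,

$$V[\hat{\beta}_W] = \sigma_\varepsilon^2\left[\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}\ddot{x}_{it}'\right]^{-1}, \tag{21.27}$$

where $\ddot{x}_{it} = x_{it} - \bar{x}_i$. A consistent and unbiased estimate of $\sigma_\varepsilon^2$ is $\hat{\sigma}_\varepsilon^2 = [N(T-1) - K]^{-1}\sum_i\sum_t\hat{\ddot{\varepsilon}}_{it}^2$, with $\hat{\ddot{\varepsilon}}_{it}$ denoting the residual from the within regression, where the degrees of freedom equal the sample size $NT$ less the number of model parameters $K$ and the $N$ individual effects. Note that if the regression (21.23) is estimated using a standard least-squares package then we need to inflate the reported variances by $[N(T-1) - K]^{-1}[NT - K]$.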

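For short panels (21.13) yields the robust estimate of the asymptotic variance

$$\hat{V}[\hat{\beta}_W] = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}\ddot{x}_{it}'\right]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\ddot{x}_{it}\ddot{x}_{is}'\hat{\ddot{\varepsilon}}_{it}\hat{\ddot{\varepsilon}}_{is}\left[\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}\ddot{x}_{it}'\right]^{-1}, \tag{21.28}$$

where $\ddot{x}_{it} = x_{it} - \bar{x}_i$ and $\hat{\ddot{\varepsilon}}_{it} = \hat{\varepsilon}_{it} - \bar{\hat{\varepsilon}}_i$. This preferred estimate permits arbitrary autocorrelations for the $\varepsilon_{it}$ and arbitrary heteroskedasticity.

Derivation of the Variance of the Within Estimator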


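We now derive the estimates of the variance of the within estimator given in (21.27) and (21.28), using matrix algebra. We begin with the model for the $it$th observation

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it},$$

where $x_{it}$ and $\beta$ are $K \times 1$ vectors. For the $i$th individual, stack all $T$ observations, so

$$\begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\alpha_i + \begin{bmatrix} x_{i1}' \\ \vdots \\ x_{iT}' \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iT} \end{bmatrix},$$

or

$$y_i = e\alpha_i + X_i\beta + \varepsilon_i, \quad i = 1, \ldots, N, \tag{21.29}$$

where $e = (1, 1, \ldots, 1)'$ is a $T \times 1$ vector of ones, $X_i$ is a $T \times K$ matrix, and $y_i$ and $\varepsilon_i$ are $T \times 1$ vectors.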

To transform model (21.29) to the within model, which subtracts the individual-specific mean, introduce the $T \times T$ matrix

$$Q = I_T - T^{-1}ee'. \tag{21.30}$$

Premultiplication by the matrix $Q$ creates deviations from the mean, since

$$QW_i = W_i - e\bar{w}_i', \tag{21.31}$$

where $W_i$ is a $T \times m$ matrix with $t$th row $w_{it}'$ and $\bar{w}_i = T^{-1}\sum_{t=1}^{T}w_{it}$ is an $m \times 1$ vector of averages. The result (21.31) is obtained using $e'W_i = T\bar{w}_i'$. Note also that $QQ' = Q$, using $e'e = T$ and $Qe = 0$, so $Q$ is idempotent.
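The stated properties of $Q$ can be verified numerically. The snippet below is a small check under assumed values ($T = 4$, an arbitrary $T \times 3$ matrix `W`); the variable names are illustrative only.

```python
import numpy as np

T = 4
e = np.ones((T, 1))                       # T x 1 vector of ones
Q = np.eye(T) - e @ e.T / T               # Q = I_T - T^{-1} e e'

rng = np.random.default_rng(5)
W = rng.normal(size=(T, 3))               # arbitrary T x m matrix
w_bar = W.mean(axis=0, keepdims=True)     # 1 x m row of column averages

print(np.allclose(Q @ W, W - e @ w_bar))  # deviations from means, as in (21.31)
print(np.allclose(Q @ Q.T, Q))            # idempotent (and symmetric)
print(np.allclose(Q @ e, 0.0))            # Q annihilates the vector of ones
print(np.linalg.matrix_rank(Q))           # rank T - 1, so Q is not invertible
```

The rank deficiency shown in the last line is why a generalized inverse is needed whenever GLS is applied to the transformed errors.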

Premultiplying the fixed effects model (21.29) for the $i$th individual by $Q$ yields

$$Qy_i = QX_i\beta + Q\varepsilon_i, \quad i = 1, \ldots, N, \tag{21.32}$$

using $Qe = 0$. This is the within model (21.23), since equivalently $y_i - e\bar{y}_i = (X_i - e\bar{x}_i')\beta + (\varepsilon_i - e\bar{\varepsilon}_i)$ using (21.31). Thus premultiplication by $Q$ yields the within model. An OLS estimation of (21.32) yields $\hat{\beta}_W$ with variance matrix, assuming independence over $i$, equal to

$$V[\hat{\beta}_W] = \left[\sum_{i=1}^{N}X_i'Q'QX_i\right]^{-1}\sum_{i=1}^{N}X_i'Q'\,V[Q\varepsilon_i | X_i]\,QX_i\left[\sum_{i=1}^{N}X_i'Q'QX_i\right]^{-1}. \tag{21.33}$$


Begin with the strong assumption that $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$, so that $\varepsilon_i$ are iid $[0, \sigma_\varepsilon^2 I]$. The $T \times 1$ error $Q\varepsilon_i$ is then independent over $i$ with mean zero and variance $V[Q\varepsilon_i] = QV[\varepsilon_i]Q' = \sigma_\varepsilon^2QQ' = \sigma_\varepsilon^2Q$. Then

$$\sum_{i=1}^{N}X_i'Q'\,V[Q\varepsilon_i | X_i]\,QX_i = \sum_{i=1}^{N}X_i'Q'\sigma_\varepsilon^2QQX_i = \sigma_\varepsilon^2\sum_{i=1}^{N}X_i'Q'QX_i,$$

so that (21.33) simplifies to the estimate given in (21.27), using

$$(QX_i)'(QX_i) = \sum_{t=1}^{T}(x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)'.$$

At the time of writing many packages use (21.27) but alternative estimators may be better. In particular, the assumption of serially uncorrelated error $\varepsilon_{it}$ is easily relaxed. If $\varepsilon_i$ are iid $[0, \Sigma_i]$ we use the more general form of the variance matrix (21.33) with $\mathrm{Cov}[Q\varepsilon_i, Q\varepsilon_j] = 0$, for $i \neq j$, and $V[Q\varepsilon_i]$ replaced by $(Q\hat{\varepsilon}_i)(Q\hat{\varepsilon}_i)'$, where $\hat{\varepsilon}_i = y_i - X_i\hat{\beta}_W$. This yields the estimate given in (21.28).
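The two variance estimates can be computed side by side. The sketch below is a hedged illustration: the function name `within_vcov`, the array layout, and the simulated AR(1) errors are assumptions introduced here. It uses the fact that the triple sum over $i$, $t$, $s$ in the middle of the sandwich equals $\sum_i s_i s_i'$ with $s_i = (QX_i)'(Q\hat{\varepsilon}_i)$.

```python
import numpy as np

def within_vcov(y, X):
    """Within estimator with default (iid) and panel-robust variance matrices.

    y is (N, T); X is (N, T, K).
    """
    N, T, K = X.shape
    yd = y - y.mean(axis=1, keepdims=True)
    Xd = X - X.mean(axis=1, keepdims=True)
    A = np.einsum("itk,itl->kl", Xd, Xd)             # sum_i (QX_i)'(QX_i)
    beta = np.linalg.solve(A, np.einsum("itk,it->k", Xd, yd))
    e = yd - Xd @ beta                               # Q * residual, shape (N, T)
    A_inv = np.linalg.inv(A)
    sigma2 = (e ** 2).sum() / (N * (T - 1) - K)      # df correction for N effects
    v_default = sigma2 * A_inv                       # iid form
    scores = np.einsum("itk,it->ik", Xd, e)          # s_i, shape (N, K)
    v_robust = A_inv @ (scores.T @ scores) @ A_inv   # sandwich form
    return beta, v_default, v_robust

# Simulated panel with serially correlated errors (illustrative only).
rng = np.random.default_rng(2)
N, T, K = 500, 5, 2
X = rng.normal(size=(N, T, K))
eps = np.zeros((N, T))
eps[:, 0] = rng.normal(size=N)
for t in range(1, T):
    eps[:, t] = 0.8 * eps[:, t - 1] + rng.normal(size=N)
alpha = rng.normal(size=N)
y = alpha[:, None] + X @ np.array([1.0, -1.0]) + eps

beta_hat, v_default, v_robust = within_vcov(y, X)
print(np.sqrt(np.diag(v_default)), np.sqrt(np.diag(v_robust)))
```

No finite-sample degrees-of-freedom adjustment is applied to the robust form here; packages differ on that choice.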

From  the  derivation  it  should  be  clear  that  /3W  is  also  consistent  in  the  random 
effects  model,  though  as  shown  in  Section  21.7  it  is  less  efficient  than  the  random 
effects  estimator  if  the  random  effects  model  is  appropriate. 


GLS  Estimation  of  the  Within  Model 

The within model (21.32) can also be estimated by feasible GLS.

If in fact $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$, however, then there are no gains to doing GLS. To see this, note that then $Q\varepsilon_i$ is independent of $Q\varepsilon_j$, $i \neq j$, with $V[Q\varepsilon_i] = \sigma_\varepsilon^2Q$, so the GLS estimator is

$$\hat{\beta}_{W,GLS} = \left[\sum_{i=1}^{N}X_i'Q'Q^{-}QX_i\right]^{-1}\sum_{i=1}^{N}X_i'Q'Q^{-}Qy_i,$$

where the generalized inverse $Q^{-}$ is used as $Q$ is not of full rank. However, $Q'Q^{-}Q = Q'Q$ since $QQ^{-}Q = Q$ for a generalized inverse, and $Q = QQ'$ as $Q$ here is idempotent. Replacing $Q'Q^{-}Q$ by $Q'Q$ in the formula for $\hat{\beta}_{W,GLS}$ yields the OLS estimator in (21.32).

There can be gains to GLS if other models for $\varepsilon_{it}$ are assumed. The approach is essentially the same as that in Section 21.5.2 for pooled GLS without fixed effects, except that first the fixed effect must be eliminated. This leads to error $Q\varepsilon_i$ that is less than full rank, so we first drop one time period and apply pooled GLS to only $(T-1)$ time periods. It is easier, and often not much less efficient, to instead just use the usual within FE estimator and then obtain panel-robust standard errors using (21.28).

MaCurdy (1982b) gives a Box-Jenkins-type analysis for identification and estimation of ARMA processes for $\varepsilon_{it}$ in a fixed effects model for a short panel. For short panels it is not necessary to assume an ARMA process for $\varepsilon_{it}$ or even stationarity, since for $N \to \infty$ we can always consistently estimate $\mathrm{Cov}[u_{it}, u_{is}]$ by an average over the $N$ individuals, $N^{-1}\sum_i$, of the corresponding residual cross-products. Nonetheless, there may be interest in determining the ARMA process for the errors.


21.6.2.  First-Differences  Estimator 

The within model is obtained by subtraction of the time-averaged model $\bar{y}_i = \alpha_i + \bar{x}_i'\beta + \bar{\varepsilon}_i$ from the original model. Alternatively, one can subtract the model lagged one period, $y_{i,t-1} = \alpha_i + x_{i,t-1}'\beta + \varepsilon_{i,t-1}$. Then

$$(y_{it} - y_{i,t-1}) = (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T, \tag{21.34}$$

so the fixed effect $\alpha_i$ is eliminated. An OLS estimation yields the first-differences estimator

$$\hat{\beta}_{FD} = \left[\sum_{i=1}^{N}\sum_{t=2}^{T}(x_{it} - x_{i,t-1})(x_{it} - x_{i,t-1})'\right]^{-1}\sum_{i=1}^{N}\sum_{t=2}^{T}(x_{it} - x_{i,t-1})(y_{it} - y_{i,t-1}). \tag{21.35}$$

Note that there are only $N(T-1)$ observations in this regression. An easy error to make in implementation is to stack all $NT$ observations and then subtract the first lag. Then only the $(1, 1)$ observation is dropped, whereas all $N$ first-period observations $(i, 1)$, $i = 1, \ldots, N$, must be dropped after differencing.
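The implementation pitfall is easy to see in code. In the sketch below, which uses a hypothetical function name and simulated data, differencing is done along the time axis of an N x T array so that each individual loses exactly its first period. Differencing the fully stacked vector instead leaves $NT - 1$ rows, $N - 1$ of which wrongly difference across individuals.

```python
import numpy as np

def first_difference_estimator(y, X):
    """First-differences estimator (21.35); y is (N, T), X is (N, T, K)."""
    K = X.shape[2]
    dy = np.diff(y, axis=1)                 # (N, T-1): differences within i only
    dX = np.diff(X, axis=1)                 # (N, T-1, K)
    beta = np.linalg.lstsq(dX.reshape(-1, K), dy.reshape(-1), rcond=None)[0]
    return beta, dy.size

# Simulated short panel with fixed effects correlated with x (illustrative).
rng = np.random.default_rng(4)
N, T, K = 800, 4, 1
alpha = rng.normal(size=N)
X = alpha[:, None, None] + rng.normal(size=(N, T, K))
y = alpha[:, None] + X @ np.array([2.0]) + rng.normal(size=(N, T))

beta_fd, n_obs = first_difference_estimator(y, X)
n_wrong = np.diff(y.reshape(-1)).size       # naive differencing of stacked data
print(beta_fd, n_obs, n_wrong)
```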


Consistency  of  the  First-Differences  Estimator 

Consistency of the first-differences estimator requires that $E[\varepsilon_{it} - \varepsilon_{i,t-1} | x_{it} - x_{i,t-1}] = 0$. This is a stronger condition than $E[\varepsilon_{it} | x_{it}] = 0$ but a weaker condition than the strong exogeneity condition needed for consistency of the within estimator.


Asymptotic  Distribution  of  the  First-Differences  Estimator 


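Statistical inference requires adjusting the usual OLS standard errors to account for the correlation over time in the error term $\varepsilon_{it} - \varepsilon_{i,t-1}$. To obtain the asymptotic variance of $\hat{\beta}_{FD}$, stack the model for the $i$th individual as

$$\Delta y_i = \Delta X_i\beta + \Delta\varepsilon_i,$$

where $\Delta y_i$ is a $(T-1) \times 1$ vector with entries $(y_{i2} - y_{i1}), \ldots, (y_{iT} - y_{i,T-1})$, and $\Delta X_i$ is a $(T-1) \times K$ matrix with rows $(x_{i2} - x_{i1})', \ldots, (x_{iT} - x_{i,T-1})'$. Then

$$\hat{\beta}_{FD} = \left[\sum_{i=1}^{N}(\Delta X_i)'(\Delta X_i)\right]^{-1}\sum_{i=1}^{N}(\Delta X_i)'(\Delta y_i) \tag{21.36}$$

has variance matrix, assuming independence over $i$, of

$$V[\hat{\beta}_{FD}] = \left[\sum_{i=1}^{N}(\Delta X_i)'(\Delta X_i)\right]^{-1}\sum_{i=1}^{N}(\Delta X_i)'\,V[\Delta\varepsilon_i | \Delta X_i]\,(\Delta X_i)\left[\sum_{i=1}^{N}(\Delta X_i)'(\Delta X_i)\right]^{-1}. \tag{21.37}$$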

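The simplest assumption is that $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$. Then the error $(\varepsilon_{it} - \varepsilon_{i,t-1})$ is now an MA(1) error, with variance $2\sigma_\varepsilon^2$ and one-period apart autocovariance $-\sigma_\varepsilon^2$ for individual $i$, since adjacent differences share the term $\varepsilon_{i,t-1}$ with opposite signs. It follows that $V[\Delta\varepsilon_i]$ equals $\sigma_\varepsilon^2$ times a $(T-1) \times (T-1)$ matrix with entries of 2 on the diagonal, entries of $-1$ on the immediate off-diagonals, and 0s elsewhere.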

A more realistic assumption is that $\varepsilon_{it}$ is correlated over time for given $i$, so that $\mathrm{Cov}[\varepsilon_{it}, \varepsilon_{is}] \neq 0$ for $t \neq s$, but is still independent over $i$. From (21.13), for short panels an estimator that is robust to general forms of autocorrelation and heteroskedasticity is (21.37) with $V[\Delta\varepsilon_i]$ replaced by $(\Delta\hat{\varepsilon}_i)(\Delta\hat{\varepsilon}_i)'$. One should never use the usual OLS standard errors from OLS regression of the first-differences model (21.34), as these are only correct in the unlikely event that $\varepsilon_{it}$ is a random walk, so that $(\varepsilon_{it} - \varepsilon_{i,t-1})$ are iid.

For $T = 2$ the first-differences and within estimators are equal since $\bar{y} = (y_1 + y_2)/2$ so $(y_1 - \bar{y}) = (y_1 - y_2)/2$ and $(y_2 - \bar{y}) = -(y_1 - y_2)/2$, and similarly for $x$. For $T > 2$ the two estimators differ. Under the simplest assumption that $\varepsilon_{it}$ are iid, it can be shown that the GLS estimator of the first-difference model (21.34) equals the within estimator. The estimator $\hat{\beta}_{FD}$ instead estimates (21.34) by OLS and is less efficient than $\hat{\beta}_W$. For this reason the first-difference estimator is not mentioned much in introductory courses. However, it is used extensively once lagged dependent variables are introduced (see Chapter 22). Then the within estimator is inconsistent. The first-differences estimator is also inconsistent, but relies on weaker exogeneity assumptions that permit consistent IV estimation.


21.6.3.  Conditional  ML  Estimator 


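The conditional MLE maximizes the joint likelihood of $y_{11}, \ldots, y_{NT}$ conditional on the individual averages $\bar{y}_1, \ldots, \bar{y}_N$. This method has the attraction that, for the linear panel model under normality, the fixed effects $\alpha_i$ are eliminated, so maximization is with respect to $\beta$ alone.

Assume that $y_{it}$ conditional on regressors $x_{it}$ and parameters $\alpha_i$, $\beta$, and $\sigma^2$ are iid with normal distribution $\mathcal{N}[\alpha_i + x_{it}'\beta, \sigma^2]$. Then the conditional likelihood function is

$$\begin{aligned} L_{COND}(\beta, \sigma^2, \alpha) &= \prod_{i=1}^{N} f(y_{i1}, \ldots, y_{iT} | \bar{y}_i) \\ &= \prod_{i=1}^{N} \frac{f(y_{i1}, \ldots, y_{iT}, \bar{y}_i)}{f(\bar{y}_i)} \\ &= \prod_{i=1}^{N} \frac{(2\pi\sigma^2)^{-T/2}}{(2\pi\sigma^2/T)^{-1/2}} \exp\left\{\sum_{t=1}^{T} -\left[(y_{it} - x_{it}'\beta)^2 - (\bar{y}_i - \bar{x}_i'\beta)^2\right]/2\sigma^2\right\}. \end{aligned} \tag{21.38}$$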

The first equality defines the conditional likelihood assuming independence over $i$. The second equality always holds since, suppressing subscript $i$, $f(y_1, \ldots, y_T | \bar{y}) = f(y_1, \ldots, y_T, \bar{y})/f(\bar{y})$ and $f(y_1, \ldots, y_T, \bar{y}) = f(y_1, \ldots, y_T)$ as knowledge of $\bar{y} = T^{-1}\sum_t y_t$ adds nothing given knowledge of $y_1, \ldots, y_T$. The third equality under normality comes after considerable algebra that is left as an exercise.

The key result is that the fixed effects $\alpha_i$ do not appear in the final equality in (21.38), so $L_{COND}(\beta, \sigma^2, \alpha)$ is in fact $L_{COND}(\beta, \sigma^2)$, and we need to maximize the conditional log-likelihood function (21.38) with respect to $\beta$ and $\sigma^2$ only. The resulting conditional ML estimator $\hat{\beta}_{CML}$ solves the first-order conditions

$$\frac{1}{\sigma^2}\sum_{t=1}^{T}\sum_{i=1}^{N}\left[(y_{it} - x_{it}'\beta)x_{it} - (\bar{y}_i - \bar{x}_i'\beta)\bar{x}_i\right] = 0,$$

or equivalently

$$\sum_{t=1}^{T}\sum_{i=1}^{N}\left[(y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i)'\beta\right](x_{it} - \bar{x}_i) = 0.$$

However, these are just the first-order conditions from OLS regression of $(y_{it} - \bar{y}_i)$ on $(x_{it} - \bar{x}_i)$.

The conditional MLE $\hat{\beta}_{CML}$ therefore equals the within estimator $\hat{\beta}_W$.

Intuitively, the method yields a consistent estimator because conditioning on $\bar{y}_i$ in (21.38) eliminated the fixed effects. More formally, $\bar{y}_i$ is a sufficient statistic for $\alpha_i$ and conditioning on the sufficient statistic enables consistent estimation of $\beta$ (see Section 23.2.2).


21.6.4.  Least-Squares  Dummy  Variable  Estimator 

Consider the original fixed effects model (21.22) before any differencing. An OLS
analysis can be applied directly to the model, simultaneously estimating $\alpha$ and $\beta$.

In principle no special software is needed. One simply estimates the OLS regression
of $y_{it}$ on $\mathbf{x}_{it}$ and a set of $N$ indicator variables $d_{1,it}, \ldots, d_{N,it}$, where $d_{j,it}$ equals one
if $j = i$ and equals zero otherwise. However, as $N$ gets large there are too many regressors to permit inversion of the $(N + K) \times (N + K)$ regressor matrix. Some matrix
algebra, however, reduces the problem to inversion of a $K \times K$ matrix.

The resulting estimator of $\beta$ turns out to equal the within estimator. This is a special case of the so-called Frisch-Waugh Theorem for a subset regression. If dummy
variables are partialled out by regression of all the variables on the dummies, and if
the residuals from these regressions are used in a second-stage regression, then we get
the same estimates as in the full regression. But these residuals here are simply deviations from their respective means, i.e., the within regression. For completeness we now
present the relevant matrix algebra.

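Stack the $T \times 1$ vectors in (21.29) over all $N$ individuals to yield the fixed effects
dummy variable model

$$
\begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{bmatrix}
=
\begin{bmatrix} \mathbf{e} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \ddots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{e} \end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}
+
\begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix} \beta
+
\begin{bmatrix} \boldsymbol{\varepsilon}_1 \\ \vdots \\ \boldsymbol{\varepsilon}_N \end{bmatrix},
$$

or

$$
\mathbf{y} = \left[ (\mathbf{I}_N \otimes \mathbf{e}) \;\; \mathbf{X} \right] \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \boldsymbol{\varepsilon},
\tag{21.39}
$$

where $\mathbf{y}$ is an $NT \times 1$ vector, the Kronecker product $(\mathbf{I}_N \otimes \mathbf{e})$ is an $NT \times N$ block-diagonal matrix, and $\mathbf{X}$ is the $NT \times K$ matrix of nonconstant regressors.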
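An OLS estimation of this model yields the least-squares dummy variable
(LSDV) estimator

$$
\begin{bmatrix} \hat{\alpha}_{\mathrm{LSDV}} \\ \hat{\beta}_{\mathrm{LSDV}} \end{bmatrix}
=
\begin{bmatrix} (\mathbf{I}_N \otimes \mathbf{e})'(\mathbf{I}_N \otimes \mathbf{e}) & (\mathbf{I}_N \otimes \mathbf{e})'\mathbf{X} \\ \mathbf{X}'(\mathbf{I}_N \otimes \mathbf{e}) & \mathbf{X}'\mathbf{X} \end{bmatrix}^{-1}
\begin{bmatrix} (\mathbf{I}_N \otimes \mathbf{e})'\mathbf{y} \\ \mathbf{X}'\mathbf{y} \end{bmatrix}
=
\begin{bmatrix} T\mathbf{I}_N & T\bar{\mathbf{X}} \\ T\bar{\mathbf{X}}' & \mathbf{X}'\mathbf{X} \end{bmatrix}^{-1}
\begin{bmatrix} T\bar{\mathbf{y}} \\ \mathbf{X}'\mathbf{y} \end{bmatrix},
$$

where the matrix of sample means $\bar{\mathbf{X}} = [\bar{\mathbf{x}}_1 \cdots \bar{\mathbf{x}}_N]'$, $\bar{\mathbf{x}}_i = T^{-1}\sum_{t=1}^{T} \mathbf{x}_{it}$, $\bar{\mathbf{y}} = [\bar{y}_1 \cdots \bar{y}_N]'$, and $\bar{y}_i = T^{-1}\sum_{t=1}^{T} y_{it}$. Using the formula for partitioned inverse and performing further algebra leads to

$$
\begin{bmatrix} \hat{\alpha}_{\mathrm{LSDV}} \\ \hat{\beta}_{\mathrm{LSDV}} \end{bmatrix}
=
\begin{bmatrix} \bar{\mathbf{y}} - \bar{\mathbf{X}}\hat{\beta}_{\mathrm{W}} \\ \left[ \mathbf{X}'\mathbf{X} - T\bar{\mathbf{X}}'\bar{\mathbf{X}} \right]^{-1} \left( \mathbf{X}'\mathbf{y} - T\bar{\mathbf{X}}'\bar{\mathbf{y}} \right) \end{bmatrix}.
\tag{21.40}
$$

Reexpressing this in summation notation, we have $\hat{\beta}_{\mathrm{LSDV}} = \hat{\beta}_{\mathrm{W}}$ defined in (21.24) and
$\hat{\alpha}_{\mathrm{LSDV}} = \hat{\alpha}_{\mathrm{FE}}$ defined in (21.25), so the LSDV estimators equal the within or fixed
effects estimator.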

For short panels an obvious potential problem is that consistent estimation of $\beta$
and $\alpha$ is not guaranteed as there are $N + K$ parameters to estimate and $N \to \infty$.
Remarkably, consistent estimation of $\beta$ is possible, even though $\alpha$ is inconsistently
estimated, unless additionally $T \to \infty$.

This estimator is second-moment efficient if $\varepsilon_{it}$ are iid $[0, \sigma^2]$. It follows that the
within estimator of $\beta$ is more efficient than alternative differencing estimators that
also eliminate $\alpha_i$, such as subtracting the first observation or the previous period's
observation. If additionally the errors are normally distributed, the LSDV estimator
equals the MLE by the usual equivalence of OLS and MLE in the linear model with
spherical normal errors.
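
The equivalence of the LSDV and within estimators can be checked numerically. The following is a minimal sketch on simulated data, assuming a balanced panel; the sample sizes, the seed, the parameter values, and all variable names are illustrative choices and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 30, 4, 2                       # illustrative panel dimensions (assumed)

alpha = rng.normal(size=N)               # individual-specific effects
# Regressors are shifted by alpha_i, so they are correlated with the effects.
X = rng.normal(size=(N, T, K)) + alpha[:, None, None]
beta = np.array([1.0, -0.5])             # illustrative true slopes
y = alpha[:, None] + X @ beta + rng.normal(scale=0.5, size=(N, T))

# LSDV: OLS of y on N individual dummies (I_N kron e) and X, as in (21.39).
D = np.kron(np.eye(N), np.ones((T, 1)))  # NT x N block-diagonal dummy matrix
Z = np.hstack([D, X.reshape(N * T, K)])
coef, *_ = np.linalg.lstsq(Z, y.reshape(N * T), rcond=None)
alpha_lsdv, beta_lsdv = coef[:N], coef[N:]

# Within: OLS of (y_it - ybar_i) on (x_it - xbar_i).
Xw = (X - X.mean(axis=1, keepdims=True)).reshape(N * T, K)
yw = (y - y.mean(axis=1, keepdims=True)).reshape(N * T)
beta_w, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Fixed effects implied by the within fit: ybar_i - xbar_i' beta_W.
alpha_w = y.mean(axis=1) - X.mean(axis=1) @ beta_w

print(np.allclose(beta_lsdv, beta_w), np.allclose(alpha_lsdv, alpha_w))
```

The dummy-variable regression needs an $NT \times (N + K)$ regressor matrix, whereas the within regression needs only the $K$ demeaned regressors, which is why the latter is preferred when $N$ is large.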


21.6.5.  Covariance  Estimator 

Suppose data belong to one of $N$ classes, with $y_{it}$ denoting the $t$th observation in the
$i$th class. The analysis of variance decomposes the total variation of $y_{it}$ around the
grand mean $\bar{y}$, $\sum_i \sum_t (y_{it} - \bar{y})^2$, into within-group variation $\sum_i \sum_t (y_{it} - \bar{y}_i)^2$
and between-group variation $\sum_i \sum_t (\bar{y}_i - \bar{y})^2$, where $\bar{y}_i$ is the mean in the $i$th group.
Group membership becomes more important as between-group variation increases.
The analysis of covariance extends this approach to introduce regressors, in which
case the residual sum of squares is similarly decomposed. This framework is widely
used in applied statistics.

For short panels each individual is viewed as a class, observed for several time
periods. The model (21.3) is called the analysis-of-covariance model, as it permits
the mean residual in the $i$th class to differ over classes. The estimator of this model,
the within estimator, is accordingly also called the covariance estimator.
21.7.  Random  Effects  Model

The random effects model (21.3) can be rewritten as

$$
y_{it} = \mu + \mathbf{x}_{it}'\beta + \alpha_i + \varepsilon_{it}, \quad t = 1, \ldots, T,
\tag{21.41}
$$

or

$$
y_{it} = \mathbf{w}_{it}'\delta + \alpha_i + \varepsilon_{it},
\tag{21.42}
$$

where $\mathbf{w}_{it} = [1 \;\; \mathbf{x}_{it}']'$ and $\delta = [\mu \;\; \beta']'$. The individual-specific effects $\alpha_i$ are assumed
to be realizations of iid random variables with distribution $[0, \sigma_\alpha^2]$ and the error $\varepsilon_{it}$ is
iid $[0, \sigma_\varepsilon^2]$. The nonrandom scalar intercept $\mu$ is added so that, unlike in (21.5), the
random effects can be normalized to have zero mean.

The model can alternatively be viewed as a special case of a random coefficient
or varying coefficient model, where only the intercept coefficient is random. The
model can be re-expressed as $y_{it} = \mu + \mathbf{x}_{it}'\beta + u_{it}$, where the error term $u_{it}$ has two
components $u_{it} = \alpha_i + \varepsilon_{it}$. For this reason the random effects model is also called the
error components model. Even clearer terminology may be the random intercept
model. Richer mixed models also permit random slopes; see Chapter 22.

There are many consistent estimators of the random effects model, including (1)
GLS estimation in the model (21.42); (2) ML estimation in the model (21.42) assuming $\alpha_i$ and $\varepsilon_{it}$ are normally distributed; (3) OLS estimation in the model (21.42); and
(4) fixed effects model estimators such as the within and first-differences estimators,
though these only estimate the coefficients of time-varying regressors. The first two
estimators are asymptotically equivalent but can vary in finite samples depending on
the specific estimates used for $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$. The remaining estimators are consistent,
though they are inefficient if in fact $\alpha_i$ and $\varepsilon_{it}$ are iid.


21.7.1.  GLS  Estimator 


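The random effects estimator of $\mu$ and $\beta$ is the feasible GLS estimator of the model
(21.42), and it is shown later in this section that it can be implemented by OLS regression of the transformed equation

$$
y_{it} - \hat{\lambda}\bar{y}_i = (1 - \hat{\lambda})\mu + (\mathbf{x}_{it} - \hat{\lambda}\bar{\mathbf{x}}_i)'\beta + v_{it},
\tag{21.43}
$$

where $v_{it} = (1 - \hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)$ and $\hat{\lambda}$ is consistent for

$$
\lambda = 1 - \sigma_\varepsilon / (T\sigma_\alpha^2 + \sigma_\varepsilon^2)^{1/2}.
\tag{21.44}
$$

Equivalently,

$$
\hat{\delta}_{\mathrm{RE}} = \begin{bmatrix} \hat{\mu}_{\mathrm{RE}} \\ \hat{\beta}_{\mathrm{RE}} \end{bmatrix}
= \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} (\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)(\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)' \right]^{-1}
\sum_{i=1}^{N} \sum_{t=1}^{T} (\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)(y_{it} - \hat{\lambda}\bar{y}_i),
\tag{21.45}
$$

where $\mathbf{w}_{it} = [1 \;\; \mathbf{x}_{it}']'$ and $\bar{\mathbf{w}}_i = [1 \;\; \bar{\mathbf{x}}_i']'$. Consistency requires $NT \to \infty$, through either
$N \to \infty$ or $T \to \infty$ or both.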




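Assuming that $\varepsilon_{it}$ and $\alpha_i$ are iid, the usual OLS output from OLS regression of
(21.43) can be used to obtain the variance matrix estimate, so that

$$
\hat{\mathrm{V}}\begin{bmatrix} \hat{\mu}_{\mathrm{RE}} \\ \hat{\beta}_{\mathrm{RE}} \end{bmatrix}
= \hat{\sigma}_\varepsilon^2 \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} (\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)(\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)' \right]^{-1}.
\tag{21.46}
$$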


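Alternatively, for short panels a robust variance estimate that permits quite general
behavior for $\alpha_i + \varepsilon_{it}$ can be obtained using (21.13). This yields

$$
\hat{\mathrm{V}}[\hat{\delta}_{\mathrm{RE}}]
= \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}' \right]^{-1}
\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{is}'\,\tilde{\varepsilon}_{it}\tilde{\varepsilon}_{is}
\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}' \right]^{-1},
\tag{21.47}
$$

where $\tilde{\mathbf{w}}_{it} = \mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i$ and $\tilde{\varepsilon}_{it} = \hat{\varepsilon}_{it} - \hat{\lambda}\bar{\hat{\varepsilon}}_i$, where $\hat{\varepsilon}_{it}$ is the RE residual. This estimate
permits arbitrary autocorrelations for the $\varepsilon_{it}$ and arbitrary heteroskedasticity.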

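Equation (21.46) requires consistent estimates of the variance components $\sigma_\varepsilon^2$ and
$\sigma_\alpha^2$. From the within or fixed effects regression of $(y_{it} - \bar{y}_i)$ on $(\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$ we obtain

$$
\hat{\sigma}_\varepsilon^2 = \frac{1}{N(T - 1) - K} \sum_{i} \sum_{t} \left( (y_{it} - \bar{y}_i) - (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\hat{\beta}_{\mathrm{W}} \right)^2.
\tag{21.48}
$$

From the between regression of $\bar{y}_i$ on an intercept and $\bar{\mathbf{x}}_i$, an equation that has error
term with variance $\sigma_\alpha^2 + \sigma_\varepsilon^2/T$, we obtain

$$
\hat{\sigma}_\alpha^2 = \frac{1}{N - (K + 1)} \sum_{i} \left( \bar{y}_i - \hat{\mu}_{\mathrm{B}} - \bar{\mathbf{x}}_i'\hat{\beta}_{\mathrm{B}} \right)^2 - \frac{1}{T}\hat{\sigma}_\varepsilon^2.
\tag{21.49}
$$

More efficient estimators of the variance components $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$ are possible (see, for
example, Amemiya, 1985), but these will not necessarily increase the efficiency of
$\hat{\beta}_{\mathrm{RE}}$. A wide range of estimators are possible. The variance estimator (21.49) can be
negative, in which case programs often set $\hat{\sigma}_\alpha^2 = 0$, so $\hat{\lambda} = 0$ and estimation is then by
pooled OLS.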

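To verify that the feasible GLS estimator simplifies to OLS estimation of (21.43),
stack (21.42) by observations from all $T$ time periods for given $i$ in the same way as
for the fixed effects model. Then

$$
\mathbf{y}_i = \mathbf{W}_i\delta + (\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i),
\tag{21.50}
$$

where $\mathbf{y}_i$, $\mathbf{e}$, and $\mathbf{X}_i$ are defined after (21.29), and $\mathbf{W}_i = [\mathbf{e} \;\; \mathbf{X}_i]$. To estimate by
GLS we need to obtain the variance matrix $\Omega$ of the $T \times 1$ vector error $(\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i)$.
Given independence of $\alpha_i$ and $\varepsilon_{it}$ we have $\mathrm{E}[(\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i)(\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i)'] = \mathrm{E}[\boldsymbol{\varepsilon}_i\boldsymbol{\varepsilon}_i'] + \mathrm{E}[\alpha_i^2]\mathbf{e}\mathbf{e}'$. Since $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$ and $\alpha_i$ are iid $[0, \sigma_\alpha^2]$ we obtain

$$
\Omega = \sigma_\varepsilon^2\mathbf{I}_T + \sigma_\alpha^2\mathbf{e}\mathbf{e}' = \sigma_\varepsilon^2 \left[ \mathbf{Q} + \frac{1}{\psi^2}(\mathbf{I}_T - \mathbf{Q}) \right],
$$

where $\mathbf{Q} = \mathbf{I}_T - T^{-1}\mathbf{e}\mathbf{e}'$ was introduced in (21.30) and $\psi^2 = \sigma_\varepsilon^2/[\sigma_\varepsilon^2 + T\sigma_\alpha^2]$. Using
$\mathbf{Q}\mathbf{Q}' = \mathbf{Q}$ we can easily verify that $\Omega^{-1} = \sigma_\varepsilon^{-2}[\mathbf{Q} + \psi^2(\mathbf{I}_T - \mathbf{Q})]$ and

$$
\Omega^{-1/2} = \frac{1}{\sigma_\varepsilon} \left[ \mathbf{Q} + \psi(\mathbf{I}_T - \mathbf{Q}) \right].
\tag{21.51}
$$

The GLS estimator is obtained by premultiplication of (21.50) by any scalar multiple
of $\Omega^{-1/2}$. Now

$$
\begin{aligned}
\left[ \mathbf{Q} + \psi(\mathbf{I}_T - \mathbf{Q}) \right] \mathbf{y}_i
&= \mathbf{y}_i - \mathbf{e}\bar{y}_i + \psi\left( \mathbf{y}_i - (\mathbf{y}_i - \mathbf{e}\bar{y}_i) \right) \\
&= \mathbf{y}_i - \lambda\mathbf{e}\bar{y}_i,
\end{aligned}
$$

where $\lambda = (1 - \psi)$. Performing similar algebra for $\mathbf{W}_i$, $\mathbf{e}\alpha_i$, and $\boldsymbol{\varepsilon}_i$ in (21.50) yields
the following model:

$$
\mathbf{y}_i - \lambda\mathbf{e}\bar{y}_i = (\mathbf{W}_i - \lambda\mathbf{e}\bar{\mathbf{w}}_i')\delta + (1 - \lambda)\mathbf{e}\alpha_i + (\boldsymbol{\varepsilon}_i - \lambda\mathbf{e}\bar{\varepsilon}_i),
\tag{21.52}
$$

where the transformed error in (21.52) has variance matrix $\sigma_\varepsilon^2\mathbf{I}_T$. The GLS estimator
is the OLS estimator of (21.52), but (21.52) is just a stacked version of (21.43) with
the scalar $\lambda$ replaced by a consistent estimate.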

The random effects estimator $\hat{\beta}_{\mathrm{RE}}$ of the slope parameters converges to the within
estimator as $T \to \infty$ since then $\lambda \to 1$. Otherwise, $\hat{\beta}_{\mathrm{RE}}$ can be shown, after some
algebra, to equal a matrix-weighted combination of the within estimator and the
between estimator. If the random effects model is appropriate, this weighted average
works better than using the within estimator alone. However, if the fixed effects model
is appropriate then this weighted average is inconsistent, as the between estimator is
then inconsistent. The estimator of the intercept can be shown to simplify to $\hat{\mu}_{\mathrm{RE}} = \bar{y} - \bar{\mathbf{x}}'\hat{\beta}_{\mathrm{RE}}$. For more details see, for example, Hsiao (2003, p. 36) or Greene (2003).
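
The feasible GLS steps described in this section can be sketched as follows: estimate $\sigma_\varepsilon^2$ from the within regression as in (21.48), estimate $\sigma_\alpha^2$ from the between regression as in (21.49), form $\hat{\lambda}$ from (21.44), and run OLS on the quasi-demeaned data of (21.43). This is an illustrative sketch for a balanced panel, not a reference implementation; the function name `random_effects_fgls`, the simulated data, and the parameter values are assumptions made for the example.

```python
import numpy as np

def random_effects_fgls(y, X):
    """Feasible GLS for the random effects model on a balanced panel.

    y has shape (N, T) and X has shape (N, T, K).
    Returns (delta_hat, lam_hat) with delta_hat = [mu, beta_1, ..., beta_K].
    """
    N, T, K = X.shape
    ybar, Xbar = y.mean(axis=1), X.mean(axis=1)

    # Within regression gives sigma_eps^2, as in (21.48).
    yw = (y - ybar[:, None]).reshape(-1)
    Xw = (X - Xbar[:, None, :]).reshape(-1, K)
    beta_w, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    s2_eps = np.sum((yw - Xw @ beta_w) ** 2) / (N * (T - 1) - K)

    # Between regression gives sigma_alpha^2, as in (21.49).
    Wb = np.column_stack([np.ones(N), Xbar])
    coef_b, *_ = np.linalg.lstsq(Wb, ybar, rcond=None)
    s2_alpha = np.sum((ybar - Wb @ coef_b) ** 2) / (N - (K + 1)) - s2_eps / T
    s2_alpha = max(s2_alpha, 0.0)        # a negative estimate is set to zero

    # lambda from (21.44); lam = 0 reduces the procedure to pooled OLS.
    lam = 1.0 - np.sqrt(s2_eps / (T * s2_alpha + s2_eps))

    # OLS on the quasi-demeaned data, as in (21.43) and (21.45).
    W = np.concatenate([np.ones((N, T, 1)), X], axis=2)
    Wt = (W - lam * W.mean(axis=1, keepdims=True)).reshape(-1, K + 1)
    yt = (y - lam * ybar[:, None]).reshape(-1)
    delta, *_ = np.linalg.lstsq(Wt, yt, rcond=None)
    return delta, lam

# Simulated example with assumed values mu = 1, beta = (0.5, -1), unit variances.
rng = np.random.default_rng(4)
N, T, K = 500, 5, 2
X = rng.normal(size=(N, T, K))
alpha = rng.normal(size=(N, 1))
y = 1.0 + X @ np.array([0.5, -1.0]) + alpha + rng.normal(size=(N, T))
delta_hat, lam_hat = random_effects_fgls(y, X)
print(delta_hat, lam_hat)
```

With unit variance components and $T = 5$, the population value is $\lambda = 1 - 1/\sqrt{6} \approx 0.59$, so the estimate should fall between the pooled OLS case ($\lambda = 0$) and the within case ($\lambda = 1$).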


21.7.2.  ML  Estimator 

In the derivation in the previous section, normality of the errors is not assumed. If they
are in fact normal, we can maximize the log-likelihood function with respect to $\beta$, $\mu$,
$\sigma_\varepsilon^2$, and $\sigma_\alpha^2$. For given $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$ the MLE for $\beta$ and $\mu$ is the same as the GLS estimator,
but the MLE gives estimators $\tilde{\sigma}_\varepsilon^2$ and $\tilde{\sigma}_\alpha^2$ that differ from those given in (21.48) and
(21.49).

Thus the MLE for $\beta$ and $\mu$ is given by (21.45) with $\hat{\lambda}$ replaced by the alternative
consistent estimate $\tilde{\lambda} = 1 - \tilde{\sigma}_\varepsilon/(T\tilde{\sigma}_\alpha^2 + \tilde{\sigma}_\varepsilon^2)^{1/2}$. Asymptotically, the MLE and GLS
estimators of the random effects model are equivalent, but the two will differ in finite
samples.

For the MLE there may be two local maxima rather than one of the likelihood for
$0 < \psi^2 < 1$, so care is needed to ensure a global maximum.


21.7.3.  Other  Estimators 

Many different estimators of $\beta$ are consistent if the random effects model is the correct model. In particular, the pooled OLS, within, first-differences, and between estimators are all consistent. However, they are inefficient if $\alpha_i$ and $\varepsilon_{it}$ are iid, and
the within and first-differences estimators can only estimate the coefficients of time-varying regressors.
21.8.  Modeling  Issues

In this section we consider some practical issues that arise in linear panel data models, even in the absence of complications such as endogeneity and lagged dependent
variables, topics that are deferred to Chapter 22.

21.8.1.  Tests  for  Pooling 

The random effects model restricts all regression parameters to be the same in different
cross sections and time periods, whereas the fixed effects model imposes parameter
constancy except for the intercept, which may vary across individuals. Tests of poolability test the appropriateness of these constraints.

These tests are usually done using a Chow test (see Greene, 2003, p. 130) based
on the tests for equality of regression coefficients in two linear regressions assuming a common
variance. Depending on the assumptions about errors, the Chow test may be applied
to models estimated by OLS or by GLS. Baltagi (2001, Chapter 4) and Hsiao (2003,
Chapter 2) provide detailed coverage.

For short panels it is not possible to allow the slope parameters to differ across
individuals, as then the number of parameters goes to infinity. However, parameters can
be permitted to vary over time. The model $y_{it} = \gamma + \mathbf{x}_{it}'\beta + u_{it}$ is then tested against
$y_{it} = \gamma_t + \mathbf{x}_{it}'\beta_t + u_{it}$. The most obvious method is to assume random effects with
$u_{it} = \varepsilon_{it} + \alpha_i$, estimate the restricted model ($\gamma_t = \gamma$ and $\beta_t = \beta$) using the random
effects GLS estimator, and compare the restricted and unrestricted residual sums of
squares in the transformed models. If more robust inference is preferred then panel-robust standard errors should be obtained and a Wald test performed. For short panels
it is common to specify models with slope parameters $\beta$ constant, though the intercept
$\gamma_t$ may be permitted to vary over time by inclusion of time dummies as additional
regressors.


21.8.2.  Tests  for  Individual-Specific  Effects 

Breusch and Pagan (1980) derived Lagrange-multiplier tests for the presence of
individual-specific random effects against the null hypothesis assumption of iid errors. These have the advantage of being easily implemented by an auxiliary regression
that requires only residuals from pooled OLS estimates. Alternatively, one can assume
normality and do a likelihood ratio test of the random effects MLE against the MLE of
the constant-coefficients model, or a Wald test of $\sigma_\alpha^2 = 0$ in the random effects model.
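
As an illustration of how little the Lagrange-multiplier test requires, the sketch below computes the closed-form version of the Breusch-Pagan statistic that is commonly quoted for balanced panels, $LM = \frac{NT}{2(T-1)}\left[\sum_i (\sum_t \hat{u}_{it})^2 / \sum_i \sum_t \hat{u}_{it}^2 - 1\right]^2$, which is asymptotically $\chi^2(1)$ under the null of no individual effects. This formula is not given in the text above and is used here as an assumption in place of the auxiliary regression; the function names and the simulated data are illustrative.

```python
import numpy as np

def pooled_ols_residuals(y, X):
    """Residuals from pooled OLS of y (N, T) on an intercept and X (N, T, K)."""
    N, T, K = X.shape
    W = np.column_stack([np.ones(N * T), X.reshape(-1, K)])
    coef, *_ = np.linalg.lstsq(W, y.reshape(-1), rcond=None)
    return (y.reshape(-1) - W @ coef).reshape(N, T)

def breusch_pagan_lm(resid):
    """LM statistic for random individual effects, balanced panel (N, T)."""
    N, T = resid.shape
    # Ratio of squared within-individual residual sums to the total sum of squares.
    ratio = np.sum(resid.sum(axis=1) ** 2) / np.sum(resid ** 2)
    return N * T / (2.0 * (T - 1)) * (ratio - 1.0) ** 2

# Simulated comparison: iid errors versus errors with a random individual effect.
rng = np.random.default_rng(2)
N, T = 200, 5
X = rng.normal(size=(N, T, 1))
eps = rng.normal(size=(N, T))
alpha = rng.normal(size=(N, 1))

y_iid = 1.0 + 0.5 * X[:, :, 0] + eps     # null model: no individual effect
y_re = y_iid + alpha                     # alternative: random effect added

lm_iid = breusch_pagan_lm(pooled_ols_residuals(y_iid, X))
lm_re = breusch_pagan_lm(pooled_ols_residuals(y_re, X))
print(lm_iid, lm_re)                     # compare with the 5% chi2(1) value, 3.84
```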

In  practice  one  often  rejects  the  null  hypothesis  that  the  errors  in  the  constant- 
coefficients  model  are  iid.  It  is  easiest  to  immediately  estimate  by  pooled  OLS  with 
panel-robust  standard  errors  or  by  random  effects  GLS. 

For  a  short  panel  formal  tests  for  the  presence  of  individual-specific  fixed  effects 
are  not  possible  because  of  the  incidental  parameters  problem.  It  is  not  possible  to 
test  whether  N  parameters  are  zero  when  there  are  only  NT  observations  and  T  is 
small.  Instead,  the  Hausman  test  of  Section  21.4.3  is  used  to  test  the  null  hypothesis  of 
random  effects  against  the  alternative  of  fixed  effects. 




21.8.3.  Prediction 

Prediction in models without individual effects is straightforward: Use $\hat{y}_{is} = \mathbf{x}_{is}'\hat{\beta}$.
This is a prediction of the population average $\mathrm{E}[y_{is} \mid \mathbf{x}_{is}]$.

Prediction for a given individual conditional on the individual-specific effect is more
difficult. This is prediction of $\mathrm{E}[y_{is} \mid \mathbf{x}_{is}, \alpha_i]$. We consider out-of-sample forecasts for
the $i$th individual using the random effects model (21.42). Then $y_{i,T+s} = \mathbf{w}_{i,T+s}'\delta + u_{i,T+s}$,
where $u_{i,T+s} = \alpha_i + \varepsilon_{i,T+s}$. The obvious predictor replaces $\delta$ by $\hat{\delta}_{\mathrm{RE}}$ and $u_{i,T+s}$ by either 0 or $\bar{\hat{u}}_i$, where $\bar{\hat{u}}_i = \bar{y}_i - \bar{\mathbf{w}}_i'\hat{\delta}_{\mathrm{RE}}$ is the average within-sample residual for the $i$th
individual. However, this is inefficient as it ignores the correlation between $u_{i,T+s}$ and
in-sample errors induced by the individual-specific random effect $\alpha_i$. The problem is
an example of the more general problem of prediction within a GLS rather than an OLS
framework. For this special case the best linear unbiased predictor (see Section 22.8.3)
is $\hat{y}_{i,T+s} = \mathbf{w}_{i,T+s}'\hat{\delta}_{\mathrm{RE}} + (T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma_\varepsilon^2))\bar{\hat{u}}_i$. For the fixed effects model the obvious predictor is $\hat{y}_{i,T+s} = \mathbf{x}_{i,T+s}'\hat{\beta}_{\mathrm{W}} + \hat{\alpha}_{i,\mathrm{FE}}$, but again this is inconsistent in short panels.
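
The best linear unbiased predictor for the random effects model amounts to adding a shrunken version of the individual's average residual to the usual linear prediction. A minimal sketch, with a hypothetical function name and made-up numbers chosen only so that the arithmetic is easy to follow:

```python
def blup_forecast(w_future, delta_re, ubar_i, T, sigma2_alpha, sigma2_eps):
    """Random effects BLUP for one individual.

    w_future     : regressor values [1, x_1, ..., x_K] for the forecast period
    delta_re     : random effects estimates [mu, beta_1, ..., beta_K]
    ubar_i       : individual's average within-sample residual
    T            : number of in-sample periods for the individual
    sigma2_alpha : variance of the individual effect
    sigma2_eps   : variance of the idiosyncratic error
    """
    # Weight on the average residual; it approaches one as T grows.
    shrink = T * sigma2_alpha / (T * sigma2_alpha + sigma2_eps)
    linear_part = sum(w * d for w, d in zip(w_future, delta_re))
    return linear_part + shrink * ubar_i

# Made-up inputs: shrink = 4 / (4 + 4) = 0.5, linear part = 1 + 2 * 0.5 = 2.
forecast = blup_forecast([1.0, 2.0], [1.0, 0.5], ubar_i=0.6,
                         T=4, sigma2_alpha=1.0, sigma2_eps=4.0)
print(forecast)
```

Setting the weight to zero or to one reproduces the two "obvious" predictors mentioned above, so the BLUP sits between them.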


21.8.4.  Two-Way  Effects  Models 

The analysis to date has focused on the one-way model, which is (21.1) with $u_{it} = \alpha_i + \varepsilon_{it}$. A more general model is the two-way effects model, with $u_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$, which additionally allows for time-specific effects. Then

$$
y_{it} = \alpha_i + \gamma_t + \mathbf{x}_{it}'\beta + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T.
\tag{21.53}
$$

This  model  was  presented  originally  in  (21.2). 

As  already  noted,  for  short  panels  the  usual  approach  is  to  treat  the  time-specific 
effects  as  fixed  and  estimate  them  as  the  coefficients  of  time  dummies  that  are  included 
in  the  regressors,  with  analysis  then  differing  according  to  whether  the  individual- 
specific  effects  are  treated  as  fixed  or  random. 

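If both $\alpha_i$ and $\gamma_t$ are fixed then the OLS estimator of $\beta$ in (21.53) is equivalent to
regression of $y_{it} - \bar{y}_i - \bar{y}_t + \bar{y}$ on $\mathbf{x}_{it} - \bar{\mathbf{x}}_i - \bar{\mathbf{x}}_t + \bar{\mathbf{x}}$, where $\bar{y}_i = T^{-1}\sum_t y_{it}$, $\bar{y}_t = N^{-1}\sum_i y_{it}$, and $\bar{y} = (NT)^{-1}\sum_i \sum_t y_{it}$, with similar definitions for $\bar{\mathbf{x}}_i$, $\bar{\mathbf{x}}_t$, and
$\bar{\mathbf{x}}$. This method of estimation is convenient if $T$ is large.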
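
The two-way within transformation can be checked against a regression that includes individual and time dummies directly. The sketch below does this on simulated data, assuming a balanced panel; the dimensions, seed, parameter values, and the helper name `two_way_demean` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 25, 6, 2                       # illustrative panel dimensions (assumed)

alpha = rng.normal(size=(N, 1))          # individual effects
gamma = rng.normal(size=(1, T))          # time effects
# Regressors are correlated with both sets of effects.
X = rng.normal(size=(N, T, K)) + alpha[:, :, None] + gamma[:, :, None]
beta = np.array([0.8, -0.3])
y = alpha + gamma + X @ beta + rng.normal(scale=0.5, size=(N, T))

def two_way_demean(a):
    """Return a_it - abar_i - abar_t + abar; the first two axes are (i, t)."""
    return (a
            - a.mean(axis=1, keepdims=True)        # individual means
            - a.mean(axis=0, keepdims=True)        # time means
            + a.mean(axis=(0, 1), keepdims=True))  # grand mean

beta_tw, *_ = np.linalg.lstsq(two_way_demean(X).reshape(-1, K),
                              two_way_demean(y).reshape(-1), rcond=None)

# Benchmark: OLS with N individual dummies and T - 1 time dummies.
D_i = np.kron(np.eye(N), np.ones((T, 1)))
D_t = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]   # drop one to avoid collinearity
Z = np.hstack([D_i, D_t, X.reshape(-1, K)])
coef, *_ = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)
beta_dummies = coef[-K:]

print(np.allclose(beta_tw, beta_dummies))
```

The equivalence relies on the panel being balanced; with missing observations the simple double-demeaning formula no longer reproduces the dummy-variable regression.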

If instead both $\alpha_i$ and $\gamma_t$ are random then the error term will have a component $\gamma_t$
that induces error correlation across individuals, whereas we have focused on independence over $i$. It can be shown that the GLS estimator can be computed by OLS
estimation of $y_{it}^*$ on a constant and $\mathbf{x}_{it}^*$,

$$
y_{it}^* = y_{it} - \lambda_1\bar{y}_i - \lambda_2\bar{y}_t + \lambda_3\bar{y},
$$

where $\bar{y}_i$, $\bar{y}_t$, and $\bar{y}$ have already been defined and $\mathbf{x}_{it}^*$ is defined analogously to $y_{it}^*$.
For this and other results for the two-way effects model see Hsiao (2003) or Baltagi
(2001).




21.8.5.  Unbalanced  Panel  Data 

The  discussion  thus  far  has  assumed  the  panel  is  balanced,  meaning  that  data  are 
available  for  every  individual  in  every  year.  For  panel  data  on  different  regions  this 
is  often  the  case.  In  contrast,  for  panel  surveys  of  individuals  there  is  usually  a  drop 
off  or  attrition  over  time  in  the  proportion  of  individuals  still  answering  the  survey. 
Moreover,  some  individuals  may  miss  one  or  more  periods  but  return  later,  in  some 
cases  by  design  as  in  the  case  of  rotating  panels  such  as  the  CPS,  where  households 
are  surveyed  for  four  consecutive  months,  not  surveyed  for  eight  months,  and  then 
surveyed  for  another  four  months.  Such  panels  where  different  individuals  appear  in 
different  years  are  called  unbalanced  panels  or  incomplete  panels. 

Let $d_{it}$ be an indicator variable equal to one if the $it$th observation is observed and
equal to zero otherwise. Then for the individual-specific effects model (21.3) the FE
estimator is consistent if the strong exogeneity assumption (21.4) becomes

$$
\mathrm{E}[u_{it} \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, d_{i1}, \ldots, d_{iT}] = 0,
\tag{21.54}
$$

and the RE estimator is consistent if additionally $\alpha_i$ is independent of the other conditioning variables. The fixed and random effects estimators can then be applied to
unbalanced data with relatively little adjustment. This should be clear from the initial presentation of the estimators as OLS estimators in various models given in
Section 21.2.2. For example, for the random effects model replace $\lambda$ in (21.10) by
$\lambda_i = 1 - \sigma_\varepsilon/(T_i\sigma_\alpha^2 + \sigma_\varepsilon^2)^{1/2}$, where $T_i$ is the number of observations for individual $i$
(see Baltagi, 1985, and Wansbeek and Kapteyn, 1989). Davis (2002) considers multiway random effects models. For the fixed effects model an individual observation must
be observed at least twice in the sample and degrees of freedom must be appropriately
adjusted. Baltagi (2001) gives a lengthy treatment of unbalanced panels. Econometrics packages that estimate the more standard of the panel models presented in Chapters 21-23 usually automatically handle missing observations.
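
To see how the random effects transformation adapts to an unbalanced panel, the individual-specific weight $\lambda_i$ can be computed for different numbers of observations $T_i$. This is a small illustration only; the function name and the variance values are assumptions.

```python
import math

def lambda_unbalanced(T_i, sigma2_alpha, sigma2_eps):
    """Quasi-demeaning weight for an individual observed T_i times."""
    return 1.0 - math.sqrt(sigma2_eps / (T_i * sigma2_alpha + sigma2_eps))

# With both variance components equal to one, lambda_i = 1 - 1 / sqrt(T_i + 1).
weights = {T_i: lambda_unbalanced(T_i, 1.0, 1.0) for T_i in (1, 3, 8)}
print(weights)   # T_i = 3 gives 0.5; T_i = 8 gives 2/3
```

Individuals observed more often receive a weight closer to one, so their data are treated more like within-transformed data, while individuals observed only once or twice are treated more like pooled observations.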

At times it may be convenient to convert an unbalanced panel into a balanced panel,
by including in the sample only those individuals with data available in all years. This
obviously can greatly reduce efficiency because of the loss of many observations. Furthermore, if data are not randomly missing this can exacerbate potential problems of a
nonrepresentative sample.

One  reason  for  missing  data  can  be  that  although  most  variables  are  observed,  at 
least  one  variable  is  not.  For  example,  the  nonresponse  rate  to  income  questions  can 
be  quite  high.  Rather  than  drop  an  entire  observation  because  data  for  one  regressor, 
income,  is  missing  there  may  be  efficiency  gains  to  using  the  imputation  methods 
presented  in  Chapter  27. 

Unbalanced  panels  require  special  methods  if  the  reason  for  individuals  dropping 
out  of  the  sample  is  correlated  with  the  error  term,  so  that  (21.54)  does  not  hold.  For 
example,  those  individuals  with  unusually  low  wages  (after  controlling  for  observed 
characteristics)  may  be  more  likely  to  drop  out  of  a  panel  sample.  The  result  is  an 
unrepresentative  panel  that  will  lead  to  attrition  bias  if  wage  is  the  dependent  variable. 
Consistent  estimation  requires  use  of  sample  selection  methods  extended  to  panel  data 
(see  Section  23.5.2). 
21.8.6.  Measurement  Error

Measurement  error  in  regressors  leads  to  inconsistent  parameter  estimates  in  cross- 
section  regression  models.  If  panel  data  methods  are  used  that  involve  differencing  of 
the  data,  the  result  may  be  a  large  increase  in  the  inconsistency  caused  by  measurement 
error  depending  on  the  assumptions  made  about  the  dgp.  This  is  pursued  in  Chapter  26. 


21.9.  Practical  Considerations 

The  various  estimators  presented  in  this  chapter  are  easily  implemented.  The  most 
foolproof  method  is  to  use  the  panel  commands  available  in  econometric  packages 
such  as  LIMDEP,  STATA,  and  TSP,  all  of  which  have  the  added  advantage  of  usually 
handling  unbalanced  panels.  Most  estimators  can  alternatively  be  estimated  using  an 
appropriate  pooled  OLS  regression  on  transformed  data  that  requires  only  a  cross- 
section  package,  though  standard  errors  may  then  differ  from  panel  package  standard 
errors  because  the  latter  may  ignore  autocorrelation  induced  by  transformation  and 
may  use  different  degrees  of  freedom. 

A  weakness  of  panel  commands  in  packages  is  that  they  currently  compute  standard 
errors  based  on  restrictive  distributional  assumptions  such  as  iid  errors  in  the  fixed 
effects  models,  and  iid  individual  effect  and  iid  errors  in  the  random  effects  model.  To 
compute  the  more  robust  standard  error  estimates  presented  in  this  chapter  may  require 
panel  estimation  with  a  panel  bootstrap  or  estimation  of  an  appropriate  pooled  OLS 
regression  using  an  option  to  compute  cluster-robust  standard  errors. 
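
The difference between default and cluster-robust standard errors for pooled OLS can be illustrated as follows. The sketch uses the usual cluster-robust sandwich form, with $\sum_i \mathbf{W}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{W}_i$ in the middle and no finite-sample correction; this is an assumption made for simplicity, and packages typically apply a degrees-of-freedom adjustment. The function name and simulated design are hypothetical.

```python
import numpy as np

def pooled_ols_with_panel_robust_se(y, X):
    """Pooled OLS on a balanced panel: y is (N, T), X is (N, T, K).

    Returns (coef, default_se, robust_se); the intercept comes first.
    """
    N, T, K = X.shape
    W = np.concatenate([np.ones((N, T, 1)), X], axis=2)    # add an intercept
    W2, y2 = W.reshape(-1, K + 1), y.reshape(-1)
    bread = np.linalg.inv(W2.T @ W2)
    coef = bread @ W2.T @ y2
    resid = (y2 - W2 @ coef).reshape(N, T)

    # Default standard errors: s^2 (W'W)^{-1}, appropriate only for iid errors.
    s2 = np.sum(resid ** 2) / (N * T - (K + 1))
    default_se = np.sqrt(np.diag(s2 * bread))

    # Panel-robust: cluster on the individual; scores[i] equals W_i' u_i.
    scores = np.einsum("itk,it->ik", W, resid)
    robust_se = np.sqrt(np.diag(bread @ (scores.T @ scores) @ bread))
    return coef, default_se, robust_se

# Simulated design: a persistent regressor and a random individual effect.
rng = np.random.default_rng(3)
N, T = 300, 5
X = rng.normal(size=(N, 1, 1)) + 0.3 * rng.normal(size=(N, T, 1))
y = 1.0 + 0.5 * X[:, :, 0] + rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
coef, default_se, robust_se = pooled_ols_with_panel_robust_se(y, X)
print(coef, default_se, robust_se)
```

Because both the regressor and the error are correlated within individuals in this design, the default standard error of the slope understates the sampling variability relative to the cluster-robust one.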

In microeconometric analysis there is a fundamental distinction between models
with and models without fixed effects. If a model without fixed effects is preferred
it should be justified by passing a Hausman test. If this test rejects the random effects model then it may still be possible to consistently estimate coefficients of time-invariant regressors using the instrumental variables methods presented in the next
chapter.


21.10.  Bibliographic  Notes 

Most  textbooks,  such  as  Greene’s  (2003),  include  at  least  a  chapter  on  panel  data  models. 
Wooldridge  (2002)  has  several  chapters  that  cover  both  linear  and  nonlinear  panel  models. 
Econometrics  monographs  on  panel  data  include  those  by  Hsiao  (1986,  2003),  Baltagi  (1995, 
2001),  Matyas  and  Sevestre  (1995),  M-J.  Lee  (2002),  and  Arellano  (2003).  The  last  three  books 
place greater emphasis on the methods presented in Chapters 22 and 23. Diggle, Liang, and
Zeger  (1994,  2002)  is  a  standard  statistics  reference. 

21.4  Mundlak  (1978)  wrote  a  classic  article  on  fixed  versus  random  effects  models.  Hausman 
(1978)  used  tests  between  these  two  models  to  illustrate  his  testing  approach. 

21.6  Kuh  (1959)  and  Hoch  (1962)  provide  two  early  panel  data  applications  to  estimation  of 
investment  functions  and  of  Cobb-Douglas  production  functions.  These  studies  contrast 
use  of  within  estimates  using  time-series  variation  and  between  estimates  using  cross- 
section  variation. 
Exercises

21-1  (Adapted from Baltagi, 1999) Consider the panel model $y_{it} = \alpha + \beta x_{it} + u_{it}$,
where $\alpha$ and $\beta$ are scalars.

(a)  Show by appropriate subtraction that this model implies

$$
y_{it} - \bar{y} = \beta(x_{it} - \bar{x}_i) + \beta(\bar{x}_i - \bar{x}) + (u_{it} - \bar{u}),
$$

where $\bar{y} = (NT)^{-1}\sum_{i,t} y_{it}$, $\bar{x} = (NT)^{-1}\sum_{i,t} x_{it}$, and $\bar{x}_i = T^{-1}\sum_t x_{it}$.

(b)  For the corresponding unrestricted least-squares regression

$$
y_{it} - \bar{y} = \beta_1(x_{it} - \bar{x}_i) + \beta_2(\bar{x}_i - \bar{x}) + (u_{it} - \bar{u}),
$$

show that the least-squares estimator of $\beta_1$ is the within estimator and that
of $\beta_2$ is the between estimator.

(c)  Show that if $u_{it} = \mu_i + \nu_{it}$, where $\mu_i \sim \mathrm{iid}[0, \sigma_\mu^2]$ and $\nu_{it} \sim \mathrm{iid}[0, \sigma_\nu^2]$, and the
two are mutually independent across both $i$ and $t$, the OLS and the GLS
estimators are equivalent.

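21-2  Consider estimation of the fixed effects linear regression model $y_{it} = \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it}$, where $\alpha_i$ are fixed effects possibly correlated with $\mathbf{x}_{it}$. Stacking all $T$ observations for individual $i$ yields $\mathbf{y}_i = \alpha_i\mathbf{e} + \mathbf{X}_i\beta + \boldsymbol{\varepsilon}_i$ (see (21.29) for definitions). Consider the estimator $\hat{\beta} = \left[\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{J}'\mathbf{J}\mathbf{X}_i\right]^{-1} \times \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{J}'\mathbf{J}\mathbf{y}_i$, where $\mathbf{J}$ is a $T \times T$
matrix of known constants such that $\mathbf{J}\mathbf{e} = \mathbf{0}$. [Note that an example of $\mathbf{J}$ is
$\mathbf{Q} = \mathbf{I}_T - T^{-1}\mathbf{e}\mathbf{e}'$.]

(a)  Provide a motivation for the estimator $\hat{\beta}$.

(b)  Find $\mathrm{E}[\hat{\beta}]$. For simplicity assume that $\mathbf{X}_i$ are fixed regressors and that $\varepsilon_{it}$ are
iid $[0, \sigma^2]$. Is $\hat{\beta}$ unbiased for $\beta$?

(c)  Find $\mathrm{V}[\hat{\beta}]$. For simplicity assume that $\mathbf{X}_i$ are fixed regressors and that $\varepsilon_{it}$ are
iid $[0, \sigma^2]$.

(d)  Now suppose $\varepsilon_{it}$ are independent over $i$ but correlated over $t$ with $\mathrm{V}[\boldsymbol{\varepsilon}_i] = \Omega_i$.
Give $\mathrm{V}[\hat{\beta}]$.

(e)  Suppose that the effects $\alpha_i$ are random $(0, \sigma_\alpha^2)$ rather than fixed. Would the
estimator in this exercise be consistent?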

21-3  (Adapted from Baltagi, 1998) Consider the fixed effects, two-way error component panel data model

$$
y_{it} = \alpha + \mathbf{x}_{it}'\beta + \mu_i + \lambda_t + \varepsilon_{it},
$$

where $\alpha$ is a scalar, $\mathbf{x}_{it}$ is a $K \times 1$ vector of exogenous regressors, $\beta$ is a $K \times 1$
vector, $\mu$ and $\lambda$ denote fixed individual and time effects, respectively, and $\varepsilon_{it} \sim \mathrm{iid}[0, \sigma^2]$.

(a)  Show that the within estimator of $\beta$, which is best linear unbiased, can be
obtained by applying two within (one-way) transformations on this model.
The first is the within transformation ignoring the time effects followed by the
within transformation ignoring the individual effects.

(b)  Show that the order of these two within (one-way) transformations is unimportant. Give an intuitive explanation for this result.

21-4  Use a 50% random subsample of the wage-hours data in Section 21.3.

(a)  Can $\beta$ be directly interpreted as a labor supply elasticity? Explain.

(b)  For the following estimators: (1) pooled OLS, (2) between, (3) within, (4) first
differences, (5) random effects GLS, (6) random effects MLE, give (i) $\hat{\beta}$ (estimated coefficient of lnwg), (ii) default standard error, and (iii) panel bootstrap
standard error with 200 replications.

(c)  Are the estimates of $\beta$ similar?

(d)  Is there a systematic difference between default standard errors and panel-robust standard errors?

(e)  Will the pooled OLS estimator in part (b) be consistent for $\beta$ in a fixed effects
model? Will the pooled OLS estimator be consistent for $\beta$ in a random effects
model?

(f)  Perform a Hausman test of the difference between the fixed and random
effects (GLS) estimates of $\beta$ in this model. Do this manually using the earlier
regression output with the default standard errors. What do you conclude
and which model is favored?

(g)  Given the preceding evidence, do you believe that the labor supply curve is
upward sloping? Explain.
Linear  Panel  Models:  Extensions


22.1.  Introduction 

The previous chapter presented variants of the linear panel data model with a fixed
or random intercept and regressors that are strongly exogenous. Now we move on to
various extensions for linear models, with focus on relaxation of the strong exogeneity assumption to permit consistent estimation of models with endogenous variables
and/or lagged dependent variables as regressors.

The use of instrumental variables is a standard method to handle endogenous regressors. It is much easier to obtain instruments with panel data than with cross-section
data, since exogenous regressors in other time periods can be used as instruments for
endogenous regressors in the current time period. The only complication is to first
control for any fixed or random effects.

Panel data permit regressors to additionally include lagged dependent variables,
data unavailable with a single cross section. This permits estimation of dynamic models
that distinguish between persistence of earnings, for example, as the result of variation
around an unobserved individual-specific effect, as in Chapter 21, and persistence
caused by the outcomes of previous periods directly determining the outcome
of the current period. The estimators of Chapter 21 that control for individual-specific
effects become inconsistent, however, if lagged dependent variables are regressors.
Instrumental variables estimation using longer lags as instruments leads to consistent
estimation.

Panel data provide an excess of moment conditions available for estimation, owing
to an abundance of instruments, and panel model errors are usually not iid. The natural
estimation framework is that of panel GMM, presented in detail in Section 22.2
and illustrated with an application to estimation of the labor supply elasticity in
Section 22.3. Further details on estimation with individual-specific effects and regressors
that are endogenous or lagged dependent variables are presented in Sections 22.4 and
22.5. The discussion is quite extensive because of the many possible variations that are
covered. These include the presence of individual-specific effects that may be fixed or
random, different exogeneity assumptions, and models that may be just-identified or
over-identified.

The  remainder  of  this  chapter  considers  other  stand-alone  topics  that  generally  do 
not  require  reading  of  Sections  22.2-22.5.  Models  closely  related  to  panel  data  models 
are  presented  in  Sections  22.6-22.8,  namely  repeated  cross-section  data,  differences 
in  differences,  and  hierarchical  models. 

22.2.  GMM  Estimation  of  Linear  Panel  Models 

The panel regression models in Chapter 21 restricted the scalar dependent variable $y_{it}$
to depend on just the contemporaneous value of regressors $x_{it}$, even though potentially
all of $x_{i1}, \ldots, x_{iT}$ could be regressors under the Chapter 21 assumption of strong
exogeneity. This introduces the possibility of more efficient estimation using excluded
regressors from other periods as instruments in the current period.

Furthermore, regressors in other periods may be valid instruments for current-period
regressors that are either endogenous or lags of the dependent variable. So
instruments are readily available to permit consistent IV estimation in situations where
failure of the strong exogeneity assumption leads to inconsistency of the Chapter 21
estimators.

This section provides a general presentation of panel GMM estimation, a very useful
framework for panel IV estimation that is used extensively throughout Sections
22.2-22.5. Then we introduce the use of exogenous variables (regressors or instruments)
in periods other than the current period as instruments. Once this groundwork
is laid it is a relatively minor adaptation to incorporate fixed or random effects,
typically included in panel models. This is deferred to subsequent sections.

22.2.1.  Panel  GMM 
Consider the linear panel model

$$y_{it} = x_{it}'\beta + u_{it}, \tag{22.1}$$

where the regressors $x_{it}$ may have both time-varying and time-invariant components
and may include an intercept. Here there is no individual-specific effect $\alpha_i$, an
assumption relaxed from Section 22.3 on, and $x_{it}$ is assumed to include only
current-period variables, an assumption relaxed in Section 22.5. Observations are assumed to
be independent over $i$ and a short panel with $T$ fixed and $N \to \infty$ is assumed.

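Begin by stacking all $T$ observations for the $i$th individual,

$$y_i = X_i\beta + u_i, \tag{22.2}$$

where $y_i$ and $u_i$ are $T \times 1$ vectors and $X_i$ is a $T \times K$ matrix with $t$th row $x_{it}'$, so

$$y_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix}; \quad
X_i = \begin{bmatrix} x_{i1}' \\ \vdots \\ x_{iT}' \end{bmatrix}; \quad
u_i = \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iT} \end{bmatrix}.$$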
The model (22.2) defines a linear system of equations, so the results of Section 6.9.5
for systems IV estimation with data independent over $i$ are directly applicable.

Assume the existence of a $T \times r$ matrix of instruments $Z_i$, where $r \geq K$ is the
number of instruments, that satisfy the $r$ moment conditions

$$E[Z_i'u_i] = 0. \tag{22.3}$$

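The GMM estimator based on these moment conditions minimizes the associated
quadratic form

$$Q_N(\beta) = \left[\sum_{i=1}^N Z_i'u_i\right]' W_N \left[\sum_{i=1}^N Z_i'u_i\right],$$

where $W_N$ denotes an $r \times r$ weighting matrix. Given $u_i = y_i - X_i\beta$, some algebra
yields the panel GMM estimator

$$\hat\beta_{PGMM} = \left[\left(\sum_i X_i'Z_i\right) W_N \left(\sum_i Z_i'X_i\right)\right]^{-1}
\left(\sum_i X_i'Z_i\right) W_N \left(\sum_i Z_i'y_i\right).$$

The essential condition for consistency of this estimator is assumption (22.3).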
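The estimator above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated, the array layout (`y` as N x T, `X` as N x T x K, `Z` as N x T x r) and the function name `panel_gmm` are choices made here, not part of the text. The sketch checks the just-identified case $Z_i = X_i$, where the panel GMM estimator should coincide with pooled OLS.

```python
import numpy as np

def panel_gmm(y, X, Z, W):
    """Panel GMM estimate for y_i = X_i beta + u_i with instruments Z_i.

    y: (N, T), X: (N, T, K), Z: (N, T, r), W: (r, r) weighting matrix.
    """
    ZX = np.einsum("itr,itk->rk", Z, X)   # sum_i Z_i' X_i
    Zy = np.einsum("itr,it->r", Z, y)     # sum_i Z_i' y_i
    A = ZX.T @ W @ ZX                     # X'Z W Z'X
    b = ZX.T @ W @ Zy                     # X'Z W Z'y
    return np.linalg.solve(A, b)

# Simulated data (hypothetical): N individuals, T periods, K regressors.
rng = np.random.default_rng(0)
N, T, K = 200, 4, 2
X = rng.normal(size=(N, T, K))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(size=(N, T))

# Just-identified case Z_i = X_i: any full-rank W gives pooled OLS.
beta_pgmm = panel_gmm(y, X, X, np.eye(K))
beta_ols = np.linalg.lstsq(X.reshape(N * T, K), y.reshape(N * T), rcond=None)[0]
```

The sums over $i$ are written with `einsum` so that the stacked matrices never need to be formed explicitly; any equivalent stacking would do.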

In many applications $Z_i$ is composed of current and lagged values of exogenous
regressors. For example, suppose all regressors are contemporaneously exogenous.
Then $E[x_{it}u_{it}] = 0$ implies (22.3) with $Z_i = [x_{i1} \cdots x_{iT}]'$. In this case the model is
just identified and, since $Z_i = X_i$, $\hat\beta_{PGMM}$ simplifies to the pooled OLS estimator of
Chapter 21. If it is additionally assumed that $E[x_{i,t-1}u_{it}] = 0$, then $x_{i,t-1}$ is available as
additional instruments for the $it$th observation, the model is over-identified, and more
efficient estimation is possible using the PGMM estimator.

The use of various exogeneity assumptions to form the instrument matrix $Z_i$ is
detailed in Section 22.2.4. The analysis requires adaptation in panel data models with
individual-specific effects $\alpha_i$. This is illustrated in an empirical application in
Section 22.3 and is dealt with explicitly in Sections 22.4 and 22.5.

22.2.2.  Panel-Robust  Statistical  Inference 

To express the distribution of the panel GMM estimator it is convenient to use more
compact notation. Rewrite

$$\hat\beta_{PGMM} = [X'ZW_NZ'X]^{-1}X'ZW_NZ'y, \tag{22.4}$$

where $X' = [X_1' \cdots X_N']$, $Z' = [Z_1' \cdots Z_N']$, and $y' = [y_1' \cdots y_N']$. Then $\hat\beta_{PGMM}$ is
asymptotically normal with estimated asymptotic variance matrix

$$\hat V[\hat\beta_{PGMM}] = [X'ZW_NZ'X]^{-1}X'ZW_N(N\hat S)W_NZ'X[X'ZW_NZ'X]^{-1}, \tag{22.5}$$

see Equation (6.97), where $\hat S$ is a consistent estimate of the $r \times r$ matrix

$$S = \operatorname{plim} \frac{1}{N}\sum_{i=1}^N Z_i'u_iu_i'Z_i, \tag{22.6}$$
and independence over $i$ has been assumed. The essential assumption for this result is
that $N^{-1/2}Z'u = N^{-1/2}\sum_i Z_i'u_i \xrightarrow{d} \mathcal{N}[0, S]$. A White-type robust estimate of $S$ is

$$\hat S = \frac{1}{N}\sum_{i=1}^N Z_i'\hat u_i\hat u_i'Z_i, \tag{22.7}$$

where the $T \times 1$ estimated residual $\hat u_i = y_i - X_i\hat\beta$.

The estimate (22.5) yields panel-robust standard errors allowing for both
heteroskedasticity and correlation over time. Alternatively, the panel bootstrap could be
used. For further discussion see Section 21.2.3 where the same issues apply.
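As a rough illustration of (22.5) and (22.7), the following sketch computes the panel GMM estimate together with its panel-robust variance matrix. The simulated data, the array layout, and the function name `pgmm_with_robust_variance` are assumptions made here for illustration. With $Z_i = X_i$ and $W_N = [X'X]^{-1}$ the result reduces to the familiar cluster-robust sandwich for pooled OLS, which gives a simple check.

```python
import numpy as np

def pgmm_with_robust_variance(y, X, Z, W):
    """Panel GMM estimate and panel-robust variance, following (22.4)-(22.7).

    y: (N, T), X: (N, T, K), Z: (N, T, r), W: (r, r) weighting matrix.
    """
    N = y.shape[0]
    ZX = np.einsum("itr,itk->rk", Z, X)        # Z'X
    Zy = np.einsum("itr,it->r", Z, y)          # Z'y
    A_inv = np.linalg.inv(ZX.T @ W @ ZX)       # [X'Z W Z'X]^{-1}
    beta = A_inv @ ZX.T @ W @ Zy
    u_hat = y - X @ beta                       # residuals, shape (N, T)
    g = np.einsum("itr,it->ir", Z, u_hat)      # row i holds (Z_i' u_i)'
    S_hat = g.T @ g / N                        # equation (22.7)
    V = A_inv @ ZX.T @ W @ (N * S_hat) @ W @ ZX @ A_inv   # equation (22.5)
    return beta, V

# Simulated data (hypothetical) with errors correlated over t for given i.
rng = np.random.default_rng(1)
N, T, K = 300, 5, 2
X = rng.normal(size=(N, T, K))
u = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))   # common component per i
y = X @ np.array([1.0, 2.0]) + u

W = np.linalg.inv(np.einsum("itk,itl->kl", X, X))       # [Z'Z]^{-1} with Z = X
beta_hat, V_hat = pgmm_with_robust_variance(y, X, X, W)
se_robust = np.sqrt(np.diag(V_hat))
```

Because the moment contributions are summed within each individual before the outer product is taken, arbitrary correlation over $t$ for given $i$ is allowed.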


22.2.3.  One-Step  and  Two-Step  Panel  GMM 

Different full-rank weighting matrices $W_N$ in (22.4) lead to different systems GMM
estimators, except in the just-identified case of $r = K$ when the PGMM estimator
simplifies to the IV estimator $[Z'X]^{-1}Z'y$ for any $W_N$. The discussion mirrors that in
Section 6.4.2. The two leading choices of $W_N$ are given here.


One-Step  GMM 

The one-step GMM or two-stage least-squares estimator uses weighting matrix
$W_N = [\sum_i Z_i'Z_i]^{-1} = [Z'Z]^{-1}$, leading to

$$\hat\beta_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y. \tag{22.8}$$

The motivation for this estimator is that it can be shown to be the optimal PGMM
estimator based on (22.3) if $u_i|Z_i$ is iid $[0, \sigma^2 I_T]$.

This estimator is called one-step GMM because given the data it can be directly
calculated using Equation (22.8). It is called 2SLS as it can instead be obtained in
two stages by (1) OLS of $X_i$ on $Z_i$, yielding prediction $\hat X_i$, and (2) OLS of $y_i$ on $\hat X_i$.
An estimate of the variance matrix of $\hat\beta_{2SLS}$ that is both panel and heteroskedasticity
robust is that given in (22.5) with $W_N = [Z'Z]^{-1}$.


Two-Step  GMM 

The most efficient GMM estimator based on the unconditional moment condition
(22.3) uses weighting matrix $W_N = \hat S^{-1}$, where $\hat S$ is consistent for $S$ defined in (22.6);
see Section 6.4.2 for the general result. Using $\hat S$ in (22.7) yields the two-step GMM
estimator

$$\hat\beta_{2SGMM} = [X'Z\hat S^{-1}Z'X]^{-1}X'Z\hat S^{-1}Z'y. \tag{22.9}$$

Then (22.5) simplifies and $\hat V[\hat\beta_{2SGMM}] = [X'Z(N\hat S)^{-1}Z'X]^{-1}$.

This is called two-step GMM since a first-step consistent estimator of $\beta$ such as
$\hat\beta_{2SLS}$ is needed to form the residuals $\hat u_i$ used to compute $\hat S$.
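The one-step and two-step estimators can be sketched together, since the second step reuses the first-step residuals. The example below is a minimal illustration on simulated data with one endogenous regressor and two instruments; the data-generating process, the array layout, and the name `two_step_gmm` are assumptions made here, not taken from the text.

```python
import numpy as np

def two_step_gmm(y, X, Z):
    """Return (beta_2sls, beta_2sgmm, V_2sgmm) following (22.8) and (22.9).

    y: (N, T), X: (N, T, K), Z: (N, T, r) with r >= K.
    """
    N = y.shape[0]
    ZX = np.einsum("itr,itk->rk", Z, X)    # Z'X
    Zy = np.einsum("itr,it->r", Z, y)      # Z'y
    ZZ = np.einsum("itr,its->rs", Z, Z)    # Z'Z

    def gmm(W):
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    b1 = gmm(np.linalg.inv(ZZ))                    # one-step GMM (2SLS)
    g = np.einsum("itr,it->ir", Z, y - X @ b1)     # Z_i' u_i from first step
    S_hat = g.T @ g / N                            # equation (22.7)
    b2 = gmm(np.linalg.inv(S_hat))                 # two-step GMM
    V2 = np.linalg.inv(ZX.T @ np.linalg.inv(N * S_hat) @ ZX)
    return b1, b2, V2

# Simulated data (hypothetical): x is correlated with u through v.
rng = np.random.default_rng(2)
N, T = 500, 3
z = rng.normal(size=(N, T, 2))                     # two instruments per period
v = rng.normal(size=(N, T))
x = z.sum(axis=2) + v
u = 0.8 * v + 0.5 * rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
y = 1.0 * x + u                                    # true coefficient is 1
X = x[:, :, None]                                  # K = 1 regressor

b_2sls, b_2sgmm, V_2sgmm = two_step_gmm(y, X, z)
```

In this design pooled OLS of `y` on `x` would be inconsistent because `x` and `u` share the component `v`, whereas both GMM estimates rely only on the instruments `z`.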
Efficiency Gains

In this chapter the focus is on situations where $Z$ cannot contain all of $X$, because
of endogeneity of some components of $X$. Then panel GMM provides consistent
estimates when OLS does not. Two-step GMM provides the most efficient estimator based
on the moment condition $E[Z_i'u_i] = 0$.

Even if regressors are strongly exogenous, two-step GMM has the attraction of being
more efficient than pooled OLS. To see this, suppose that $X$ is strongly exogenous.
Setting $Z = X$, the two-step GMM estimator simplifies to $[X'X]^{-1}X'y$ and there is no
benefit to panel GMM. However, if instead $Z$ equals $X$ as well as some additional
variables, such as powers of the regressors or regressor values in periods other than
the current period, then the two-step GMM method is at least as efficient as OLS, with
equality applying if the errors $u_{it}$ are iid.

Even more efficient estimators than $\hat\beta_{2SGMM}$ are possible, by widening the definition
of $Z_i$, by using the optimal moment condition based on $E[u_i|Z_i] = 0$, which need not
be $E[Z_i'u_i] = 0$ (see Section 22.4.3), and by using additional moment restrictions. We
shy away from calling two-step GMM the optimum GMM estimator, as in Section
6.3, because it is only optimal given (22.3).


Tests  of  Overidentifying  Restrictions 


If there are $r$ instruments and only $K$ parameters to estimate, then panel GMM
estimation leaves $(r - K)$ overidentifying restrictions. From Section 6.3.8 this permits a
test of overidentifying restrictions

$$\mathrm{OIR} = \left[\sum_{i=1}^N \hat u_i'Z_i\right](N\hat S)^{-1}\left[\sum_{i=1}^N Z_i'\hat u_i\right], \tag{22.10}$$

where $\hat u_i = y_i - X_i\hat\beta_{2SGMM}$, $\hat S$ is given in (22.7), and independence over $i$ is assumed
but heteroskedasticity and correlation over $t$ for given $i$ is permitted. Note that $\hat\beta_{2SGMM}$
must be used, not $\hat\beta_{2SLS}$.

This test statistic is distributed as $\chi^2(r - K)$ under the null hypothesis that the
overidentifying restrictions are valid. If OIR is large then the overidentifying moment
conditions are rejected and we conclude that some of the instruments in $Z_i$ are
correlated with the error and hence are endogenous.
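A minimal sketch of the OIR statistic follows, again on simulated data. Two choices here are assumptions rather than statements from the text: $\hat S$ is formed from first-step (2SLS) residuals, the same matrix used to weight the second step, and the function name `oir_test` and array layout are made up for illustration. The function returns the statistic and its degrees of freedom $r - K$; a p-value would then come from a chi-square distribution function in whatever statistics library is at hand.

```python
import numpy as np

def oir_test(y, X, Z):
    """Overidentifying restrictions statistic (22.10) and its degrees of freedom.

    y: (N, T), X: (N, T, K), Z: (N, T, r) with r >= K.
    """
    N, K, r = y.shape[0], X.shape[2], Z.shape[2]
    ZX = np.einsum("itr,itk->rk", Z, X)
    Zy = np.einsum("itr,it->r", Z, y)
    ZZ = np.einsum("itr,its->rs", Z, Z)

    def gmm(W):
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    b1 = gmm(np.linalg.inv(ZZ))                     # first step: 2SLS
    g1 = np.einsum("itr,it->ir", Z, y - X @ b1)
    NS = g1.T @ g1                                  # N * S_hat (assumed: first-step residuals)
    b2 = gmm(np.linalg.inv(NS / N))                 # second step: two-step GMM
    g2 = np.einsum("itr,it->r", Z, y - X @ b2)      # sum_i Z_i' u_i at b2
    stat = g2 @ np.linalg.solve(NS, g2)
    return stat, r - K

# Simulated data (hypothetical): valid instruments, one endogenous regressor.
rng = np.random.default_rng(3)
N, T = 500, 3
z = rng.normal(size=(N, T, 2))
v = rng.normal(size=(N, T))
x = z.sum(axis=2) + v
y = x + 0.8 * v + rng.normal(size=(N, T))
X = x[:, :, None]

stat_over, df_over = oir_test(y, X, z)              # r = 2, K = 1
stat_just, df_just = oir_test(y, X, z[:, :, :1])    # r = K = 1
```

In the just-identified case the sample moments are set exactly to zero, so the statistic is zero up to rounding and there is nothing to test.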


22.2.4.  Selection  of  Instruments 

The discussion so far has assumed the existence of a $T \times r$ matrix of instruments $Z_i$
that satisfies (22.3). Now we provide a lengthy discussion of how to obtain instruments
in a panel setting.

In cross-section models, endogenous variables are instrumented by variables that
do not appear as regressors in the equation of interest. Such variables can also be used
as instruments in the panel case. With panel models, however, the additional periods of
data provide additional moment conditions and additional instruments that can easily
lead to identification or overidentification of $\beta$.
The number of moment conditions and instruments available expands as
progressively stronger assumptions are made about the correlation between $u_{it}$ and $z_{is}$,
$s, t = 1, \ldots, T$. We consider the effect of progressively stronger exogeneity
assumptions, see Section 2.3, following M.-J. Lee (2002). The emphasis is on using
exogenous components of the regressors as instruments more than once, but the technique
also applies to more traditional instruments that are variables excluded from the
regression (22.1).


Summation  Assumption 

An obvious procedure is to define $Z_i$ similarly to $X_i$. Then

$$Z_i = \begin{bmatrix} z_{i1}' \\ \vdots \\ z_{iT}' \end{bmatrix}, \tag{22.11}$$

where $z_{it}$ is $r \times 1$ and $E[Z_i'u_i] = 0$ if the summation assumption

$$E\left[\sum_{t=1}^T z_{it}u_{it}\right] = 0 \tag{22.12}$$

is satisfied.

This assumption corresponds to that used in pooled OLS regression of $y_{it}$ on $x_{it}$,
since if $z_{it} = x_{it}$ in (22.12) then the PGMM estimator defined in (22.4) simplifies to
$[\sum_i X_i'X_i]^{-1}\sum_i X_i'y_i$.

For this estimator to be feasible requires at least that the order condition be met, so
that $r \geq K$. Under the summation assumption it is just as difficult to find instruments
with panel data as it is with cross-section data.


Contemporaneous  Exogeneity  Assumption 

A stronger and more natural assumption is the contemporaneous exogeneity
assumption that

$$E[z_{it}u_{it}] = 0, \quad t = 1, \ldots, T, \tag{22.13}$$

so that the instruments are assumed to be contemporaneously uncorrelated with the
error term.

This presents many more moment conditions, as in principle there are as many as $Tr$
moment conditions, where $r = \dim[z_{it}]$. To use these we define

$$Z_i = \begin{bmatrix}
z_{i1}' & 0 & \cdots & 0 \\
0 & z_{i2}' & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & z_{iT}'
\end{bmatrix}, \tag{22.14}$$

where $Z_i$ is now $T \times Tr$. The moment condition (22.3) holds, since $E[Z_i'u_i] = 0$ by
(22.13), but now (22.3) defines $Tr$ moment conditions that can be used to estimate the
$K$ components of $\beta$.

This remarkable result of an apparent surfeit of moment restrictions comes about
because of the implicit assumption that $\beta$ is time-invariant, so that each additional time
period offers additional moment restrictions.

The number of additional moment restrictions is reduced to the extent that $\beta$ is time
varying. In particular, the intercept is often permitted to vary over time by inclusion in
$x_{it}$ of $(T - 1)$ time dummies $d_{s,it} = 1$ if $t = s$ and 0 otherwise, for $s = 2, \ldots, T$. Then
the condition $E[d_{s,it}u_{it}] = 0$ cannot be used as it duplicates the condition $E[1 \times u_{it}] = 0$
implied by inclusion of an intercept in $x_{it}$. In the preceding example, if $x_{it}$ includes
time dummies then there are only $TK - (T - 1)$ moment conditions available. Any
time-invariant regressors can be used only once as an instrument.

Weak  Exogeneity  Assumption 

Moment condition (22.13) considers only contemporaneous correlation between
instruments and regressors. A stronger assumption is the weak exogeneity assumption
or predetermined instruments assumption that additionally lagged values of the
instruments are uncorrelated with the current-period error, so that

$$E[z_{is}u_{it}] = 0, \quad s \leq t, \quad t = 1, \ldots, T. \tag{22.15}$$

Condition (22.15) permits $z_{i1}, \ldots, z_{it}$ to be instruments for $u_{it}$, though future values
of $z_{is}$ cannot be so used. The instrument $Z_i$ is structured similarly to (22.14), except
that $z_{it}'$ is replaced by the expanded instrument vector $[z_{i1}', \ldots, z_{it}']$ that increases in
size as $t$ increases.

Conditions of this sort arise in rational expectations models and in models of
intertemporal decision making under uncertainty that lead to Euler conditions of the
form $E[u_{it}|\mathcal{I}_{it}] = 0$, where $\mathcal{I}_{it}$ is the information set available at time $t$ and an
example of $u_{it}$ is given in Section 6.2.7. If the information set includes current and past
values of $z_{it}$ then $E[u_{it}|z_{is}] = 0$, $s \leq t$, leading to (22.15).

More generally these conditions become relevant in dynamic models with lagged
dependent variables as regressors (see Section 22.5). In some instances contemporaneous
correlation is not ruled out, so that the inequality $s \leq t$ in (22.15) is replaced by
$s < t$.

Note that time-invariant instruments can only be used once. Thus if $z_{it} = [z_{1i}, z_{2it}]$,
then $z_{1i}$ and $z_{2i1}, \ldots, z_{2it}$ are available as instruments.

Strong  Exogeneity  Assumption 

A stronger assumption than weak exogeneity is the strong exogeneity assumption
that future values of instruments are also uncorrelated with the current-period error, so
that

$$E[z_{is}u_{it}] = 0, \quad s, t = 1, \ldots, T. \tag{22.16}$$

Then current, past, and future values of $z_{is}$ are valid instruments for $u_{it}$.

This assumption was maintained for the regressors $x_{it}$ throughout Chapter 21, since
$E[u_{it}|x_{i1}, \ldots, x_{iT}] = 0$ implies $E[u_{it}|x_{is}] = 0$, $1 \leq s \leq T$, and hence $E[x_{is}u_{it}] = 0$. It
may be appropriate for static models, but for dynamic models at most weak exogeneity
of instruments can be assumed.

Condition (22.16) permits $z_{i1}, \ldots, z_{iT}$ to be instruments for $u_{it}$. The instrument $Z_i$
is structured similarly to (22.14), except that $z_{it}'$ in (22.14) is replaced by the expanded
instrument vector $[z_{i1}', \ldots, z_{iT}']$.

As for the weak exogeneity case, time-invariant instruments can be used only once.
If $z_{it} = [z_{1i}\ z_{2it}]$ then $T(r_{TI} + Tr_{TV})$ moment conditions are available, where $r_{TI}$ and
$r_{TV}$ denote the numbers of time-invariant and time-varying instruments.

The extraordinary number of moment conditions, as many as $rT^2$, is due to
exclusion restrictions implicitly made in the panel model (22.1). For simplicity suppose all
components of $x_{it}$ are strongly exogenous and we wish to use these as instruments
whenever possible. In general $y_{it}$ could depend on the regressors in all time periods,
$x_{i1}, \ldots, x_{iT}$. In contrast, the panel model $y_{it} = x_{it}'\beta + u_{it}$ with $E[x_{it}u_{it}] = 0$ excludes
all but $x_{it}$ from the model for $y_{it}$. The strong exogeneity assumption that $E[x_{is}u_{it}] = 0$
then permits the excluded regressors $x_{is}$, $s \neq t$, to be used as instruments in addition
to $x_{it}$.


Redundant  Instruments 

If $z_{it}$ is varying over both $i$ and $t$ then its lags and leads can also be used as
instruments, depending on the exogeneity assumptions made. For the $it$th observation
the available instruments are $z_{it}$ under contemporaneous exogeneity, $z_{i1}, \ldots, z_{it}$ under
weak exogeneity, and $z_{i1}, \ldots, z_{iT}$ under strong exogeneity. This makes identification
possible using only exogenous regressors as instruments. Only under the summation
assumption are the difficulties of finding valid instruments comparable to those in the
cross-section case.

In practice, however, there are not as many available instruments as the preceding
discussion suggests. Time-invariant instruments $z_{it} = z_i$ can be used only once,
since then $z_{it} = z_{is}$ for all $s$ and $t$. For example, this is the case for an intercept or
for a race or gender indicator. If the instrument is a regressor and lagged values of
the regressor appear in the model then the number of available instruments is reduced.
Time-varying instruments that vary in a systematic way may also not be available in all
periods. Thus instruments that are the product of time dummies and a time-invariant
regressor should be included only once if a complete set of time dummies is used.
Examples include time dummies and time dummies interacted with race or gender
indicators. Instruments that are a linear function of time should be used only once. For
example, if year is an instrument then lagged years should not also be used. This
comment does not apply to age, which increases linearly for each individual but varies
across individuals.

It is clearly easy to inadvertently use redundant instruments. The panel GMM
estimators are still feasible and the usual results are valid if there are still sufficient
nonredundant instruments. For example, if $r$ instruments are used and two of these
are redundant the model is still estimable provided $r \geq K + 2$ as $Z'X$ is still of full
rank $K$. Singularity problems in GMM estimation may arise if too many redundant
instruments are used, leading to an underidentified model. Even if the model remains
overidentified, the degrees of freedom in a test of overidentifying restrictions will be
reduced if some instruments are redundant.


Weak  Instruments 

Weak instruments, not to be confused with weak exogeneity, were introduced in
Section 4.9. There is no well-established formal test of weak instruments. Standard $R^2$
and $F$-statistic diagnostics are given in Section 4.9. It is the incremental explanatory
power of the instruments that matters. So a partial $R^2$ that controls for exogenous
regressors that are also in the instrument set should be used. Moreover, whereas the
endogenous regressor is regressed on all instruments, the $F$-statistic should be one of the
overall significance of the subset of the instruments that are not exogenous regressors.

Since the errors here are not iid, the $F$-statistic should be based on panel-robust
standard errors. It can be calculated as $W/r^*$, where $W$ is the Wald chi-square test statistic
for exclusion restrictions given in Section 7.2.7 and $r^*$ is the number of instruments
that are not regressors in the original model.


22.2.5.  Computation  of  Panel  GMM  Estimators 

The moment conditions discussed in the preceding section provide the instrument
matrix $Z_i$. Then, given $Z_i$, one can estimate $\beta$ by $\hat\beta_{2SLS}$ defined in (22.8) or by $\hat\beta_{2SGMM}$
defined in (22.9).

The 2SLS estimator is easier to implement than the two-step GMM. Consider
estimation under the summation assumption, in which case $Z_i$ is defined in (22.11). Then
$\hat\beta_{2SLS}$ is given in (22.8), where $Z'X = \sum_i Z_i'X_i = \sum_i\sum_t z_{it}x_{it}'$ and similar algebra
applies for the other cross-products. This yields the standard textbook formula for 2SLS,
except that summation is over both $i$ and $t$. Thus $\hat\beta_{2SLS}$ can be obtained by 2SLS
regression of $y_{it}$ on $x_{it}$ using a cross-section 2SLS package. Panel-robust standard
errors can then be obtained using a cluster-robust option that permits clustering on $i$,
or by a panel bootstrap that resamples over $i$ rather than both $i$ and $t$. The approaches
are similar to those for pooled LS given in Section 21.2.3, which provides additional
detail.

For assumptions other than the summation assumption one can still use a
cross-section 2SLS package by appropriately defining the instrument matrix $Z_i$, which then
has a more complicated form. For the contemporaneous exogeneity assumption, $Z_i$ is
defined in (22.14). This is in the same form as (22.11) if the $t$th row in (22.11), $z_{it}'$, is
replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \ z_{it}' \ 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.17}$$

where $r_s = \dim[z_{is}]$ and $0_{r_s}$ denotes an $r_s \times 1$ vector of zeros. Similarly, for the weak
exogeneity assumption, $Z_i$ is as in (22.11) with the $t$th row in (22.11), $z_{it}'$, replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \ (z_i^t)' \ 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.18}$$

where $(z_i^t)' = [z_{i1}' \cdots z_{it}']$ and $r_s = \dim[z_i^s]$, and for the strong exogeneity assumption,
$Z_i$ is as in (22.11) with the $t$th row in (22.11), $z_{it}'$, replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \ (z_i^T)' \ 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.19}$$

where $(z_i^T)' = [z_{i1}' \cdots z_{iT}']$ and $r_s = \dim[z_i^T]$. A practical example of generating the
instruments is given in Section 22.3.
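The row-by-row construction in (22.17)-(22.19) can be sketched as a small helper that builds $Z_i$ for one individual from the $T \times r$ array whose $t$th row is $z_{it}'$. The function name `build_instruments` and the string labels for the assumptions are hypothetical. The sketch does not remove redundant columns (for example, from time-invariant instruments), so in practice those would still need to be dropped.

```python
import numpy as np

def build_instruments(z_i, assumption):
    """Build the instrument matrix Z_i for one individual.

    z_i: (T, r) array whose t-th row is z_it'.
    assumption: "summation", "contemporaneous", "weak", or "strong".
    """
    T, r = z_i.shape
    if assumption == "summation":
        return z_i.copy()                       # equation (22.11)
    blocks = []
    for t in range(T):
        if assumption == "contemporaneous":
            blocks.append(z_i[t])               # z_it' only
        elif assumption == "weak":
            blocks.append(z_i[: t + 1].ravel()) # z_i1', ..., z_it'
        elif assumption == "strong":
            blocks.append(z_i.ravel())          # z_i1', ..., z_iT'
        else:
            raise ValueError(f"unknown assumption: {assumption}")
    width = sum(len(b) for b in blocks)
    Z = np.zeros((T, width))
    col = 0
    for t, b in enumerate(blocks):
        Z[t, col : col + len(b)] = b            # block for period t, zeros elsewhere
        col += len(b)
    return Z

# Small example with T = 3 periods and r = 2 instruments per period.
z_i = np.arange(1.0, 7.0).reshape(3, 2)         # rows: [1, 2], [3, 4], [5, 6]
Z_sum = build_instruments(z_i, "summation")
Z_con = build_instruments(z_i, "contemporaneous")
Z_weak = build_instruments(z_i, "weak")
Z_strong = build_instruments(z_i, "strong")
```

With $T = 3$ and $r = 2$ the column counts are $r = 2$, $Tr = 6$, $rT(T+1)/2 = 12$, and $rT^2 = 18$, matching the growth in moment conditions described above.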

In practice there can be too many moment conditions. For example, with 10
periods of data and 5 time-varying regressors the strong exogeneity assumption yields
as many as $5 \times 10^2 = 500$ moment conditions (and the preceding row vector has 500
entries) with only 5 parameters to estimate. The marginal value of an instrument may
be very slight, because of increasing multicollinearity among the instruments, leading
to a situation of weak instruments. Good practice is to treat time-varying instruments
that vary little over time as time-invariant. For example, use only the data for the first
period as an instrument. Even instruments that vary considerably over time might be
used for only a few periods rather than in all possible periods.

Computation of the more efficient $\hat\beta_{2SGMM}$ is not possible using only a 2SLS
package. Instead, either more specialized software is needed or the estimator needs to be
programmed using a matrix language.

Table  22.1  provides  a  summary  of  the  four  exogeneity  assumptions  and  the  resulting 
valid  instruments. 


22.2.6.  Variations  on  GMM  Estimation 

Although $\hat\beta_{2SGMM}$ is more efficient than $\hat\beta_{2SLS}$, several studies find it to have greater
finite-sample bias than $\hat\beta_{2SLS}$, especially when $r$ is much greater than $K$. For an
explanation see the discussion of finite-sample bias of optimal GMM in Section 6.3.5.

One  approach  is  to  be  judicious  in  the  use  of  instruments,  though  then  potential 
efficiency  gains  due  to  additional  instruments  are  lost. 

Several  authors  have  proposed  alternative  GMM  estimators  that  may  be  less  likely 
to  be  biased  in  finite  samples.  Many  of  these  are  presented  in  Section  6.4.4  and  are 
used  in  the  panel  study  by  Ziliak  (1997). 


Table 22.1. Panel Exogeneity Assumptions and Resulting Instruments

Exogeneity Assumption   Moment Condition                              Instrument Vector (a)
Summation               $E[\sum_t z_{it}u_{it}] = 0$                  $[z_{it}']$
Contemporaneous         $E[z_{it}u_{it}] = 0$, all $t$                $[0_{r_1}' \cdots 0_{r_{t-1}}' \ z_{it}' \ 0_{r_{t+1}}' \cdots 0_{r_T}']$
Weak                    $E[z_{is}u_{it}] = 0$, $s \leq t$, all $t$    $[0_{r_1}' \cdots 0_{r_{t-1}}' \ (z_i^t)' \ 0_{r_{t+1}}' \cdots 0_{r_T}']$
Strong                  $E[z_{is}u_{it}] = 0$, all $s$ and $t$        $[0_{r_1}' \cdots 0_{r_{t-1}}' \ (z_i^T)' \ 0_{r_{t+1}}' \cdots 0_{r_T}']$

(a) The instrument vector is the $t$th row of $Z_i$ in (22.11); $(z_i^t)' = [z_{i1}' \cdots z_{it}']$,
$(z_i^T)' = [z_{i1}' \cdots z_{iT}']$, and $r_s = \dim[z_{is}]$ or $\dim[z_i^s]$ or $\dim[z_i^T]$.
22.2.7. Chamberlain's Optimal Distance Estimator

Consider estimation of the individual-specific effects model

$$y_{it} = \alpha_i + x_{it}'\beta + u_{it}, \tag{22.20}$$

when regressors are strongly exogenous as in Chapter 21. In Sections 21.2.3 and 21.6.1
methods to obtain panel-robust standard errors for the within estimator were presented.

If panel-robust inference is warranted, because the errors $u_{it}$ are not iid, then the estimators
detailed in Chapter 21 are actually inefficient. More efficient estimation is possible
using optimal GMM applied to an overidentified model. Here $x_{is}$, $s \neq t$, are available as
additional instruments and GMM can be applied to a transformed model if elimination
of $\alpha_i$ is necessary (see Section 22.4.2). The efficiency improvement is analogous to
that for cross-section data with heteroskedasticity (see Section 6.3.5).

Chamberlain (1982, 1984) proposed the following more efficient estimator. The
model (22.20) can be stacked to yield

$$y_i = e\alpha_i + (I_T \otimes \beta')x_i + u_i, \tag{22.21}$$

where $e = (1, 1, \ldots, 1)'$ is a $T \times 1$ vector of ones, $x_i = [x_{i1}', \ldots, x_{iT}']'$ is a $TK \times 1$
vector, and $y_i$ and $u_i$ are $T \times 1$ vectors. Equation (22.21) makes clear the restrictions
that are implicitly made in static models that specify that $y_{it}$ depends only on
contemporaneous $x_{it}$. Chamberlain used linear projection arguments that rely on weaker
assumptions than those of conditional expectation. Let

$$E^*[\alpha_i|x_i] = \mu + \sum_{t=1}^T \lambda_t'x_{it} = \mu + \lambda'x_i,$$

where $E^*$ denotes linear projection. Given $E[u_i|\alpha_i, x_i] = 0$, (22.21) implies

$$E^*[y_i|x_i] = e\mu + (I_T \otimes \beta' + e\lambda')x_i.$$

This imposes restrictions on the unrestricted linear projection $E^*[y_i|x_i] = \pi_0 + \Pi x_i$,
specifically that $\Pi - I_T \otimes \beta' - e\lambda' = 0$.
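As a small worked illustration (not from the original text), take $T = 2$ and a single regressor, $K = 1$, so that $x_i = (x_{i1}, x_{i2})'$ and $\lambda = (\lambda_1, \lambda_2)'$. Writing $\Pi$ for the $T \times TK$ coefficient matrix of the unrestricted linear projection, the restriction $\Pi = I_T \otimes \beta' + e\lambda'$ becomes:

```latex
\Pi
= \begin{bmatrix} \beta & 0 \\ 0 & \beta \end{bmatrix}
+ \begin{bmatrix} 1 \\ 1 \end{bmatrix}
  \begin{bmatrix} \lambda_1 & \lambda_2 \end{bmatrix}
= \begin{bmatrix}
    \beta + \lambda_1 & \lambda_2 \\
    \lambda_1 & \beta + \lambda_2
  \end{bmatrix}.
```

The four elements of $\Pi$ depend on only three parameters $(\beta, \lambda_1, \lambda_2)$, so there is one overidentifying restriction: $\pi_{11} - \pi_{21} = \pi_{22} - \pi_{12} = \beta$. In words, $\beta$ is the amount by which the own-period coefficient exceeds the other-period coefficient on the same regressor.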

Rather than use GMM, Chamberlain proposed the following two-step procedure.
First, obtain $\hat\Pi$ by multivariate OLS regression of $y_i$ on intercepts and $x_i$. Second,
obtain the optimal MD estimator (see Section 6.7) that minimizes

$$Q_N(\beta, \lambda) = (\mathrm{Vec}[\hat\Pi - I_T \otimes \beta' - e\lambda'])'\,W_N\,(\mathrm{Vec}[\hat\Pi - I_T \otimes \beta' - e\lambda']),$$

where the optimal weighting matrix $W_N = (\hat V[\mathrm{Vec}[\hat\Pi]])^{-1}$. This yields an estimator $\hat\beta$
that is more efficient than OLS estimation of (22.20) if $u_{it}$ is heteroskedastic.

Minimum distance estimation has been supplanted by GMM; see Arellano (2003,
pp. 22-23) and Crepon and Mairesse (1995) for comparison of Chamberlain's MD
estimator with GMM. However, Chamberlain's approach of obtaining moment
restrictions via exogeneity assumptions and assumptions on the individual effects has had
a big impact on the panel literature. His MD estimator is also used for estimation of
covariance structures (see Section 22.5.4).
22.3. Panel GMM Example: Hours and Wages

We return to the hours-wages example of Section 21.3. Unlike in Chapter 21, regressors
are now permitted to be endogenous, and unlike in Section 22.2, an individual-specific
fixed effect is included. Estimation is by the IV methods of Section 22.2, after
first-differencing to eliminate the fixed effects.

The regression model is

$$\ln \mathrm{hrs}_{it} = \alpha_i + \beta_1 \ln \mathrm{wg}_{it} + \beta_2 \mathrm{kids}_{it} + \beta_3 \mathrm{age}_{it} + \beta_4 \mathrm{agesq}_{it} + \beta_5 \mathrm{disab}_{it} + u_{it},$$

where interest lies in the intertemporal substitution wage elasticity of labor supply, $\beta_1$,
the coefficient of lnwg, and the additional regressors are number of children, age, age
squared, and an indicator for disability.

MaCurdy (1981) derived this relationship using a life-cycle labor supply model under
uncertainty. The model is then a “$\lambda$-constant” model where $\alpha_i$ here equals $\lambda_i$, a
multiple of the marginal utility of initial wealth that is time-invariant but will differ
across individuals. Since $\lambda_i$ depends on variables and constraints it needs to be treated
as a fixed rather than random effect. The labor supply literature presents several methods
for controlling for this fixed effect.

One method, discussed further in Section 22.4.2, is to first difference the regression
model, yielding

$$\Delta \ln \mathrm{hrs}_{it} = \beta_1 \Delta \ln \mathrm{wg}_{it} + \beta_2 \Delta \mathrm{kids}_{it} + \beta_3 \Delta \mathrm{age}_{it} + \beta_4 \Delta \mathrm{agesq}_{it} + \beta_5 \Delta \mathrm{disab}_{it} + \Delta u_{it}. \qquad (22.22)$$

Estimation by OLS is then consistent for $\boldsymbol{\beta}$ if all regressors are exogenous. Note that
this differencing induces serial correlation in the error even if $u_{it}$ are iid, so panel-robust
standard errors should be used.

Ziliak (1997) instead permitted $\ln \mathrm{wg}_{it}$ to be contemporaneously correlated with
$u_{it}$, because of measurement error in wage or because of kink points in the budget
constraint. Then the OLS estimator of (22.22) is inconsistent.

Ziliak proposed IV estimation using suitably lagged regressors as instruments. Assume
that past wages are uncorrelated with the error, so that lnwg is weakly exogenous
aside from being contemporaneously correlated with the error. Then $E[\ln \mathrm{wg}_{is} u_{it}] = 0$
for $s \le t - 1$ implies that for the differenced model error $E[\ln \mathrm{wg}_{is} \Delta u_{it}] = 0$ for
$s \le t - 2$, so lnwg lagged two or more periods may be used as an instrument in the
first-differences model. Note that this means that at least three periods of the original
data are needed to identify $\boldsymbol{\beta}$.
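
The step from lag one to lag two can be checked in one line. The following is only a sketch that expands the differenced error, using nothing beyond the assumption just stated ($E[\ln \mathrm{wg}_{is} u_{it}] = 0$ for $s \le t - 1$):

```latex
E[\ln \mathrm{wg}_{is}\, \Delta u_{it}]
  = \underbrace{E[\ln \mathrm{wg}_{is}\, u_{it}]}_{=\,0 \text{ if } s \le t-1}
  \;-\; \underbrace{E[\ln \mathrm{wg}_{is}\, u_{i,t-1}]}_{=\,0 \text{ if } s \le t-2}
```

Both terms vanish only when $s \le t - 2$. Since the earliest instrument date is $s = 1$, the earliest differenced equation that can be instrumented is the one for $t = 3$, which is why three periods of data are the minimum.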

Ziliak’s study focused on the properties of panel GMM estimators with endogenous
regressors, so he treated all the regressors in (22.22) as endogenous and used as instruments
lags of one or more periods in the levels of the other four regressors. For
simplicity an intercept and time dummies, individual-invariant instruments that can be
only used once, were not included. Results here change little with inclusion of an intercept
as the dependent variable is in differenced form. Since $\ln \mathrm{wg}_{i,t-2}$ is always used
as an instrument the first two years are dropped and only the eight years 1981-1988
are used to estimate (22.22).

Table 22.2. Hours and Wages: GMM-IV Linear Panel Model Estimators (a)

                              Base Case             Stacked
               OLS         2SLS     2SGMM       2SLS     2SGMM
$\beta_1$     0.112       0.209     0.547      0.543     0.330
Panel se     (.096)      (.374)    (.327)     (.209)    (.110)
Het se       [.079]      [.423]     [-]       [.226]     [-]
Default se   {.023}      {.389}     {-}       {.169}     {-}
RMSE          .283        .296      .307       .307      .298
Instruments     5           9         9         72        72
OIR test        -           -       5.45         -      69.51
dof             -           -         4          -        67
p-value         -           -       .244         -       .393
N             4256        4256      4256       4256      4256

(a) Differenced regression uses annual data from 1981-1988 for 532 men. Reported are $\beta_1$, the coefficient of
$\Delta$lnwg, and three estimated standard errors: panel robust in parentheses, heteroskedastic robust in square brackets,
and usual default estimates that assume iid errors in curly braces. All regressions additionally include $\Delta$kids,
$\Delta$age, $\Delta$agesq, and $\Delta$disab as regressors but their coefficient estimates are not reported. The instruments are
lnwg lagged twice and kids, age, agesq, and disab lagged both once and twice. For the base case there are 9
instruments and for stacked instruments there are 8 x 9 = 72 instruments. RMSE is the root mean square error
of the residual. OIR is the overidentifying restrictions test statistic, dof is the degrees of freedom, and p-value
is the p-value for this test.


Table  22.2  presents  a  small  subset  of  the  many  results  given  in  tables  1  and  2  of 
Ziliak  (1997).  For  completeness  various  standard  error  estimates  are  given  but  the 
panel-robust  standard  errors  should  be  used. 

OLS: The column OLS reports OLS estimation of (22.22). The labor supply elasticity
of 0.112 differs a little from the estimate of 0.109 in the First-Diff column of Table
21.2 as here the four demographic variables are also included as regressors and an
additional year of data has been dropped. Because first differences are modeled the
model fit is poor, and the $R^2$ with additional inclusion of an intercept is 0.006.

2SLS with Base-Case Instruments: The base-case instruments use $\mathbf{Z}_i$ defined
in (22.11), where $\mathbf{z}_{it}$ has nine entries: $\ln \mathrm{wg}_{i,t-2}$, $\mathrm{kids}_{i,t-1}$, $\mathrm{age}_{i,t-1}$, $\mathrm{agesq}_{i,t-1}$,
$\mathrm{disab}_{i,t-1}$, $\mathrm{kids}_{i,t-2}$, $\mathrm{age}_{i,t-2}$, $\mathrm{agesq}_{i,t-2}$, and $\mathrm{disab}_{i,t-2}$. The model is then
overidentified with nine instruments and five parameters to estimate. The 2SLS estimate
of $\beta_1$ is much less precise than the OLS estimate, with standard error increasing
fourfold from 0.096 to 0.374. For the other regressors, not reported, the efficiency
loss is much less.

2SLS with Stacked Instruments: The base case is GMM based on the nine moment
conditions $E[\sum_{t=3}^{10} \mathbf{z}_{it} u_{it}] = \mathbf{0}$. The stacked instruments instead use 72 (= 8 x 9)
moment conditions $E[\mathbf{z}_{it} u_{it}] = \mathbf{0}$, $t = 3, \ldots, 10$, where $\mathbf{z}_{it}$ is as in the base case.
Then use $\mathbf{Z}_i$ defined in (22.14), where here $\mathbf{Z}_i$ is 8 years by 72 instruments. The
$t$th row of $\mathbf{Z}_i$ is given in (22.17), where $\mathbf{z}_{it}$ here is the 9 x 1 column vector of
instruments for the base case. To construct the instruments first generate 72 variables
$z_{sj,it}$ equal to zero for all $i$ and $t$, where $s$ denotes the year and $j$ denotes the $j$th
instrument. Then replace $z_{sj,it}$ by $z_{itj}$ if $t = s$ but leave $z_{sj,it} = 0$ if $t \ne s$. For
example, if $t = 3$ (the third year) set $z_{35,it}$ equal to $\mathrm{disab}_{i,t-1}$ if the fifth instrument is
$\mathrm{disab}_{i,t-1}$ and keep $z_{35,it}$ equal to zero for $t \ne 3$. The 2SLS estimates can then be
obtained by standard 2SLS regression of $\Delta \ln \mathrm{hrs}_{it}$ on the five regressors in (22.22)
with these 72 constructed variables as instruments. Using the expanded instruments
we have that the standard error of the 2SLS estimate falls from 0.374 to 0.209 and
is only twice that of the original OLS estimate.
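
The construction of the stacked instruments can be sketched in a few lines of Python. This is an illustration only: the function names `stack_instruments` and `stack_panel` and the list-of-lists layout are assumptions made here, not part of the text. The idea shown is that each base-case instrument vector is placed in the block of columns belonging to its own year, with zeros elsewhere.

```python
def stack_instruments(z_it, t, first_year, n_years):
    """Return one row of the stacked instrument matrix Z_i (hypothetical helper).

    z_it       : list of base-case instrument values for person i in year t
    t          : year index, with first_year <= t < first_year + n_years
    The row has n_years * len(z_it) entries. z_it fills the block of
    columns that belongs to year t and every other block stays at zero.
    """
    r = len(z_it)
    row = [0.0] * (n_years * r)
    block = t - first_year
    row[block * r:(block + 1) * r] = z_it
    return row


def stack_panel(z_i, first_year=3):
    """Build Z_i (n_years rows, n_years * r columns) from per-year vectors."""
    n_years = len(z_i)
    return [stack_instruments(z, first_year + k, first_year, n_years)
            for k, z in enumerate(z_i)]
```

With eight years (t = 3, ..., 10) and nine base-case instruments per year, `stack_panel` returns an 8 x 72 matrix for each individual, matching the dimensions described above. Stacking these matrices over individuals gives the instrument columns that a standard 2SLS routine would take.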

Two-step GMM: The two-step GMM estimates in Table 22.2 differ from those in
table 1 of Ziliak (1997) as a panel-robust estimate of $\mathbf{S}$ defined in (22.7) is used here
to form the weighting matrix, whereas Ziliak used the heteroskedastic-robust $\hat{\mathbf{S}} =
N^{-1} \sum_i \sum_t \hat{u}_{it}^2 \mathbf{z}_{it} \mathbf{z}_{it}'$. As expected, the two-step GMM estimator is more efficient than
2SLS, with standard error falling from 0.374 to 0.327 with base-case instruments
and from 0.209 to 0.110 with stacked instruments. This last standard error is not
much larger than that for OLS.

Test of Overidentifying Restrictions: The test statistic for overidentifying restrictions
is given in (22.10). From Table 22.2 for both base case and stacked instruments the
test statistic has p-value much higher than 0.05, so the restrictions are not rejected,
which is consistent with the overidentifying instruments being valid instruments.
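
The base-case p-value in Table 22.2 can be reproduced by hand. The sketch below is an illustration, not the authors' code: the helper name `chi2_sf_even_dof` is hypothetical, and it relies on the closed form of the chi-square upper tail that holds only for even degrees of freedom. The degrees of freedom equal the number of instruments minus the number of parameters.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability P[chi2(dof) > x], valid only for even dof.

    For dof = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires a positive even dof")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(dof // 2))


n_instruments, n_parameters = 9, 5          # base case in Table 22.2
dof = n_instruments - n_parameters          # overidentifying restrictions
p_value = chi2_sf_even_dof(5.45, dof)       # OIR statistic from Table 22.2
```

The stacked case has 72 - 5 = 67 degrees of freedom, which is odd, so it needs a general chi-square routine (for example `scipy.stats.chi2.sf`, if SciPy is available) rather than this shortcut.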

Test of Weak Instruments: Diagnostics for weak instruments were presented in Section
22.2.4 and Section 5.9. Since none of the regressors appear in the instrument
set the overall F-statistic from the first-stage regression is used rather than a subset
of regressors F-statistic. For the base-case instruments, regression of $\Delta$lnwg on
the nine instruments and a constant term yields panel-robust F = 2.80, and similar
regression for the 72 stacked instruments yields F = 1.90, indicating finite-sample
bias is very likely. Similar regressions for $\Delta$kids, $\Delta$age, $\Delta$agesq, and $\Delta$disab,
regressors in (22.22) that are also being treated here as endogenous, yield F > 8.5
in all cases. Shea’s partial $R^2$ (see Section 4.9.1) is 0.0036 for $\Delta$lnwg and exceeds
0.075 for the other four endogenous regressors. The weak instruments problem is
therefore due to the problems of finding a good instrument for $\Delta$lnwg.

Efficiency Gains: In this example panel GMM estimators were used to control for
endogeneity. However, even if all the regressors are assumed to be strongly exogenous,
panel GMM is still attractive as it is more efficient than OLS unless the
errors $u_{it}$ are iid; see the discussion after (22.20). As an example, the panel two-step
GMM estimator with instrument set the base-case instruments plus the five original
regressors in (22.22) yields $\hat{\beta}_1 = 0.016$ with a standard error of 0.076, lower than
the OLS standard error of 0.096.


22.4.  Random  and  Fixed  Effects  Panel  GMM 

We now augment the panel data model (22.1) by including a time-invariant additive
individual-specific effect $\alpha_i$, so

$$y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}. \qquad (22.23)$$

Then the error term in (22.1) is now modeled as $u_{it} = \alpha_i + \varepsilon_{it}$. For simplicity the same
notation is used for both fixed and random effects models, so in the case of the random
effects model the common intercept $\mu$ in Section 21.7 is subsumed into $\mathbf{x}_{it}'\boldsymbol{\beta}$.

Some components of the regressors $\mathbf{x}_{it}$ are assumed to be endogenous, with
$E[\mathbf{x}_{it}(\alpha_i + \varepsilon_{it})] \ne \mathbf{0}$, so that the OLS estimator of $\boldsymbol{\beta}$ is inconsistent. In this section
we propose IV estimators that yield consistent estimates of $\boldsymbol{\beta}$ in a variety of settings,
including fixed effects, random effects, a hybrid of the two, and systems of equations.


22.4.1.  Random  Effects  or  Fixed  Effects? 

Recall from Chapter 21 that the individual-specific effect $\alpha_i$ can be viewed as random
in both the FE and RE models. This random variable $\alpha_i$ was independent of $\mathbf{x}_{it}$ in the
RE model but correlated with $\mathbf{x}_{it}$ in the FE model. For the RE model all coefficients
are estimable, whereas in the FE model coefficients of time-invariant regressors are
not estimable as consistent estimation requires elimination of $\alpha_i$ and the time-invariant
regressors by differencing.

In this chapter with endogenous regressors we view a model to be a random effects
model if instruments $\mathbf{Z}_i$ exist that satisfy $E[\mathbf{Z}_i'(\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i)] = \mathbf{0}$. Then the methods of
Section 22.2 will permit consistent estimation of all regression parameters. If instead
it is possible only to find instruments such that $E[\mathbf{Z}_i'\boldsymbol{\varepsilon}_i] = \mathbf{0}$, but $E[\mathbf{Z}_i'\mathbf{e}\alpha_i] \ne \mathbf{0}$, we view
the model to be a fixed effects model. Then $\alpha_i$ must be eliminated by differencing, in
which case only the coefficients of time-varying regressors will be identified.


22.4.2.  IV  for  Fixed  Effects  Models 

The various differencing operations given in Section 21.2 applied to (22.23) lead to a
transformed model of the form

$$\tilde{y}_{it} = \tilde{\mathbf{x}}_{it}'\boldsymbol{\beta} + \tilde{\varepsilon}_{it},$$

where the tilde denotes a differencing transformation that eliminates $\alpha_i$, and leading
examples are given in the following. Upon stacking we get

$$\tilde{\mathbf{y}}_i = \tilde{\mathbf{X}}_i \boldsymbol{\beta} + \tilde{\boldsymbol{\varepsilon}}_i. \qquad (22.24)$$

If $E[\mathbf{x}_{it}\varepsilon_{it}] \ne \mathbf{0}$ then $E[\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\varepsilon}}_i] \ne \mathbf{0}$ and LS estimation of (22.24) leads to inconsistent
estimates.

We now consider IV estimation, assuming existence of instruments $\mathbf{Z}_i$ that satisfy
$E[\mathbf{Z}_i'\tilde{\boldsymbol{\varepsilon}}_i] = \mathbf{0}$. Then panel GMM estimation (IV, 2SLS, or 2SGMM) of (22.24) with instruments
$\mathbf{Z}_i$ yields consistent estimates of the coefficients of time-varying regressors.
Panel-robust standard errors can be computed as discussed in Section 22.2.2.

One  way  that  instruments  may  be  obtained  is  through  logic  similar  to  that  in  the 
cross-section  case.  A  valid  instrument  is  a  variable  correlated  with  the  regressor  but 
not  the  error,  yet  is  also  one  that  can  be  excluded  from  the  right-hand  side  of  (22.23). 
Another  way  to  obtain  instruments,  emphasized  here,  is  through  use  of  exogenous 
regressors  in  periods  other  than  the  current  period,  using  the  exogeneity  assumptions 
detailed in Section 22.2.4.

The primitive assumptions for instrument availability are those on correlation between
$\mathbf{z}_{is}$ and $\varepsilon_{it}$. However, here it is correlation between $\mathbf{z}_{is}$ and the differenced error
$\tilde{\varepsilon}_{it}$ that matters. In general, differencing, necessary to eliminate the fixed effect,
reduces the number of available instruments. Some differencing operations lead to
greater loss than others and can even lead to inconsistent IV estimation. We consider
three differencing operations with focus on weakly exogenous instruments. This
can be a more realistic assumption in practice, especially for application to dynamic
models.


IV  for  the  First-Differences  Model 

The first-differences IV estimator is the IV or 2SLS or panel GMM estimator of the
first-differences model

$$y_{it} - y_{i,t-1} = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta} + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T. \qquad (22.25)$$

The weak exogeneity assumption that $E[\mathbf{z}_{is}\varepsilon_{it}] = \mathbf{0}$ for $s \le t$ implies $E[\mathbf{z}_{is}(\varepsilon_{it} -
\varepsilon_{i,t-1})] = \mathbf{0}$ for $s \le t - 1$. First differencing therefore shortens the time series on the
available instrument set by one period, so that only $\mathbf{z}_{i,t-1}, \mathbf{z}_{i,t-2}, \ldots$ are available as
instruments. Assuming weak exogeneity, these yield a consistent IV estimator of $\boldsymbol{\beta}$.

The use of lagged regressors as instruments was first proposed by Anderson and
Hsiao (1981) in the context of dynamic panel models and was expanded upon by Holtz-Eakin,
Newey, and Rosen (1988) and Arellano and Bond (1991) (see Section 22.5.3).
Section 22.3 provided a detailed empirical example of this approach.

Note that one can instead use transformed instruments $\tilde{\mathbf{z}}_{is} = \Delta\mathbf{z}_{is} = \mathbf{z}_{is} - \mathbf{z}_{i,s-1}$,
$s \le t - 1$. However, there is no gain, since using $\Delta\mathbf{z}_{i,t-1}, \ldots, \Delta\mathbf{z}_{i2}, \mathbf{z}_{i1}$ is equivalent
to using $\mathbf{z}_{i,t-1}, \ldots, \mathbf{z}_{i2}, \mathbf{z}_{i1}$ as instruments, and only $\mathbf{z}_{i1}$ and not $\Delta\mathbf{z}_{i1}$ can be computed
if data begin in period 1.


IV  for  the  Within  or  Mean-Differenced  Model 

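The within IV estimator is the IV or 2SLS or panel GMM estimator of the within
model or mean-differenced model

$$y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta} + (\varepsilon_{it} - \bar{\varepsilon}_i). \qquad (22.26)$$

Then $E[\mathbf{z}_{is}\varepsilon_{it}] = \mathbf{0}$ for $s \le t$ no longer implies $E[\mathbf{z}_{is}(\varepsilon_{it} - \bar{\varepsilon}_i)] = \mathbf{0}$ even for $s$ much
less than $t$. To see this suppose that $E[\mathbf{z}_{is}\varepsilon_{it}] \ne \mathbf{0}$ for $s > t$. Then $E[\mathbf{z}_{is}\bar{\varepsilon}_i] \ne \mathbf{0}$ for all $s$
since $\bar{\varepsilon}_i = T^{-1}\sum_t \varepsilon_{it}$ includes past $\varepsilon_{it}$, which are correlated with $\mathbf{z}_{is}$.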
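
The source of the problem can be made explicit by expanding the moment for the mean-differenced error. The following is a sketch under weak exogeneity only ($E[\mathbf{z}_{is}\varepsilon_{ir}] = \mathbf{0}$ for $s \le r$), with the summation index $r$ introduced here for illustration:

```latex
E[\mathbf{z}_{is}(\varepsilon_{it} - \bar{\varepsilon}_i)]
  = E[\mathbf{z}_{is}\varepsilon_{it}] - \frac{1}{T}\sum_{r=1}^{T} E[\mathbf{z}_{is}\varepsilon_{ir}]
  = -\frac{1}{T}\sum_{r < s} E[\mathbf{z}_{is}\varepsilon_{ir}],
  \qquad s \le t .
```

The surviving terms involve errors dated before the instrument, which weak exogeneity leaves unrestricted, so the moment is generally nonzero whenever earlier periods $r < s$ exist. The term is of order $1/T$ and does not vanish in a short panel.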

Thus IV estimation of the within model leads to inconsistent estimation of $\boldsymbol{\beta}$ if the
instruments are weakly exogenous or if they satisfy the even weaker assumptions of
contemporaneous exogeneity or the summation condition. The within transformation
can only be used if the instruments are actually strongly exogenous.


IV for the Forward Orthogonal Deviations Model

An  alternative  method  to  first  differences,  one  that  also  requires  that  instruments  be 
only  weakly  exogenous  rather  than  strongly  exogenous,  was  proposed  by  Arellano 
and  Bover  (1995).  We  also  present  this  method,  even  though  first  differences  are  used 
much  more. 

For the stacked model (22.2) for the $i$th observation, the first-difference transformation
yields model $\mathbf{D}\mathbf{y}_i = \mathbf{D}\mathbf{X}_i\boldsymbol{\beta} + \mathbf{D}\boldsymbol{\varepsilon}_i$, where $\mathbf{D}$ is a $(T - 1) \times T$ matrix with entry
$D_{ts}$, $t = 1, \ldots, T - 1$, $s = 1, \ldots, T$, equal to minus one if $s = t$, equal to one if
$s = t + 1$, and equal to zero otherwise. If $\varepsilon_{it}$ are iid the transformed error is MA(1)
and $V[\mathbf{D}\boldsymbol{\varepsilon}_i] = \sigma^2\mathbf{D}\mathbf{D}'$. The GLS estimator then premultiplies $\mathbf{D}\boldsymbol{\varepsilon}_i$ by $(\mathbf{D}\mathbf{D}')^{-1/2}$ or
premultiplies $\boldsymbol{\varepsilon}_i$ by $(\mathbf{D}\mathbf{D}')^{-1/2}\mathbf{D}$. This yields a transformed model of the form (22.24)
where the tilde denotes premultiplication by $(\mathbf{D}\mathbf{D}')^{-1/2}\mathbf{D}$.

If the upper triangular Cholesky factorization is used to obtain $(\mathbf{D}\mathbf{D}')^{-1/2}$, then this
yields the forward orthogonal deviation model

$$c_t(y_{it} - \bar{y}_{it}^F) = c_t(\mathbf{x}_{it} - \bar{\mathbf{x}}_{it}^F)'\boldsymbol{\beta} + c_t(\varepsilon_{it} - \bar{\varepsilon}_{it}^F) \qquad (22.27)$$

(see Arellano, 2003, p. 17), where $c_t^2 = (T - t)/(T - t + 1)$ and the superscript $F$
denotes that only future values are used to form the average. For example, $\bar{y}_{it}^F = (T -
t)^{-1}\sum_{s=t+1}^{T} y_{is}$.

The transformation is called orthogonal deviations because, when $\varepsilon_{it}$ are iid, the transformed errors
$c_t(\varepsilon_{it} - \bar{\varepsilon}_{it}^F)$ are uncorrelated and homoskedastic, with the same variance as $\varepsilon_{it}$ (unit variance if $\varepsilon_{it}$ is standardized). The adjective forward is added
as the transformed error depends only on current and future values of the original
error. OLS estimation of (22.27) yields the within estimator of Chapter 21, so the
orthogonal deviations transformation is optimal if indeed $\varepsilon_{it}$ are iid.

The forward orthogonal deviations IV estimator is the IV or 2SLS or panel
GMM estimator of the model (22.27). For weakly exogenous instruments, $E[\mathbf{z}_{is}\varepsilon_{it}] =
\mathbf{0}$ for $s \le t$ implies $E[\mathbf{z}_{is}(\varepsilon_{it} - \bar{\varepsilon}_{it}^F)] = \mathbf{0}$ for $s \le t$. Forward orthogonal deviations
therefore lead to no loss in the number of available instruments. The transformation is
usually not applied to the instruments as $(\mathbf{z}_{it} - \bar{\mathbf{z}}_{it}^F)$ involves future values of $\mathbf{z}_{it}$ that in
many applications are correlated with $\varepsilon_{it}$.
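
A minimal Python sketch of the transformation in (22.27) for a single individual's series follows. The function name `forward_orthogonal_deviations` and the plain-list representation are assumptions made for illustration. It applies the scaling $c_t$ and the future-mean subtraction exactly as defined above.

```python
import math


def forward_orthogonal_deviations(v):
    """Forward orthogonal deviations of one individual's series.

    v is a list [v_1, ..., v_T]. The result has T - 1 entries
    c_t * (v_t - mean of v_{t+1}, ..., v_T), with
    c_t^2 = (T - t) / (T - t + 1), for t = 1, ..., T - 1.
    """
    T = len(v)
    out = []
    for t in range(1, T):                  # t is 1-based, as in the text
        future = v[t:]                     # v_{t+1}, ..., v_T
        c_t = math.sqrt((T - t) / (T - t + 1))
        out.append(c_t * (v[t - 1] - sum(future) / len(future)))
    return out
```

Applied to a constant series the function returns zeros, which is how the fixed effect $\alpha_i$ is removed. The sum of squares of the output equals the within (mean-differenced) sum of squares of the input, in line with the statement that OLS on (22.27) reproduces the within estimator. In an application the same function would be applied to $y_{it}$ and to each column of $\mathbf{x}_{it}$, but not to the instruments.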


22.4.3. IV for Random Effects Models

The model stacked for the $i$th observation is

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i,$$

where $\mathbf{e}$ is a $T \times 1$ vector of ones. Consistent but inefficient estimates can be obtained
by directly applying the panel GMM estimators of Section 22.2 given instruments $\mathbf{Z}_i$,
obtained through exclusion restrictions or through appropriate exogeneity restrictions,
such that $E[\mathbf{Z}_i'(\mathbf{e}\alpha_i + \boldsymbol{\varepsilon}_i)] = \mathbf{0}$. Here we go further and consider more efficient estimation
that, as in Chapter 21, controls for error correlation over time given the error
components model $u_{it} = \alpha_i + \varepsilon_{it}$.


IV Estimation of Transformed Model

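Assume that the instruments $\mathbf{Z}_i$ satisfy $E[\mathbf{u}_i \mid \mathbf{Z}_i] = \mathbf{0}$ and $V[\mathbf{u}_i \mid \mathbf{Z}_i] = \boldsymbol{\Omega}_i$, where $\boldsymbol{\Omega}_i$
has the same form as the standard RE model with diagonal entries $\sigma_\alpha^2 + \sigma_\varepsilon^2$ and off-diagonal
entries $\sigma_\alpha^2$. Note that this is a stronger assumption than $E[\mathbf{Z}_i'\mathbf{u}_i] = \mathbf{0}$ and will
therefore place restrictions on available instruments.

Given the conditional moment condition $E[\mathbf{u}_i \mid \mathbf{Z}_i] = \mathbf{0}$, from Section 6.3.7 the optimal
unconditional moment condition is

$$E[\mathbf{Z}_i'\boldsymbol{\Omega}_i^{-1}\mathbf{u}_i] = E[(\boldsymbol{\Omega}_i^{-1/2}\mathbf{Z}_i)'(\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i)] = \mathbf{0}.$$

This leads to GMM estimation in the transformed system $\mathbf{y}_i^* = \mathbf{X}_i^*\boldsymbol{\beta} + \mathbf{u}_i^*$ with transformed
instruments $\mathbf{Z}_i^*$, where the asterisk denotes premultiplication by the $T \times T$
matrix $\boldsymbol{\Omega}_i^{-1/2}$ or a consistent estimate $\hat{\boldsymbol{\Omega}}_i^{-1/2}$.

From Section 21.7.1 premultiplication by $\hat{\boldsymbol{\Omega}}_i^{-1/2}$ leads to the model

$$y_{it} - \hat{\lambda}\bar{y}_i = (\mathbf{x}_{it} - \hat{\lambda}\bar{\mathbf{x}}_i)'\boldsymbol{\beta} + \{(1 - \hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)\}, \qquad (22.28)$$

where $\hat{\lambda}$ is a consistent estimate of $\lambda = 1 - \sigma_\varepsilon/\sqrt{\sigma_\varepsilon^2 + T\sigma_\alpha^2}$. The random effects IV
estimator is the IV or 2SLS estimator of this model with transformed instruments
$\mathbf{z}_{it}^* = (\mathbf{z}_{it} - \hat{\lambda}\bar{\mathbf{z}}_i)$, or equivalently with instruments $\mathbf{z}_{it} - \bar{\mathbf{z}}_i$ and $\bar{\mathbf{z}}_i$.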

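This method requires a consistent estimate $\hat{\lambda}$ of $\lambda$. For $\sigma_\varepsilon^2$ we use $\hat{\sigma}_\varepsilon^2 =
\sum_i\sum_t \hat{\varepsilon}_{it}^2/N(T - 1)$, where $\hat{\varepsilon}_{it}$ is the residual from within IV regression of $y_{it} - \bar{y}_i$ on
$(\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$ with instruments $(\mathbf{z}_{it} - \bar{\mathbf{z}}_i)$ (see (22.26)). Also, $(\sigma_\varepsilon^2 + T\sigma_\alpha^2)$ can be estimated
by $\sum_i\sum_t \hat{\bar{u}}_i^2/N$, where $\hat{\bar{u}}_i$ is the residual from the between IV regression of $\bar{y}_i$ on $\bar{\mathbf{x}}_i$ with
instruments $\bar{\mathbf{z}}_i$. The resulting IV estimator of $\boldsymbol{\beta}$ is called the error components 2SLS
(EC2SLS) estimator by Baltagi (1981).

These results are dependent on specification of a particular functional form for $\boldsymbol{\Omega}_i$.
The results in Section 22.2.2 permit inference that is robust to misspecification of
$\boldsymbol{\Omega}_i$, using (22.5) where $\mathbf{y}$, $\mathbf{X}$, $\mathbf{Z}$, and $\mathbf{W}_N = [\mathbf{Z}'\mathbf{Z}]^{-1}$ are replaced by the transformed
variables in (22.28).

A more important restriction is that this method can only be used if the original
instruments are strongly exogenous. Here consistency requires that $E[\mathbf{Z}_i'\boldsymbol{\Omega}_i^{-1}\mathbf{u}_i] =
\mathbf{0}$, a much stronger assumption than $E[\mathbf{Z}_i'\mathbf{u}_i] = \mathbf{0}$, which essentially requires that
$E[\mathbf{u}_i \mid \mathbf{Z}_i] = \mathbf{0}$. For example, suppose $E[\mathbf{z}_{it}\alpha_i] = \mathbf{0}$ for all $t$ whereas $E[\mathbf{z}_{is}\varepsilon_{it}] = \mathbf{0}$ for
$s \le t$ but $E[\mathbf{z}_{is}\varepsilon_{it}] \ne \mathbf{0}$ for $s > t$. Then $E[\mathbf{z}_{is}\bar{\varepsilon}_i] \ne \mathbf{0}$, leading to correlation of instruments
with the error term in (22.28).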
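
The quasi-demeaning behind (22.28) is simple to compute once the two variance components are available. The sketch below is illustrative only: the helper names `re_lambda` and `re_transform` are hypothetical, and the variance components are taken as given rather than estimated from the within and between IV residuals.

```python
import math


def re_lambda(sigma2_eps, sigma2_alpha, T):
    """lambda = 1 - sigma_eps / sqrt(sigma_eps^2 + T * sigma_alpha^2)."""
    return 1.0 - math.sqrt(sigma2_eps) / math.sqrt(sigma2_eps + T * sigma2_alpha)


def re_transform(v, lam):
    """Quasi-demean one individual's series: v_t - lam * mean(v)."""
    v_bar = sum(v) / len(v)
    return [v_t - lam * v_bar for v_t in v]
```

Setting `lam = 0` leaves the data in levels and `lam = 1` gives the within transformation, so the random effects transformation lies between the two. In an application the same `re_transform` would be applied to $y_{it}$, to each column of $\mathbf{x}_{it}$, and to each instrument in $\mathbf{z}_{it}$ before running 2SLS.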


22.4.4.  IV  for  the  Hausman-Taylor  Hybrid  Model 

A leading example of endogeneity involves regressors correlated with the individual-specific
effect $\alpha_i$. This leads to inconsistency of the RE estimator of Chapter 21. An
obvious solution is to instead use the within (or fixed effects) estimator, which is consistent.
However, then the coefficients of time-invariant individual regressors cannot be
identified. This defeats the purpose of many panel studies - estimation of the effect of
time-invariant regressors, such as the effect of the level of schooling in a postschooling
earnings regression.

Hausman and Taylor (1981) considered the following variant of (22.23):

$$y_{it} = \mathbf{x}_{1it}'\boldsymbol{\beta}_1 + \mathbf{x}_{2it}'\boldsymbol{\beta}_2 + \mathbf{w}_{1i}'\boldsymbol{\gamma}_1 + \mathbf{w}_{2i}'\boldsymbol{\gamma}_2 + \alpha_i + \varepsilon_{it}, \qquad (22.29)$$

where some regressors are assumed to be correlated with $\alpha_i$ whereas others are not,
and $\mathbf{w}$ is introduced to denote time-invariant regressors. Specifically, $\mathbf{x}_{1it}$ and $\mathbf{w}_{1i}$ are
uncorrelated with $\alpha_i$ but $\mathbf{x}_{2it}$ and $\mathbf{w}_{2i}$ are correlated with $\alpha_i$. All regressors are assumed
to be uncorrelated with $\varepsilon_{it}$. In this model the $\alpha_i$ can be viewed as a hybrid of random
and fixed effects.

Hausman and Taylor (1981) proposed making use of the time-varying exogenous
regressor $\mathbf{x}_{1it}$ in two ways: to estimate $\boldsymbol{\beta}_1$ and as an instrument for $\mathbf{w}_{2i}$, permitting
estimation of $\boldsymbol{\gamma}$. Then $\boldsymbol{\gamma}$ is identified if the number of time-varying exogenous
regressors equals or exceeds the number of time-invariant endogenous regressors.
Amemiya and MaCurdy (1986) proposed a more efficient estimator that uses $\mathbf{x}_{1it}$ in
$(T + 1)$ ways: to estimate $\boldsymbol{\beta}_1$ and as $T$ instruments for $\mathbf{w}_{2i}$, permitting identification
if $\dim[\mathbf{w}_{2i}] \le T\dim[\mathbf{x}_{1it}]$. This approach to obtaining instruments from exogenous regressors
in periods other than the current period has already been discussed in detail
in Section 22.2.4.

Various  projections,  some  equivalent,  can  be  used  to  generate  suitable  instruments. 
Breusch,  Mizon,  and  Schmidt  (1989)  provided  a  simpler  presentation  and  projection 
that  permits  estimation  using  a  2SLS  package. 

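First consider consistent but inefficient estimation that ignores the panel correlation
structure of $(\alpha_i + \varepsilon_{it})$. The within transformation eliminates correlation with $\alpha_i$, so
$\ddot{\mathbf{x}}_{2it} = \mathbf{x}_{2it} - \bar{\mathbf{x}}_{2i}$ can be used as instrument for endogenous $\mathbf{x}_{2it}$. The instrument for $\mathbf{x}_{1it}$
is similarly $\ddot{\mathbf{x}}_{1it}$, rather than the more obvious $\mathbf{x}_{1it}$. Then $\bar{\mathbf{x}}_{1i}$ is used as an instrument
for endogenous $\mathbf{w}_{2i}$, whereas the exogenous $\mathbf{w}_{1i}$ is used as an instrument for itself.

Now consider efficient estimation under the random effects assumption that the
components $\alpha_i$ and $\varepsilon_{it}$ are homoskedastic. Then from (22.27) the random effects
differencing transformation (see (22.28)) leads to

$$\tilde{y}_{it} = \tilde{\mathbf{x}}_{1it}'\boldsymbol{\beta}_1 + \tilde{\mathbf{x}}_{2it}'\boldsymbol{\beta}_2 + \tilde{\mathbf{w}}_{1i}'\boldsymbol{\gamma}_1 + \tilde{\mathbf{w}}_{2i}'\boldsymbol{\gamma}_2 + \tilde{v}_{it}, \qquad (22.30)$$

where, for example, $\tilde{\mathbf{x}}_{1it} = \mathbf{x}_{1it} - \hat{\lambda}\bar{\mathbf{x}}_{1i}$, where an estimator for the scalar $\lambda$ has been
presented at the end of the preceding section. The Hausman-Taylor estimator is equivalent
to IV estimation of (22.30) using as instruments $\ddot{\mathbf{x}}_{1it}$, $\ddot{\mathbf{x}}_{2it}$, $\mathbf{w}_{1i}$, and $\bar{\mathbf{x}}_{1i}$. The exogenous
time-varying regressors $\mathbf{x}_{1it} = \ddot{\mathbf{x}}_{1it} + \bar{\mathbf{x}}_{1i}$ are used as instrument twice, with
the within difference $\ddot{\mathbf{x}}_{1it}$ used as an instrument for $\mathbf{x}_{1it}$ and the time average $\bar{\mathbf{x}}_{1i}$ used
as an instrument for $\mathbf{w}_{2i}$. The estimator of Amemiya and MaCurdy (1986) instead uses
as instruments $\ddot{\mathbf{x}}_{1it}$, $\ddot{\mathbf{x}}_{2it}$, $\mathbf{w}_{1i}$, and $\mathbf{x}_{1i1}, \ldots, \mathbf{x}_{1iT}$, so that the entire history of $\mathbf{x}_{1it}$ rather
than just the time average is used as an instrument. This requires that $E[\mathbf{x}_{1it}\alpha_i] = \mathbf{0}$ for
$t = 1, \ldots, T$, a stronger assumption than $E[\bar{\mathbf{x}}_{1i}\alpha_i] = \mathbf{0}$ (see Section 22.2.4). Breusch
et al. (1989) proposed an even more efficient estimator using $\ddot{\mathbf{x}}_{2is}$, $s \ne t$, as additional
instruments.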

The major limitation of this approach is that it requires specification of which regressors
are either correlated or not correlated with $\alpha_i$. In a postschooling log-wage
regression, Hausman and Taylor begin by assuming that all three time-varying regressors
(experience, bad health, and unemployment last year) are exogenous, two
time-invariant regressors (race and union status) are exogenous, and the time-invariant
regressor of interest (schooling) is endogenous. In this specification there are two
overidentifying restrictions. A model specification test is possible using a Hausman
test based on the difference between $\hat{\boldsymbol{\beta}}_{HT}$ and $\hat{\boldsymbol{\beta}}_W$, since the within estimator for $\boldsymbol{\beta}$
is consistent regardless of which components of $\mathbf{x}_{it}$ and $\mathbf{w}_i$ are correlated with $\alpha_i$.
Cornwell and Rupert (1988) provide an empirical study that contrasts the various
estimators.


22.4.5.  SUR  and  Simultaneous  Equations  Estimation 

The  preceding  panel  data  analysis  has  focused  exclusively  on  estimation  of  a  single 
equation  in  isolation.  In  some  cases  it  may  be  desired  to  estimate  a  system  of  equations, 
such  as  a  system  of  demand  equations,  where  dependent  variables  and  regressors  are 
observed  for  many  individuals  at  several  points  in  time.  If  there  are  no  cross-equation 
restrictions on the parameters then single-equation estimation can yield consistent estimates,
but more efficient estimation is possible using joint equation estimation that
exploits  error  correlation  across  equations. 

In the Chapter 21 framework of strongly exogenous regressors, the more efficient
estimator is an extension of seemingly unrelated regressions from cross-section to
panel data. The error components SUR model specifies the $g$th of $G$ equations to
be given by

$$y_{git} = \mathbf{x}_{git}'\boldsymbol{\beta}_g + \alpha_{gi} + \varepsilon_{git}, \quad g = 1, \ldots, G, \qquad (22.31)$$

where, as in the cross-section case, $\alpha_{gi}$ is independent over $i$, $\varepsilon_{git}$ is independent over
$i$ and $t$, and $\alpha_{gi}$ and $\varepsilon_{git}$ are independent of each other. However, the error components
are allowed to be correlated across equations, so that $\mathrm{Cov}[\alpha_{gi}, \alpha_{hi}] \ne 0$ and
$\mathrm{Cov}[\varepsilon_{git}, \varepsilon_{hit}] \ne 0$ for $g \ne h$. Then the Chapter 21 methods yield consistent estimates.
The obvious single-equation estimator is the random effects estimator that is feasible
GLS controlling for the correlation within each equation. More efficient GLS estimators
that additionally control for cross-equation correlation in the errors are detailed in
Avery (1977) and Baltagi (1980).

Similar efficiency gains can be found when the system is one of simultaneous
equations, where now in (22.31) the regressor $\mathbf{x}_{git}$ may include one or more endogenous
regressors $y_{hit}$ from other equations. Then IV or GMM estimation of each single
equation yields consistent estimates, with the obvious estimator given the error components
structure being the random effects IV or EC2SLS estimator of Section 22.4.3.
More efficient estimates are obtained by systems estimation, using the error components
three-stage least-squares (EC3SLS) estimator proposed by Baltagi (1981).

The systems estimators are more difficult to implement and separate estimation of
each equation may be adequate. Even if this simpler approach is taken, however, much
can be gained in specifying a system of simultaneous equations as it permits identification
of the coefficients of endogenous regressors using as instruments exogenous
regressors excluded from the equation of interest. This provides a more traditional approach
to obtaining instruments than using as instruments exogenous regressors from
time periods other than the current one.


22.5. Dynamic Models

In this section we consider the usual individual-specific effects panel data model, with
the complication that the regressors include the dependent variable lagged once. Then
the model is a dynamic model with

$$y_{it} = \gamma y_{i,t-1} + \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T. \qquad (22.32)$$

As usual the panel is short with data independent over $i$. It is assumed that $|\gamma| < 1$, an
assumption relaxed in Section 22.5.4.

An important result is that even if $\alpha_i$ is a random effect, OLS estimation of (22.32)
leads to inconsistent estimation of $\gamma$ and $\boldsymbol{\beta}$. This is because the regressor $y_{i,t-1}$ is
correlated with $\alpha_i$ and hence with the composite error term $(\alpha_i + \varepsilon_{it})$. Alternative
estimators are needed even with random effects.

We consider estimation when $\alpha_i$ is a fixed effect, $|\gamma| < 1$, the error $\varepsilon_{it}$ is serially
uncorrelated, and the panel is short (see Section 22.5.3). Although this is the base
case for microeconometrics applications there exists a vast literature that changes one
or more of these assumptions. More generally the individual-specific effect may be
purely random, errors may be serially correlated, data may be nonstationary, and the
panel may be a long panel, but we barely touch on this literature.


22.5.1.  True  State  Dependence  and  Unobserved  Heterogeneity 

Before considering estimation, we note that time-series correlation in $y_{it}$ is now induced directly by $y_{i,t-1}$ in addition to the indirect effect via $\alpha_i$ already considered in Chapter 21. These two causes lead to quite different interpretations of correlation over time in, for example, individual earnings or welfare recipiency.

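For simplicity let $\boldsymbol{\beta} = \mathbf{0}$ so that $y_{it} = \gamma y_{i,t-1} + \alpha_i + \varepsilon_{it}$. Then $E[y_{it} \mid y_{i,t-1}, \alpha_i] = \gamma y_{i,t-1} + \alpha_i$ and $\mathrm{Cor}[y_{it}, y_{i,t-1} \mid \alpha_i] = \gamma$. Conditional on $\alpha_i$, the standard time-series results for an AR(1) model apply with dependence over time in $y_{it}$ determined solely by the autoregressive parameter $\gamma$. However, $\alpha_i$ is unknown and we actually observe $E[y_{it} \mid y_{i,t-1}] = \gamma y_{i,t-1} + E[\alpha_i \mid y_{i,t-1}]$ and $\mathrm{Cor}[y_{it}, y_{i,t-1}] \neq \gamma$. Specifically, from (22.32) with $\boldsymbol{\beta} = \mathbf{0}$

$$\begin{aligned}
\mathrm{Cor}[y_{it}, y_{i,t-1}] &= \mathrm{Cor}[\gamma y_{i,t-1} + \alpha_i + \varepsilon_{it},\; y_{i,t-1}] \\
&= \gamma + \mathrm{Cor}[\alpha_i, y_{i,t-1}] \\
&= \gamma + \frac{1 - \gamma}{1 + (1 - \gamma)\sigma_\varepsilon^2 / ((1 + \gamma)\sigma_\alpha^2)},
\end{aligned} \tag{22.33}$$

where the second equality assumes $\mathrm{Cor}[\varepsilon_{it}, y_{i,t-1}] = 0$ and the third equality is obtained after some algebra for the special case of random effects with $\varepsilon_{it}$ iid $[0, \sigma_\varepsilon^2]$ and $\alpha_i$ iid $[0, \sigma_\alpha^2]$.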
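As a rough numerical check of the last line of (22.33), the following sketch compares the closed-form correlation with a simulated one. It assumes normally distributed $\alpha_i$ and $\varepsilon_{it}$ and a burn-in long enough to approximate stationarity; the function names and parameter values are illustrative and do not come from the text.

```python
import math
import random


def theoretical_cor(gamma, var_alpha, var_eps):
    """Last line of (22.33): Cor[y_it, y_i,t-1] under random effects."""
    ratio = (1 - gamma) * var_eps / ((1 + gamma) * var_alpha)
    return gamma + (1 - gamma) / (1 + ratio)


def simulated_cor(gamma, var_alpha, var_eps, n=20000, burn_in=30, seed=0):
    """Sample correlation of (y_it, y_i,t-1) across n simulated individuals."""
    rng = random.Random(seed)
    sd_alpha, sd_eps = math.sqrt(var_alpha), math.sqrt(var_eps)
    lagged, current = [], []
    for _ in range(n):
        alpha = rng.gauss(0, sd_alpha)
        y = alpha / (1 - gamma)      # start at the individual's long-run mean
        y_lag = y
        for _ in range(burn_in):     # iterate toward the stationary distribution
            y_lag = y
            y = gamma * y_lag + alpha + rng.gauss(0, sd_eps)
        lagged.append(y_lag)
        current.append(y)
    mean_l = sum(lagged) / n
    mean_c = sum(current) / n
    cov = sum((a - mean_l) * (b - mean_c) for a, b in zip(lagged, current))
    var_l = sum((a - mean_l) ** 2 for a in lagged)
    var_c = sum((b - mean_c) ** 2 for b in current)
    return cov / math.sqrt(var_l * var_c)
```

With $\gamma = 0.5$ and $\sigma_\alpha^2 = \sigma_\varepsilon^2 = 1$ the formula gives 0.875, well above $\gamma$ itself, and with $\gamma = 0$ it collapses to $\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2) = 0.5$.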

Result (22.33) makes it clear that there are two possible reasons for correlation between $y_{it}$ and $y_{i,t-1}$.

True state dependence occurs when correlation over time is due to the causal mechanism that $y_{i,t-1}$ last period determines $y_{it}$ this period. This dependence is relatively large if the individual effect $\alpha_i \simeq 0$ as then $\mathrm{Cor}[y_{it}, y_{i,t-1}] \simeq \gamma$. More generally, this happens when $\sigma_\alpha^2$ is very small relative to $\sigma_\varepsilon^2$.

Correlation due to unobserved heterogeneity arises even if there is no causal mechanism, so $\gamma = 0$, but nonetheless there is correlation since $\mathrm{Cor}[y_{it}, y_{i,t-1}]$ simplifies to $\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$ if $\gamma = 0$, as in Chapter 21.

Both extremes permit this correlation to be arbitrarily close to one because either $\gamma \to 1$ or $\sigma_\varepsilon^2/\sigma_\alpha^2 \to 0$. However, these give two quite different explanations with quite different policy implications. A true state dependence explanation for earnings $y_{it}$ being continuously high over time even after controlling for regressors $\mathbf{x}_{it}$ is that future earnings are determined by past earnings and $\gamma$ is large. An unobserved heterogeneity explanation is that actually $\gamma$ is small, but important variables have been omitted from $\mathbf{x}_{it}$, leading to a high $\alpha_i$ in each time period. For duration data the distinction between true state dependence and unobserved heterogeneity was explored in Chapter 18. The static linear panel models of Chapter 21 considered only unobserved heterogeneity.


22.5.2.  Inconsistency  of  Standard  Panel  Estimators 

The estimators from the previous chapter are all inconsistent if the regressors include lagged dependent variables, even in the case of the random effects model. We consider estimation of the model given in (22.32), where the literature usually assumes that $\varepsilon_{it}$ are serially uncorrelated.

First consider OLS estimation of $y_{it}$ on $y_{i,t-1}$ and $\mathbf{x}_{it}$. The error term is then $(\alpha_i + \varepsilon_{it})$, which is correlated with the regressor $y_{i,t-1}$ since lagging the equation gives $y_{i,t-1} = \gamma y_{i,t-2} + \mathbf{x}_{i,t-1}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{i,t-1}$, so that $y_{i,t-1}$ is correlated with $\alpha_i$. Note that this is a departure from earlier results for OLS estimation of the random effects model without lagged dependent variable, as then OLS of $y_{it}$ on $\mathbf{x}_{it}$ yields a consistent, albeit inefficient, estimator. This is also a departure from the usual OLS result that regression of $y_{it}$ on $y_{i,t-1}$ yields a consistent estimate (though one biased in small samples) if the error is serially uncorrelated.

Second, consider the within estimator, which regresses $(y_{it} - \bar{y}_i)$ on $(y_{i,t-1} - \bar{y}_{i,-1})$ and $(\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$. This regression has error term $(\varepsilon_{it} - \bar{\varepsilon}_i)$. Now by (22.32), $y_{it}$ is correlated with $\varepsilon_{it}$, so $y_{i,t-1}$ is correlated with $\varepsilon_{i,t-1}$ and hence with $\bar{\varepsilon}_i$. This implies that the regressor $(y_{i,t-1} - \bar{y}_{i,-1})$ is correlated with the error $(\varepsilon_{it} - \bar{\varepsilon}_i)$. Thus OLS estimation of the within model leads to inconsistent parameter estimates, because the regressor is correlated with the error term. Consistency requires that $\bar{\varepsilon}_i$ becomes very small relative to $\varepsilon_{it}$, which requires $T \to \infty$; this occurs in long panels but not in short panels. A leading reference is Nickell (1981).
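The short-panel inconsistency of the within estimator can be illustrated by simulation. The sketch below assumes the pure AR(1) case ($\boldsymbol{\beta} = \mathbf{0}$), normal errors with unit variances, and initial values drawn from the stationary distribution; `simulate_panel` and `within_estimate` are illustrative names rather than anything defined in the text.

```python
import math
import random


def simulate_panel(n, t_periods, gamma, seed=0):
    """Return, for each individual, the series [y_0, y_1, ..., y_T]."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        alpha = rng.gauss(0, 1)
        # Initial value drawn from the stationary distribution given alpha.
        y = alpha / (1 - gamma) + rng.gauss(0, 1) / math.sqrt(1 - gamma ** 2)
        series = [y]
        for _ in range(t_periods):
            y = gamma * y + alpha + rng.gauss(0, 1)
            series.append(y)
        panel.append(series)
    return panel


def within_estimate(panel):
    """OLS of (y_it - ybar_i) on (y_i,t-1 - ybar_i,-1), pooled over i and t."""
    num = den = 0.0
    for series in panel:
        y, lag = series[1:], series[:-1]
        y_bar = sum(y) / len(y)
        lag_bar = sum(lag) / len(lag)
        for y_t, lag_t in zip(y, lag):
            num += (lag_t - lag_bar) * (y_t - y_bar)
            den += (lag_t - lag_bar) ** 2
    return num / den
```

With true $\gamma = 0.5$, calling `within_estimate(simulate_panel(2000, 5, 0.5))` gives an estimate far below 0.5 even though $N$ is large, whereas a long panel such as `simulate_panel(200, 200, 0.5)` gives an estimate close to 0.5.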

Inconsistency also arises for the random effects estimator given in Chapter 21, since this is a linear combination of the within and between estimators. For random effects models Anderson and Hsiao (1981) instead considered ML estimation when $\varepsilon_{it} \sim \mathcal{N}[0, \sigma_\varepsilon^2]$; see also Bhargava and Sargan (1983). In short panels the distribution of the MLE depends on the assumptions made on $y_{i0}$, the initial value of the dependent variable. Anderson and Hsiao (1981) distinguish among the following initial condition assumptions: (1) fixed initial observations, (2) random initial observations with a common mean, (3) random initial observations with different means, and (4) random initial observations with a stationary distribution.

The  first  differences  OLS  estimator  is  also  inconsistent,  but  an  IV  variant  leads  to 
consistent  estimates.  We  now  present  this  estimator. 

22.5.3.  Arellano-Bond  Estimator 
Model  (22.32)  leads  to  the  first-differences  model 

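$$y_{it} - y_{i,t-1} = \gamma (y_{i,t-1} - y_{i,t-2}) + (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta} + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T. \tag{22.34}$$

The OLS estimator is inconsistent because $y_{i,t-1}$ is correlated with $\varepsilon_{i,t-1}$ from (22.32), so the regressor $(y_{i,t-1} - y_{i,t-2})$ is correlated with the error $(\varepsilon_{it} - \varepsilon_{i,t-1})$ in (22.34).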

Anderson and Hsiao (1981) proposed estimating (22.34) using the instrumental variables estimator with $y_{i,t-2}$ as an instrument for $(y_{i,t-1} - y_{i,t-2})$. This is a valid instrument, since $y_{i,t-2}$ is not correlated with $(\varepsilon_{it} - \varepsilon_{i,t-1})$ assuming the errors $\varepsilon_{it}$ are serially uncorrelated. Furthermore, $y_{i,t-2}$ is a good instrument since it is correlated with $(y_{i,t-1} - y_{i,t-2})$. The method requires availability of three periods of data for each individual. An alternative is to use $\Delta y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$, which will require four periods of data. Anderson and Hsiao (1981) present results suggesting that the IV estimator is more efficient using $\Delta y_{i,t-2}$ rather than $y_{i,t-2}$ as the instrument in the usual case that $\gamma > 0$. In either case $(\mathbf{x}_{it} - \mathbf{x}_{i,t-1})$ is used as an instrument for itself.
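The contrast between first-differences OLS and the IV estimator with $y_{i,t-2}$ as instrument can be sketched as follows. This is a minimal simulation under assumed conditions (pure AR(1) with $\boldsymbol{\beta} = \mathbf{0}$, no time effects, normal errors with unit variances); the function names are illustrative.

```python
import math
import random


def simulate_panel(n, n_obs, gamma, seed=0):
    """Return n series of length n_obs from y_it = gamma*y_i,t-1 + alpha_i + eps_it."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        alpha = rng.gauss(0, 1)
        # Initial value drawn from the stationary distribution given alpha.
        y = alpha / (1 - gamma) + rng.gauss(0, 1) / math.sqrt(1 - gamma ** 2)
        series = [y]
        for _ in range(n_obs - 1):
            y = gamma * y + alpha + rng.gauss(0, 1)
            series.append(y)
        panel.append(series)
    return panel


def fd_ols(panel):
    """OLS of Delta y_it on Delta y_i,t-1 (inconsistent)."""
    num = den = 0.0
    for y in panel:
        for t in range(2, len(y)):
            dy, dy_lag = y[t] - y[t - 1], y[t - 1] - y[t - 2]
            num += dy_lag * dy
            den += dy_lag ** 2
    return num / den


def anderson_hsiao(panel):
    """IV estimate of gamma using the level y_i,t-2 as instrument for Delta y_i,t-1."""
    num = den = 0.0
    for y in panel:
        for t in range(2, len(y)):
            num += y[t - 2] * (y[t] - y[t - 1])
            den += y[t - 2] * (y[t - 1] - y[t - 2])
    return num / den
```

For a panel generated by `simulate_panel(10000, 6, 0.5)`, `fd_ols` returns a negative value while `anderson_hsiao` returns a value near the true $\gamma = 0.5$.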

More efficient estimation is possible by using additional lags of the dependent variable as instruments. For example, both $y_{i,t-2}$ and $y_{i,t-3}$ might be used as instruments. The model is then overidentified, so estimation should be by 2SLS or panel GMM. Furthermore, the number of instruments available is highest for the dependent variable observed at time $t$ closest to the final time period $T$. In period 3 only $y_{i1}$ is available as an instrument, in period 4 both $y_{i1}$ and $y_{i2}$ are available, in period 5 $y_{i1}$, $y_{i2}$, and $y_{i3}$ are available, and so on. Holtz-Eakin et al. (1988) and Arellano and Bond (1991) proposed panel GMM estimators using these wider unbalanced instrument sets.

The  microeconometrics  literature  refers  to  the  resulting  panel  GMM  estimator  as 
the  Arellano-Bond  estimator.  The  general  procedure  has  already  been  presented  in 
Section  22.4.2,  where  dynamics  were  not  explicitly  introduced.  The  estimator  is 

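$$\widehat{\boldsymbol{\beta}}_{AB} = \left[ \left( \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i \right) \mathbf{W}_N \left( \sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i \right) \right]^{-1} \left( \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i \right) \mathbf{W}_N \left( \sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{y}_i \right), \tag{22.35}$$

where $\mathbf{X}_i$ is a $(T - 2) \times (K + 1)$ matrix with $t$th row $(\Delta y_{i,t-1}, \Delta \mathbf{x}_{it}')$, $t = 3, \ldots, T$, $\mathbf{y}_i$ is a $(T - 2) \times 1$ vector with $t$th row $\Delta y_{it}$, and $\mathbf{Z}_i$ is a $(T - 2) \times r$ matrix of instruments

$$\mathbf{Z}_i = \begin{bmatrix} \mathbf{z}_{i3}' & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{z}_{i4}' & & \vdots \\ \vdots & & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{z}_{iT}' \end{bmatrix}, \tag{22.36}$$

where often $\mathbf{z}_{it}' = [y_{i,t-2}, y_{i,t-3}, \ldots, y_{i1}, \Delta \mathbf{x}_{it}']$. Lags of $\mathbf{x}_{it}$ or $\Delta \mathbf{x}_{it}$ can additionally be used as instruments, and for moderate or large $T$ there may be a maximum lag of $y_{it}$ that is used as an instrument, such as not more than $y_{i,t-4}$. Two-stage LS and two-step GMM correspond to different weighting matrices $\mathbf{W}_N$ (see Section 22.2.3).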
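The block-diagonal layout in (22.36) can be made concrete with a small helper. The sketch below covers only the pure AR(1) case with no regressors $\mathbf{x}_{it}$, so each block holds lagged levels of $y$ only; the function name and the `max_instruments` argument (a cap on the number of lags kept per period) are hypothetical conveniences, not part of the text.

```python
def ab_instrument_matrix(y, max_instruments=None):
    """Build Z_i from y = [y_i1, ..., y_iT] for the differenced equations t = 3..T.

    Row for period t carries (y_i,t-2, y_i,t-3, ..., y_i1) in its own block,
    with zeros elsewhere. If max_instruments is given, only the most recent
    max_instruments lags are kept in each block.
    """
    n_periods = len(y)
    blocks = []
    for t in range(3, n_periods + 1):                    # 1-based period index
        lags = [y[s - 1] for s in range(t - 2, 0, -1)]   # y_{t-2}, ..., y_1
        if max_instruments is not None:
            lags = lags[:max_instruments]
        blocks.append(lags)
    n_cols = sum(len(block) for block in blocks)
    rows, offset = [], 0
    for block in blocks:
        row = [0.0] * n_cols
        row[offset:offset + len(block)] = block
        rows.append(row)
        offset += len(block)
    return rows
```

For example, with $T = 5$ and `y = [1, 2, 3, 4, 5]` the matrix has $T - 2 = 3$ rows and $1 + 2 + 3 = 6$ columns, which shows how the instrument count grows with $t$.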

The method is easily adapted to an AR($p$) model, with $\gamma y_{i,t-1}$ in (22.32) replaced by $\gamma_1 y_{i,t-1} + \gamma_2 y_{i,t-2} + \cdots + \gamma_p y_{i,t-p}$, though more than three periods of data will be needed to permit consistent estimation.

The  empirical  example  in  Section  22.3  is  essentially  an  Arellano-Bond  estimation 
example,  since  a  first  differences  model  is  estimated  by  IV  with  lagged  regressors  used 
as  instruments. 

Ahn and Schmidt (1995) noted that more efficient estimation is possible using additional moment conditions. Consider the pure time-series version of (22.32) where $\boldsymbol{\beta} = \mathbf{0}$, and make the standard assumption that $\varepsilon_{it}$ is uncorrelated with $\alpha_i$, $\varepsilon_{is}$ for $s \neq t$ and the initial observation $y_{i1}$. The Arellano-Bond estimator uses the moment conditions $E[y_{is} \Delta u_{it}] = 0$ for $s \leq t - 2$, where $u_{it} = \varepsilon_{it} + \alpha_i$. Ahn and Schmidt (1995) obtain a more efficient estimator by additionally using the moment conditions $E[u_{iT} \Delta u_{it}] = 0$. They show that this estimator, which makes efficient use of the second moment assumptions, is asymptotically equivalent to the optimal minimum distance estimator of Chamberlain (1982, 1984).

Additional assumptions lead to additional moment conditions and hence more efficient estimation. If $V[\varepsilon_{it}] = V[\varepsilon_{is}]$, that is, if $\varepsilon_{it}$ is homoskedastic over time, then $E[\bar{u}_i \Delta u_{it}] = 0$ (see Ahn and Schmidt, 1995). Arellano and Bover (1995) propose using the condition $E[u_{it} \Delta y_{is}] = 0$ for $s \leq t - 1$. Blundell and Bond (1998) consider these and additional assumptions and show that the benefit can be considerable, especially when $\gamma$ is high and $T$ is small. Arellano and Honoré (2001) present many assumptions that might be made and the corresponding moment conditions that can be used in estimation.

Hsiao, Pesaran, and Tahmiscioglu (2002) propose a transformed ML estimator. Assume that $\varepsilon_{it}$ are iid $\mathcal{N}[0, \sigma^2]$, an assumption that can be relaxed. Rather than form the likelihood based on $\varepsilon_{i1}, \ldots, \varepsilon_{iT}$, they form the likelihood based on the error differences $\Delta \varepsilon_{i1}, \ldots, \Delta \varepsilon_{iT}$. For the pure time series AR(1) model $\Delta \varepsilon_{it} = \Delta y_{it} - \gamma \Delta y_{i,t-1}$ for $t > 1$. The density of $\Delta \varepsilon_{i1}$ depends on the assumptions made about initial conditions: either $\Delta \varepsilon_{i1} = \Delta y_{i1}$ or $\Delta \varepsilon_{i1} = \Delta y_{i1} - b$, where $b = E[\Delta y_{i1}]$ is an additional parameter to be estimated. The resulting estimator is a quasi-MLE that retains consistency even if $\varepsilon_{it}$ are nonnormal. If $\varepsilon_{it}$ are iid $[0, \sigma^2]$ then the transformed MLE is more efficient than the preceding GMM estimators.


22.5.4.  Estimation  of  Covariance  Structures 

Covariance structures are models that specify a structure for the covariance matrix of the regression error. Applications include structures for error dynamics and for measurement error. The goal is to estimate the parameters of the structure.

As an example, suppose that $y_{it}$ is generated by a random effects model with MA(1) error, so that

$$y_{it} = \alpha_i + \varepsilon_{it} + \phi \varepsilon_{i,t-1},$$

where $\alpha_i \sim [0, \sigma_\alpha^2]$ and $\varepsilon_{it} \sim [0, \sigma_\varepsilon^2]$ and $|\phi| < 1$. Then the autocovariances $\gamma_j = \mathrm{Cov}[y_{it}, y_{i,t-j}]$ satisfy $\gamma_0 = \sigma_\alpha^2 + (1 + \phi^2)\sigma_\varepsilon^2$, $\gamma_1 = \sigma_\alpha^2 + \phi \sigma_\varepsilon^2$, and $\gamma_j = \sigma_\alpha^2$ for $j \geq 2$. If $T = 3$ these equations yield estimates $\widehat{\sigma}_\alpha^2$, $\widehat{\sigma}_\varepsilon^2$, and $\widehat{\phi}$ given autocovariance estimates $\widehat{\gamma}_0$, $\widehat{\gamma}_1$, and $\widehat{\gamma}_2$. If $T > 3$ the model is overidentified as there are only three variance parameters to estimate but more than three autocovariance estimates. An obvious estimator is the minimum distance estimator.
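In the just-identified case $T = 3$ the three autocovariance equations can be inverted directly. The sketch below is one way to do this, choosing the invertible root $|\phi| < 1$; the function name is illustrative and the inputs are taken to be autocovariances that are exactly consistent with the model.

```python
import math


def solve_ma1_structure(g0, g1, g2):
    """Recover (var_alpha, var_eps, phi) from autocovariances gamma_0, gamma_1, gamma_2.

    Uses gamma_2 = var_alpha,
         gamma_0 - gamma_2 = (1 + phi^2) * var_eps,
         gamma_1 - gamma_2 = phi * var_eps.
    """
    var_alpha = g2
    a = g0 - g2                      # (1 + phi^2) * var_eps
    b = g1 - g2                      # phi * var_eps
    if b == 0:
        return var_alpha, a, 0.0     # no MA component
    rho = b / a                      # equals phi / (1 + phi^2)
    if abs(rho) > 0.5:
        raise ValueError("autocovariances are not consistent with any real phi")
    # Root of rho*phi^2 - phi + rho = 0 that satisfies |phi| < 1.
    phi = (1 - math.sqrt(1 - 4 * rho ** 2)) / (2 * rho)
    var_eps = b / phi
    return var_alpha, var_eps, phi
```

For instance, $\sigma_\alpha^2 = 2$, $\sigma_\varepsilon^2 = 1$, and $\phi = 0.5$ imply $\gamma_0 = 3.25$, $\gamma_1 = 2.5$, and $\gamma_2 = 2$, and the function recovers the three parameters from those values.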

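In general let $\boldsymbol{\theta}$ denote the $q$ structural parameters and suppose $\mathbf{g}(\boldsymbol{\theta}) = \boldsymbol{\gamma}$, where $\boldsymbol{\gamma} = [\gamma_0, \ldots, \gamma_{T-1}]'$ is the vector of $T \geq q$ autocovariances. Then the minimum distance estimator $\widehat{\boldsymbol{\theta}}_{MD}$ minimizes

$$Q_N(\boldsymbol{\theta}) = (\widehat{\boldsymbol{\gamma}} - \mathbf{g}(\boldsymbol{\theta}))' \mathbf{W}_N (\widehat{\boldsymbol{\gamma}} - \mathbf{g}(\boldsymbol{\theta})), \tag{22.37}$$

where $\widehat{\boldsymbol{\gamma}} = [\widehat{\gamma}_0, \ldots, \widehat{\gamma}_{T-1}]'$,

$$\widehat{\gamma}_j = [N(T - j)]^{-1} \sum_{t=j+1}^{T} \sum_{i=1}^{N} (y_{it} - \bar{y}_t)(y_{i,t-j} - \bar{y}_{t-j}), \tag{22.38}$$

and $\bar{y}_{t-j} = N^{-1} \sum_i y_{i,t-j}$. The weighting matrix $\mathbf{W}_N$ and further details on MD estimation are provided in Section 6.7. The restrictions of the model can be tested by use of the chi-squared test statistic given in Section 6.7. The discussion thus far has already imposed the restriction of covariance stationarity. One can more generally permit $\gamma_{tj} \neq \gamma_{sj}$ for $t \neq s$, where $\gamma_{tj} = \mathrm{Cov}[y_{it}, y_{i,t-j}]$. Then $\boldsymbol{\gamma}$ has $T(T + 1)/2$ entries $\gamma_{tj}$, $t = j + 1, \ldots, T$ and $j = 0, \ldots, T - 1$. The stationarity assumption is itself a testable assumption. Moreover, regressors can be incorporated by replacing $y_{it}$ by the residual $y_{it} - \mathbf{x}_{it}'\boldsymbol{\beta}$.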
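The sample autocovariances in (22.38), which are the inputs to the minimum distance step, can be computed as in the following sketch. It assumes a balanced panel stored as a list of per-individual lists; the function name is illustrative.

```python
def sample_autocovariances(panel):
    """Return [gamma_hat_0, ..., gamma_hat_{T-1}] following (22.38).

    panel[i][t] holds y for individual i in period t + 1 (0-based t).
    """
    n = len(panel)
    n_periods = len(panel[0])
    # Cross-section mean of y in each period.
    means = [sum(row[t] for row in panel) / n for t in range(n_periods)]
    gammas = []
    for j in range(n_periods):
        total = 0.0
        for t in range(j, n_periods):          # periods j+1, ..., T in 1-based terms
            for row in panel:
                total += (row[t] - means[t]) * (row[t - j] - means[t - j])
        gammas.append(total / (n * (n_periods - j)))
    return gammas
```

The resulting vector could then be passed to a numerical minimizer of (22.37) for a chosen weighting matrix; that step is not shown here.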

Abowd and Card (1989) provided an early application of this approach to joint modeling of earnings and hours. Altonji and Segal (1996) demonstrated that the optimal MD estimator can be quite biased in finite samples (see Section 6.3.5). Many of the applications are to models of earnings; see Baker and Solon (2003) for a recent example.

The  MD  approach  is  well  suited  to  estimation  of  covariance  structures.  The  panel 
data  sets  can  be  large,  but  by  first  estimating  the  autocovariances  the  estimation  is 
reduced  to  minimizing  (22.37).  Other  estimation  approaches  are  possible.  In  particular, 
see  MaCurdy  (1982b),  who  presents  Box-Jenkins  type  models  for  panel  data. 


22.5.5.  Nonstationary  Panels 

The panel literature on unit roots and nonstationarity emphasizes panels where both $N$ and $T$ are large. For unit root tests a key early paper is that by Levin and Lin (1992), ultimately published as Levin, Lin, and Chu (2002); Pesaran and Smith (1995) wrote an early paper that considered cointegration. Phillips and Moon (1999) and Pedroni (2004) provide general theory for inference with nonstationary panel data. Analysis is simplest using a sequential limit theory where, say, first $N$ is fixed and $T \to \infty$ and subsequently $N \to \infty$. A more robust approach uses joint limits where $T \to \infty$ and $N \to \infty$ simultaneously. Recent reviews of the literature include those by Phillips and Moon (2000) and Baltagi (2001, Chapter 12).

Less consideration has been given to nonstationary data in short panels. Harris and Tzavalis (1999) consider the unit root tests of Levin and Lin (1992) in short panels. Let $\widehat{\gamma}$ denote the within estimate of $\gamma$ in the AR(1) fixed effects model $y_{it} = \alpha_i + \gamma y_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it} \sim \mathcal{N}[0, \sigma^2]$. We consider the null hypothesis of a unit root, so $\gamma = 1$, and no intercept $\alpha_i = 0$, which corresponds to the pure time series case 2 in Hamilton (1994, p. 490). Under this null hypothesis the unit root test statistic

$$\frac{\sqrt{N}\,(\widehat{\gamma} - 1 + 3/(T + 1))}{\sqrt{[3(17T^2 - 20T + 17)]/[5(T - 1)(T + 1)^3]}} \xrightarrow{d} \mathcal{N}[0, 1]$$

as $N \to \infty$ for fixed $T$. Large negative values of this statistic lead to rejection of the unit root hypothesis. Levin and Lin (1992) provide additional tests, such as for models with individual time trends.
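The test statistic above is a simple function of $\widehat{\gamma}$, $N$, and $T$. The sketch below evaluates it, taking the displayed formula at face value with the denominator read as the square root of the asymptotic variance; the function name is illustrative and the within estimate is supplied by the caller.

```python
import math


def harris_tzavalis_stat(gamma_hat, n, t):
    """Standardized unit root statistic for the within estimate gamma_hat.

    n is the number of individuals and t is T in the displayed formula.
    """
    centered = gamma_hat - 1 + 3 / (t + 1)
    variance = 3 * (17 * t ** 2 - 20 * t + 17) / (5 * (t - 1) * (t + 1) ** 3)
    return math.sqrt(n) * centered / math.sqrt(variance)
```

The term $3/(T + 1)$ recenters the within estimate: under the null the statistic is centered at zero when $\widehat{\gamma} = 1 - 3/(T + 1)$, not when $\widehat{\gamma} = 1$, which reflects the downward bias of the within estimator in short panels.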

Binder,  Hsiao,  and  Pesaran  (2003)  consider  short  panel  estimation  of  fixed  effect 
dynamic  panel  models  with  unit  roots  and  cointegration.  With  unit  roots  the  Arellano- 
Bond  estimator  is  inconsistent,  though  the  extensions  due  to  Ahn  and  Schmidt  (1995) 
and  others  discussed  at  the  end  of  Section  22.5.3  yield  consistent  estimates.  Binder 
et  al.  (2003)  propose  quasi-ML  estimators  that  perform  better  in  finite  samples  when 
unit  roots  are  present. 


22.6.  Difference-in-Differences  Estimator 


The  evaluation  literature  presented  in  Chapter  25  focuses  on  measuring  the  treatment 
effect,  in  the  simplest  case  the  impact  or  marginal  effect  of  a  single  binary  regressor 
that  equals  one  if  treatment  occurs  and  equals  zero  if  treatment  does  not  occur.  For 
example,  interest  may  lie  in  measuring  the  effect  on  earnings  of  a  policy  change  (the 
binary  treatment)  that  alters  tax  rates  or  welfare  eligibility  or  access  to  training  for 
some  individuals  but  not  for  others. 

In this section we relate one of the methods of Chapter 25 to panel methods. Specifically the treatment effect can be measured using standard panel data methods if panel data are available before and after the treatment and if not all individuals receive the treatment. Then the first-differences estimator for the fixed effects model reduces to a simple estimator called the differences-in-differences estimator, introduced in Section 3.4.2 and also studied in Section 25.5. The latter estimator has the advantage that it can also be used when repeated cross-section data rather than panel data are available. However, it does rely on model assumptions that are often not made explicit. The treatment here follows Blundell and MaCurdy (2000).


22.6.1.  Fixed  Effects  with  Binary  Treatment 
Let the binary regressor of interest be

$$D_{it} = \begin{cases} 1 & \text{if individual } i \text{ receives treatment in period } t, \\ 0 & \text{otherwise.} \end{cases} \tag{22.39}$$

Assume a fixed effects model for $y_{it}$ with

$$y_{it} = \phi D_{it} + \delta_t + \alpha_i + \varepsilon_{it}, \tag{22.40}$$

where $\delta_t$ is a time-specific fixed effect and $\alpha_i$ is an individual-specific fixed effect. As noted in Section 21.2.1 this is equivalent to regression of $y_{it}$ on $D_{it}$ and a full set of time dummies with the complication of individual-specific fixed effects. For simplicity there are no other regressors.

The individual effects $\alpha_i$ can be eliminated by first differencing. Then

$$\Delta y_{it} = \phi \Delta D_{it} + (\delta_t - \delta_{t-1}) + \Delta \varepsilon_{it}. \tag{22.41}$$

The treatment effect $\phi$ can be consistently estimated by pooled OLS regression of $\Delta y_{it}$ on $\Delta D_{it}$ and a full set of time dummies.


22.6.2.  Differences  in  Differences 

Now consider specialization to only two time periods. Furthermore, suppose treatment occurs only in period 2, so that in period 1 $D_{i1} = 0$ for all individuals and in period 2 $D_{i2} = 1$ for the treated and $D_{i2} = 0$ for the nontreated. Then the subscript $t$ can be dropped from (22.41) and

$$\Delta y_i = \phi D_i + \delta + v_i, \tag{22.42}$$

where $D_i$ is a binary treatment variable indicating whether or not the individual received treatment.

The treatment effect can be estimated by OLS regression of $\Delta y$ on an intercept and the binary regressor $D$. Define $\Delta \bar{y}^{tr}$ to denote the sample average of $\Delta y_i$ for the treated ($D_i = 1$) and $\Delta \bar{y}^{nt}$ to denote the sample average of $\Delta y_i$ for the nontreated ($D_i = 0$). Then the OLS estimator reduces to

$$\widehat{\phi} = \Delta \bar{y}^{tr} - \Delta \bar{y}^{nt}. \tag{22.43}$$

This estimator is called the differences-in-differences (DID) estimator, since one estimates the time difference for the treated and untreated groups and then takes the difference in the time differences.
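That the OLS slope on a binary regressor equals the difference in group means, as stated in (22.43), can be verified numerically. The sketch below uses made-up values of $\Delta y_i$ and $D_i$ purely for illustration; the function names are not from the text.

```python
def did_estimate(delta_y, d):
    """Difference between mean Delta y for treated (d = 1) and nontreated (d = 0)."""
    treated = [dy for dy, di in zip(delta_y, d) if di == 1]
    nontreated = [dy for dy, di in zip(delta_y, d) if di == 0]
    return sum(treated) / len(treated) - sum(nontreated) / len(nontreated)


def ols_slope(delta_y, d):
    """Slope from OLS regression of Delta y on an intercept and d."""
    n = len(d)
    d_bar = sum(d) / n
    y_bar = sum(delta_y) / n
    num = sum((di - d_bar) * (dy - y_bar) for di, dy in zip(d, delta_y))
    den = sum((di - d_bar) ** 2 for di in d)
    return num / den


# Hypothetical changes in the outcome for three treated and three nontreated units.
delta_y = [3.0, 5.0, 4.0, 1.0, 2.0, 3.0]
d = [1, 1, 1, 0, 0, 0]
print(did_estimate(delta_y, d), ols_slope(delta_y, d))
```

Both functions return 2.0 for these data: the treated units changed by 4 on average and the nontreated by 2.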

The estimator is appealing for its intuitive simplicity. Additionally, it can be extended from panel data to the case where separate cross sections are available in the two periods. In the second period compute the averages $\bar{y}_2^{tr}$ and $\bar{y}_2^{nt}$ for the treated and untreated groups. Compute similar averages $\bar{y}_1^{tr}$ and $\bar{y}_1^{nt}$ in the first pretreatment period. This assumes that it is possible to identify in the first period whether or not an individual is eligible for treatment. This is easy if, for example, the treatment applies only to women and data on gender are available. Then compute

$$\widehat{\phi} = (\bar{y}_2^{tr} - \bar{y}_1^{tr}) - (\bar{y}_2^{nt} - \bar{y}_1^{nt}). \tag{22.44}$$

As an example, if average annual earnings for the group eligible for treatment equals 10,000 before treatment and 13,000 after treatment then $\bar{y}_2^{tr} - \bar{y}_1^{tr} = 3{,}000$. Similarly, if average annual earnings for the group not eligible for treatment equals 15,000 before treatment and 17,000 after treatment then $\bar{y}_2^{nt} - \bar{y}_1^{nt} = 2{,}000$. The DID estimate of the treatment effect $\phi$ is then $3{,}000 - 2{,}000 = 1{,}000$.


22.6.3. Assumptions Underlying Differences in Differences

The preceding formulation of the DID estimator makes explicit the underlying assumptions for consistent estimation of $\phi$.

First, it is assumed that the time effects $\delta_t$ are common across treated and untreated individuals. For example, time trends may differ by gender, in which case identifying $\phi$ is problematic if treatment depends on gender. The common trends assumption is needed if either panel or cross-section data are used.

Second, if cross-section data are used then the composition of the treated and untreated groups is assumed to be stable before and after the change. With panel data differencing eliminates the fixed effects $\alpha_i$. With repeated cross-section data the original model (22.40) implies that, once treatment has occurred, $\bar{y}_t^{tr} = \phi + \delta_t + \bar{\alpha}_t^{tr} + \bar{\varepsilon}_t^{tr}$ and $\bar{y}_t^{nt} = \delta_t + \bar{\alpha}_t^{nt} + \bar{\varepsilon}_t^{nt}$. Given that treatment only occurs in the second period it follows that

$$\widehat{\phi} = (\bar{y}_2^{tr} - \bar{y}_1^{tr}) - (\bar{y}_2^{nt} - \bar{y}_1^{nt}) = \phi + (\bar{\alpha}_2^{tr} - \bar{\alpha}_1^{tr}) - (\bar{\alpha}_2^{nt} - \bar{\alpha}_1^{nt}) + v,$$

where $v = (\bar{\varepsilon}_2^{tr} - \bar{\varepsilon}_1^{tr}) - (\bar{\varepsilon}_2^{nt} - \bar{\varepsilon}_1^{nt})$. Consistency of $\widehat{\phi}$ in (22.44) occurs if $\mathrm{plim}(\bar{\alpha}_2^{tr} - \bar{\alpha}_1^{tr}) = 0$ and $\mathrm{plim}(\bar{\alpha}_2^{nt} - \bar{\alpha}_1^{nt}) = 0$. This will happen if assignment to treatment is random. However, often this is not the case.


22.6.4.  Richer  Models 

In practice richer models are used. An obvious extension is to include regressors other than the treatment indicator and time dummies. By grouping data the individual-specific effects can at least be permitted to differ on average across groups. The general procedure is to estimate

$$y_{igt} = \phi D_{igt} + \delta_t + \alpha_g + \varepsilon_{igt},$$

where $g$ denotes the $g$th group.

In a classic example of DID estimation, Card (1990) studied the effect on unemployment of low-wage workers in Miami of a sudden influx of immigrants from Cuba. This example is also reviewed in Angrist and Krueger (1999). Athey and Imbens (2002) present an extension to nonlinear models.


22.7.  Repeated  Cross  Sections  and  Pseudo  Panels 

The  key  potential  advantages  of  panel  data  arise  from  being  able  to  observe  subjects 
over  time.  This  makes  it  possible  to  control  for  unobserved  individual  heterogeneity, 
differences  in  initial  conditions,  and  dynamic  dependence  of  outcomes.  In  many  cases, 
however,  genuine  panel  data  are  unavailable. 


22.7.1.  Repeated  Cross  Sections 

We consider analysis when data are for several repeated cross sections, derived from responses to a series of independent sample surveys, where independence means that each subject appears in only one survey. An example is the U.K. Family Expenditure Survey, which collects a large annual sample of household expenditure data but each year surveys different families. Also, if only a very short panel is available (e.g., $T = 2$) then data from repeated cross sections are appealing if they can generate a larger and richer sample.

For a random effects model repeated cross-section data pose no challenges. One simply performs a pooled regression of $y_{it}$ on $\mathbf{x}_{it}$ (see Section 21.5) and statistical inference is actually simplified as correction is needed only for heteroskedasticity since here errors are independent over both $i$ and $t$.

With fixed effects, however, pooled regression leads to inconsistent parameter estimates. Furthermore, alternative methods such as the within or first-differences estimation are infeasible if individuals are observed at only one point in time. In this section repeated cross-section data are used to construct pseudo panels or synthetic panel data that have some of the advantages of genuine panel data, most notably the ability to control for fixed effects. A special case is the DID estimator presented in Section 22.6.


22.7.2.  Pseudo  Panels 

Browning, Deaton, and Irish (1985) and Deaton (1985), in their empirical studies based on the U.K. Family Expenditure Survey, considered methods for analyzing repeated cross-section data. Their suggestion was to convert the individual-level data into cohort-level data. Although individual household expenditures cannot be tracked through time, it is possible to do so for cohorts of individuals.

A  cohort  is  defined  as  “a  group  with  fixed  membership,  individuals  of  which  can 
be  identified  as  they  show  up  in  the  surveys”  (Deaton,  1985,  p.  109).  An  example  is  an 
age  cohort  such  as  males  born  between  1965  and  1970.  For  large  samples,  successive 
surveys  will  generate  random  samples  of  members  of  each  cohort. 

Time  series  of  sample  averages  of  cohorts  can  form  the  basis  of  regression  models. 
Whether  synthetic  panels  based  on  cohort  data  can  substitute  for  genuine  panel  data 
is  a  key  issue.  The  topic  of  repeated  cross  section  deals  with  inference  procedures  for 
such  models.  Here  we  focus  on  static  pseudo  panel  models.  Collado  (1997)  and  Girma 
(2000)  also  consider  the  dynamic  case. 

The starting point is the static linear regression with individual fixed effects $\alpha_i$, based on $T$ successive cross sections,

$$y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \quad t = 1, \ldots, T. \tag{22.45}$$

The explanatory variables are assumed to be strongly exogenous with respect to parameters of interest, $\boldsymbol{\beta}$, so $E[\mathbf{x}_{it} u_{is}] = \mathbf{0}$, $\forall\, t, s$. For simplicity, we assume that $N$ observations are available for each cross section. Each individual is observed in only one time period, so the individual-specific effects $\alpha_i$ cannot be swept out by differencing the individual-level data.

Let $g$ be a random variable that determines cohort membership for each $i$, such that $i$ belongs to cohort $c$ if and only if $g_i$ belongs to the set $I_c$. Assume that there are $C$ cohorts, and $c$ is the cohort subscript, $c = 1, \ldots, C$. Taking expectations conditional on $g_i \in I_c$ yields

$$E[y_{it} \mid g_i \in I_c] = E[\alpha_i \mid g_i \in I_c] + E[\mathbf{x}_{it}' \mid g_i \in I_c]\boldsymbol{\beta} + E[u_{it} \mid g_i \in I_c]. \tag{22.46}$$

This generates a cohort population version of the model (22.45) given by

$$y_{ct}^* = \alpha_c^* + \mathbf{x}_{ct}^{*\prime}\boldsymbol{\beta} + u_{ct}^*, \tag{22.47}$$

where the asterisks denote unobservable population cohort averages. For example, $y_{ct}^* = E[y_{it} \mid g_i \in I_c]$.

The parameter $\alpha_c^* = E[\alpha_i \mid g_i \in I_c]$ is the cohort fixed effect. An important assumption made in the case of fixed effects is that the population is stationary so that $\alpha_c^*$ can be assumed to be constant over time. This is qualitatively similar to the assumption needed for consistency of the DID estimator made at the end of Section 22.6.3. Under the usual weak exogeneity assumptions $E[u_{ct}^* \mid \mathbf{x}_{ct}^*] = 0$. However, the unobserved fixed effect $\alpha_c^*$ will be correlated with $\mathbf{x}_{ct}^*$ if $\alpha_i$ is correlated with $\mathbf{x}_{it}$ in the original model (22.45). Estimation needs to control for the fixed effect.

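In practice the population cohort means are unobservable and we instead work with cohort-time averages $\bar{y}_{ct}$ and $\bar{\mathbf{x}}_{ct}$. The regression is then

$$\bar{y}_{ct} = \bar{\alpha}_c + \bar{\mathbf{x}}_{ct}'\boldsymbol{\beta} + \bar{u}_{ct}, \quad c = 1, \ldots, C, \quad t = 1, \ldots, T. \tag{22.48}$$

This step introduces an additional source of error, since $\bar{y}_{ct}$ and $\bar{\mathbf{x}}_{ct}$ are error-contaminated estimates of the population cohort averages, that is,

$$\begin{aligned}
\bar{y}_{ct} &= y_{ct}^* + \bar{\varepsilon}_{ct}, \\
\bar{\mathbf{x}}_{ct} &= \mathbf{x}_{ct}^* + \bar{\boldsymbol{\eta}}_{ct}.
\end{aligned} \tag{22.49}$$

If the measurement error is very small, owing to the number of observations per cohort per time period ($N_{ct}$) being very large, then $\bar{y}_{ct} \simeq y_{ct}^*$ and $\bar{\mathbf{x}}_{ct} \simeq \mathbf{x}_{ct}^*$ and the measurement error can be ignored. A consistent estimate of $\boldsymbol{\beta}$ can be obtained by within estimation of (22.48), that is, OLS regression of $(\bar{y}_{ct} - \bar{y}_c)$ on $(\bar{\mathbf{x}}_{ct} - \bar{\mathbf{x}}_c)$, where $\bar{y}_c = T^{-1} \sum_t \bar{y}_{ct}$ and $\bar{\mathbf{x}}_c = T^{-1} \sum_t \bar{\mathbf{x}}_{ct}$.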
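The large-$N_{ct}$ case can be sketched with simulated repeated cross sections in which each individual appears once. The design below is an assumption made for illustration only: a scalar regressor, ten cohorts whose fixed effect shifts both $\alpha_i$ and $x_{it}$, normal errors, and cells of 500 observations so that measurement error in the cohort means is small. All function names are hypothetical.

```python
import random


def simulate_repeated_cross_sections(n_cohorts=10, n_periods=4, n_per_cell=500,
                                     beta=1.0, seed=0):
    """Return rows (cohort, period, x, y); each simulated individual appears once."""
    rng = random.Random(seed)
    rows = []
    for c in range(n_cohorts):
        cohort_effect = -1.5 + 3.0 * c / (n_cohorts - 1)
        for t in range(n_periods):
            cell_shift = rng.gauss(0, 1)   # time variation in the cohort mean of x
            for _ in range(n_per_cell):
                alpha = cohort_effect + rng.gauss(0, 1)          # correlated with x
                x = cohort_effect + cell_shift + rng.gauss(0, 1)
                y = alpha + beta * x + rng.gauss(0, 1)
                rows.append((c, t, x, y))
    return rows


def pooled_ols(rows):
    """Slope from pooled OLS of y on an intercept and x, ignoring fixed effects."""
    n = len(rows)
    x_bar = sum(r[2] for r in rows) / n
    y_bar = sum(r[3] for r in rows) / n
    num = sum((r[2] - x_bar) * (r[3] - y_bar) for r in rows)
    den = sum((r[2] - x_bar) ** 2 for r in rows)
    return num / den


def cohort_within(rows):
    """Within estimator applied to the cohort-time averages."""
    cells = {}
    for c, t, x, y in rows:
        acc = cells.setdefault((c, t), [0.0, 0.0, 0])
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    by_cohort = {}
    for (c, _t), (sx, sy, n) in cells.items():
        by_cohort.setdefault(c, []).append((sx / n, sy / n))
    num = den = 0.0
    for means in by_cohort.values():
        x_bar = sum(m[0] for m in means) / len(means)
        y_bar = sum(m[1] for m in means) / len(means)
        for x_m, y_m in means:
            num += (x_m - x_bar) * (y_m - y_bar)
            den += (x_m - x_bar) ** 2
    return num / den
```

With the true slope set to 1, `pooled_ols` is pushed upward by the correlation between the fixed effect and the regressor, while `cohort_within` stays close to 1. Shrinking `n_per_cell` makes the measurement error in the cohort means matter.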

Unfortunately, the measurement error is often too large to ignore. Then within estimation of (22.48), or even OLS estimation of (22.48) when $\bar{\alpha}_c$ is a random effect, leads to inconsistent estimation of $\boldsymbol{\beta}$. Instead, errors-in-variables estimators need to be used. These can be implemented here since the individual-level data yield necessary estimates of the moments of the measurement error; see Section 26.3.3.


22.7.3.  Measurement  Error  Estimators  for  Pseudo  Panels 

A  classic  solution  to  measurement  errors  is  to  use  replicated  observations  to  estimate 
the  covariance  matrix  of  the  measurement  error,  and  to  then  use  these  estimates  to 
“correct”  the  sample  moments  of  the  error-contaminated  variables  before  applying 
the  least-squares  procedure  (see  Section  26.3.4).  Deaton  (1985)  proposed  using  this 
method  in  the  current  setting. 


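Assume that individual observations satisfy the equations

$y_{it} = y^*_{ct} + \varepsilon_{it},$
$x_{it} = x^*_{ct} + \eta_{it},$

a setup similar to that in Section 26.2.1, except that there is also measurement error in the dependent variable, and assume that for any individual in a given cohort $c$,

$\begin{bmatrix}\varepsilon_{it}\\ \eta_{it}\end{bmatrix} \sim \text{iid}\left[\begin{bmatrix}0\\ 0\end{bmatrix}, \begin{bmatrix}\sigma_0^2 & \sigma'_{01}\\ \sigma_{01} & \Sigma\end{bmatrix}\right].$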

Sample estimates of $(\Sigma, \sigma_{01})$, denoted $(\hat\Sigma, \hat\sigma_{01})$, can be obtained given $(\bar y_{ct}, \bar x_{ct})$ from using all individual-level data. Define $d_c$ to be the $C \times 1$ column vector of dummy variables corresponding to the fixed effects $\alpha^*_c$ (see Section 21.2.1), which is a regressor vector that is clearly not subject to estimation error. Then provided $T$ is sufficiently large and the relevant inverses exist, the regression


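$\begin{bmatrix}\hat\alpha\\ \hat\beta\end{bmatrix} = \begin{bmatrix} d'_c d_c & d'_c\bar x_{ct}\\ \bar x'_{ct} d_c & \bar x'_{ct}\bar x_{ct} - \hat\Sigma\end{bmatrix}^{-1}\begin{bmatrix} d'_c\bar y_{ct}\\ \bar x'_{ct}\bar y_{ct} - \hat\sigma_{01}\end{bmatrix}$ (22.50)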


will provide consistent estimates of the cohort regression as $CT \to \infty$. This estimator is the same as that given in Section 26.3.4, with adaptation here because $\bar y_{ct}$ is also measured with error and with simplification because only a subset of the regressors, $\bar x_{ct}$, is measured with error. Verbeek and Nijman (1992) provide a more detailed discussion of the sampling properties, and Deaton (1985) presents variance estimation. See also Verbeek (1995).

The preceding estimator essentially controls for the cohort fixed effects by estimating the least-squares dummy variable model, adjusting for measurement error by use of replicated data using the estimator given in Section 26.3.4.

Collado  (1997)  considered  an  alternative  approach  of  eliminating  the  cohort  effects 
by  first  differencing,  and  then  controlling  for  measurement  error  through  instrumental 
variables  estimation,  an  alternative  identification  strategy  for  measurement  error  given 
in  Section  26.3.2. 

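Substituting (22.49) into (22.47) gives

$\bar y_{ct} - \varepsilon_{ct} = \alpha^*_c + (\bar x_{ct} - v_{ct})'\beta + u^*_{ct}$
$\bar y_{ct} = \alpha^*_c + \bar x'_{ct}\beta + w_{ct},$

where the error $w_{ct} = u^*_{ct} - v'_{ct}\beta + \varepsilon_{ct}$. First differencing eliminates $\alpha^*_c$, leading to

$\Delta\bar y_{ct} = \Delta\bar x'_{ct}\beta + \Delta w_{ct}, \quad t = 2, \ldots, T.$ (22.51)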

Now because of the measurement error terms the explanatory variables $\Delta\bar x_{ct}$ will be correlated with $\Delta w_{ct}$, and hence applying least squares will lead to inconsistent estimation. Consistent estimates can be obtained by IV estimation based on lagged levels of exogenous variables, that is, $\bar x_{c,t-1}$. This approach has the advantage of ready extension to models with lagged dependent variables. For details see Collado (1997).

22.8. Mixed Linear Models

The model called the random effects model by econometricians specifies only the intercept coefficient to be random. Richer random effects models, widely used in other areas of applied statistics, additionally permit the slope parameters to be random. In this section we present mixed linear models, also called mixed effects models, hierarchical or multilevel linear models (see Chapter 24), random coefficients models, and variance components models.

These models are applied in a setting where the pooled OLS estimator is still consistent. In particular, there are no fixed effects. Because the mixed linear models framework provides enough structure to permit estimation by feasible GLS, its estimates are more efficient than pooled OLS.


22.8.1.  Mixed  Linear  Models 

The mixed linear model specifies

$y_{it} = z'_{it}\beta + w'_{it}\alpha_i + \varepsilon_{it},$ (22.52)

where the regressors $z_{it}$ include an intercept, $w_{it}$ is a vector of observable characteristics, $\alpha_i$ is a random zero-mean vector, and $\varepsilon_{it}$ is an error term. This model is called a mixed model as it has both fixed parameters $\beta$ and zero-mean random parameters or random effects $\alpha_i$.

The random intercept model $y_{it} = z'_{it}\beta + \alpha_i + \varepsilon_{it}$ is a special case of (22.52) with $w'_{it}\alpha_i = \alpha_i$.

Another special case of (22.52) is the random coefficients model or random parameters model. In the regression setting we suppose that

$y_{it} = z'_{it}\beta_i + \varepsilon_{it},$

a regular linear regression, except that the regression parameter vector now differs across individuals according to

$\beta_i = \beta + \alpha_i,$

where $\alpha_i$ is a zero-mean random vector. Substitution yields $y_{it} = z'_{it}\beta + z'_{it}\alpha_i + \varepsilon_{it}$, which is (22.52) with $w_{it} = z_{it}$.

Many applications lie between random intercept and random coefficients models, with $w_{it}$ often a subset of $z_{it}$. In particular, standard mixed and random ANOVA models are also a special case, where each component of the vector $w_{it}$ is either zero or one, according to various possible models for clustering the data. For example, one of the components in $z_{it}$ may be a race or gender indicator variable. Then the conditional mean of $y_{it}$ varies with gender or race. It may also be felt that the conditional variance of $y_{it}$ varies with gender or race, which can be captured by including the indicator in $w_{it}$. The mixed model is an outgrowth of ANOVA models. The hierarchical linear model or multilevel linear model (see Section 24.6.2) can also be expressed as a special case of (22.52).

22.8.2. Estimation

The goal is to estimate the fixed regression parameters $\beta$ and the variance and covariance parameters of the distributions for $\alpha_i$ and $\varepsilon_{it}$. One of the early treatments of this model was in a Bayesian context by Lindley and Smith (1972). A simple example of their general treatment was the random coefficients model with $y_{it} \sim \mathcal{N}[z'_{it}\beta_i, \sigma^2]$, where $\beta_i \sim \mathcal{N}[\gamma, \Gamma]$. See Koop (2003), for example, for Bayesian analysis of the linear panel data model.

Here we follow the classical approach, based on the work of Harville (1977), who gives references to the earlier literature. The mixed model (22.52) can be split into a deterministic component $z'_{it}\beta$ and a random component $w'_{it}\alpha_i + \varepsilon_{it}$. The stochastic assumptions include the assumption that the regressors $z_{it}$ are independent of the zero-mean random components $\alpha_i$ and $\varepsilon_{it}$. So pooled OLS regression of $y_{it}$ on $z_{it}$ provides consistent estimates of $\beta$. We are essentially in the world of Section 21.5, with feasible GLS estimation possible as structure has been placed on the variance matrix of the error term $w'_{it}\alpha_i + \varepsilon_{it}$. In this section we present the feasible GLS estimator along with two different methods to estimate the variances and covariances of $\alpha_i$ and $\varepsilon_{it}$ and consider prediction of the random components $\alpha_i$.

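Combine observations over time for a given individual in the usual way, so that (22.52) becomes

$y_i = Z_i\beta + (W_i\alpha_i + \varepsilon_i).$ (22.53)

The usual assumptions are that $\alpha_i$ and $\varepsilon_i$ are independent over $i$ and independent of each other with $\alpha_i \sim [0, \Sigma_\alpha]$ and $\varepsilon_i \sim [0, \Sigma_\varepsilon]$, so that the error term

$W_i\alpha_i + \varepsilon_i \sim [0, \Omega_i = W_i\Sigma_\alpha W'_i + \Sigma_\varepsilon].$

Then the feasible GLS estimator is

$\hat\beta_{FGLS} = \left[\sum_{i=1}^N Z'_i\hat\Omega_i^{-1}Z_i\right]^{-1}\sum_{i=1}^N Z'_i\hat\Omega_i^{-1}y_i,$ (22.54)

where $\hat\Omega_i$ is consistent for $\Omega_i$.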
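
A minimal numerical sketch of the GLS formula for the mixed model follows. It assumes a random coefficients design in which the regressors with random coefficients coincide with the regressors with fixed coefficients, and for simplicity it plugs in the true error variance matrix for each individual in place of a consistent estimate. The function name `panel_gls` and the simulated design are hypothetical.

```python
import numpy as np

def panel_gls(y_list, Z_list, Omega_list):
    """GLS: [sum_i Z_i' Omega_i^{-1} Z_i]^{-1} sum_i Z_i' Omega_i^{-1} y_i."""
    K = Z_list[0].shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for y_i, Z_i, Om_i in zip(y_list, Z_list, Omega_list):
        Om_inv_Z = np.linalg.solve(Om_i, Z_i)   # Omega_i^{-1} Z_i
        A += Z_i.T @ Om_inv_Z
        b += Om_inv_Z.T @ y_i                   # uses symmetry of Omega_i
    return np.linalg.solve(A, b)

# Hypothetical random coefficients design with w_it = z_it.
rng = np.random.default_rng(1)
N, T = 500, 5
beta = np.array([1.0, 2.0])
Sigma_a = np.array([[1.0, 0.3], [0.3, 0.5]])    # variance of the random coefficients
sigma2_e = 1.0                                  # idiosyncratic error variance

y_list, Z_list, Omega_list = [], [], []
for _ in range(N):
    Z_i = np.column_stack([np.ones(T), rng.normal(size=T)])
    a_i = rng.multivariate_normal(np.zeros(2), Sigma_a)
    y_i = Z_i @ (beta + a_i) + np.sqrt(sigma2_e) * rng.normal(size=T)
    y_list.append(y_i)
    Z_list.append(Z_i)
    Omega_list.append(Z_i @ Sigma_a @ Z_i.T + sigma2_e * np.eye(T))

beta_gls = panel_gls(y_list, Z_list, Omega_list)
print(beta_gls)
```

In an application the variance matrices would be replaced by estimates, which is the subject of the remainder of this section.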

Implementation requires consistent estimation of $\Omega_i$. This has already been discussed in Section 21.7 for the simpler case of a random intercept, in which case there were several different ways to consistently estimate the variance components $\sigma^2_\alpha$ and $\sigma^2_\varepsilon$, with complications such as bias and the possibility of negative estimates. Similar issues arise here in estimation of $\Sigma_\alpha$ and $\Sigma_\varepsilon$.

We present two estimators based on the additional assumption of a normal distribution for the random components. The presentation is for the more general model

$y = Z\beta + (W\alpha + \varepsilon),$ (22.55)

which can be obtained, for example, by appropriate stacking of (22.53). It is assumed that $\alpha \sim \mathcal{N}[0, G]$ and $\varepsilon \sim \mathcal{N}[0, R]$, where in the current application $G$ and $R$ are functions of $\Sigma_\alpha$ and $\Sigma_\varepsilon$. The feasible GLS estimator for the mixed model is

$\hat\beta_{FGLS} = [Z'\hat V^{-1}Z]^{-1}Z'\hat V^{-1}y,$

where $\hat V$ is consistent for $V = V[W\alpha + \varepsilon] = WGW' + R$. See Swamy (1970).

The obvious method for obtaining $\hat V$ is maximum likelihood. The log-likelihood function based on the multivariate normal, after concentrating out $\beta$, which is equal to the GLS estimator $[Z'V^{-1}Z]^{-1}Z'V^{-1}y$, is

$\ln L(G, R) = -\frac{1}{2}\ln|V| - \frac{NT}{2}\ln r'V^{-1}r - \frac{NT}{2}\left[1 + \ln\frac{2\pi}{NT}\right],$

where $r = y - Z[Z'V^{-1}Z]^{-1}Z'V^{-1}y$ and $|V|$ denotes the determinant of $V$. Maximization with respect to the parameters in $G$ and $R$ yields $\hat V = W\hat GW' + \hat R$.

A weakness of ML estimates of variance components is that they are biased in small samples. For example, for cross-section linear regression with homoskedastic errors the MLE $\hat\sigma^2 = N^{-1}\sum_i \hat u_i^2$ is biased and it is better to instead divide by $(N - K)$. For the model (22.53), degrees-of-freedom corrections are provided by the restricted maximum likelihood estimator that instead maximizes

$\ln L_R(G, R) = -\frac{1}{2}\ln|V| - \frac{1}{2}\ln|Z'V^{-1}Z| - \frac{NT - p}{2}\ln r'V^{-1}r - \frac{NT - p}{2}\left[1 + \ln\frac{2\pi}{NT - p}\right],$

where $p$ is the rank of $Z$. For motivation of $\ln L_R(G, R)$ see Harville (1977).

As an empirical example of a mixed linear model, consider the ln(hours)-ln(wage) regression example of Section 21.3 with both the intercept and slope parameters permitted to be random. Then the random coefficients model yields lnhrs = 7.734 - 0.021 lnwg with slope coefficient standard error of 0.046 (default) or 0.020 (panel bootstrap). The slope coefficient is quite different from the estimates given in Table 21.2.


22.8.3.  Prediction 

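We may wish to predict the random parameters $\alpha$ in addition to estimating the fixed parameters $\beta$ and the covariance parameters.

The joint normal equations for $\beta$ and $\alpha$, given consistent estimates $\hat G$ and $\hat R$ of $G$ and $R$, can be written as

$\begin{bmatrix} Z'\hat R^{-1}Z & Z'\hat R^{-1}W\\ W'\hat R^{-1}Z & W'\hat R^{-1}W + \hat G^{-1}\end{bmatrix}\begin{bmatrix}\hat\beta\\ \hat\alpha\end{bmatrix} = \begin{bmatrix} Z'\hat R^{-1}y\\ W'\hat R^{-1}y\end{bmatrix}.$

Solving for $\hat\beta$ gives $\hat\beta_{FGLS}$ given earlier, whereas

$\hat\alpha = \hat GW'\hat V^{-1}(y - Z\hat\beta).$

In the case of independence over $i$, this yields $\hat\alpha_i = \hat\Sigma_\alpha W'_i\hat\Omega_i^{-1}(y_i - Z_i\hat\beta)$. This is the best linear unbiased predictor if the variance matrices are known.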
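
The equivalence between solving the joint normal equations and computing the GLS estimator followed by the predictor of the random effects can be checked numerically. The sketch below does this for arbitrary simulated matrices, treating the variance matrices as known; the function names and the simulated inputs are illustrative assumptions.

```python
import numpy as np

def mixed_model_equations(y, Z, W, G, R):
    """Solve the joint normal equations for (beta, alpha)."""
    R_inv = np.linalg.inv(R)
    G_inv = np.linalg.inv(G)
    lhs = np.block([[Z.T @ R_inv @ Z, Z.T @ R_inv @ W],
                    [W.T @ R_inv @ Z, W.T @ R_inv @ W + G_inv]])
    rhs = np.concatenate([Z.T @ R_inv @ y, W.T @ R_inv @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = Z.shape[1]
    return sol[:p], sol[p:]

def gls_then_predict(y, Z, W, G, R):
    """GLS for beta with V = W G W' + R, then alpha = G W' V^{-1} (y - Z beta)."""
    V = W @ G @ W.T + R
    V_inv = np.linalg.inv(V)
    beta = np.linalg.solve(Z.T @ V_inv @ Z, Z.T @ V_inv @ y)
    alpha = G @ W.T @ V_inv @ (y - Z @ beta)
    return beta, alpha

# Arbitrary simulated inputs (hypothetical dimensions and values).
rng = np.random.default_rng(2)
n, q = 30, 3
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
W = rng.normal(size=(n, q))
A = rng.normal(size=(q, q))
G = A @ A.T + np.eye(q)                        # positive definite
R = np.diag(rng.uniform(0.5, 2.0, size=n))     # positive definite, diagonal
y = Z @ np.array([1.0, -1.0]) + W @ rng.normal(size=q) + rng.normal(size=n)

beta_mme, alpha_mme = mixed_model_equations(y, Z, W, G, R)
beta_gls, alpha_blup = gls_then_predict(y, Z, W, G, R)
print(np.allclose(beta_mme, beta_gls), np.allclose(alpha_mme, alpha_blup))
```

The joint system only requires inverses of $R$ and $G$, which are typically much simpler to invert than $V$; this is one practical reason for working with the joint equations.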


22.9.  Practical  Considerations 

The panel 2SLS estimators can actually be computed using just a 2SLS program for cross-section data (see Section 22.2.5), though computed standard errors need to be panel robust. Optimal GMM estimators can be implemented using matrix commands in a statistical package or in a programming language such as GAUSS. Statistical packages are increasingly offering panel commands that automatically implement the estimators of this chapter, most notably the Arellano-Bond estimator.

22.10.  Bibliographic  Notes 

This chapter covers an active area of research that appears in several recent texts devoted to panel data, notably those by Baltagi (1995, 2001), Hsiao (1986, 2003), M-J. Lee (2002), and Arellano (2003). More advanced methods are given in Matyas and Sevestre (1995) and in Arellano and Honore (2001).

22.2 Chamberlain (1982, 1984) emphasized the use of exogeneity assumptions. He used minimum distance estimation. The subsequent literature has used GMM methods. M-J. Lee (2002) and Arellano (2003) especially emphasize GMM estimation. See also the survey by Ahn and Schmidt (1999).

22.4 The model of Hausman and Taylor (1981) is attractive. By assuming that some regressors are uncorrelated with the individual-specific effect it permits identification of the coefficients of time-invariant regressors.

22.5 The coverage of linear dynamic models is very brief compared to the size of the literature that began with Balestra and Nerlove (1966). More complete discussions are given in Baltagi (2001, Chapter 8), Hsiao (2003, Chapter 4), and Arellano (2003, Chapters 5-8). The Arellano-Bond (1991) estimator is especially popular as it accommodates dynamic models with fixed effects.

22.6 The difference-in-differences approach is very popular because of its simplicity. Although it can be used with repeated cross-section rather than panel data, a panel data interpretation helps make explicit the underlying assumptions. Bertrand et al. (2004) demonstrate the importance of correcting for time series correlation at the individual level using the methods of Section 22.2.3.

22.8 Mixed linear models are especially popular in the statistics literature. They are less used in the econometrics literature, because of the reluctance to impose structure on the time-invariant individual-specific fixed effect.


- Exercises - 

22-1 Consider the panel GMM estimator of Section 22.2.1.

(a) Show that minimization with respect to $\beta$ of the quadratic function $Q_N(\beta)$ given after (22.3) yields the panel GMM estimator given after $Q_N(\beta)$ that is expressed using summation notation.

(b) Show that this estimator is equivalent to the estimator defined in (22.4).

(c) For simplicity suppose that the matrices $Z$ and $X$ in (22.4) are nonstochastic and that $y = X\beta + u$ where $u$ has mean $0$ and variance $\Omega$. Obtain the finite sample variance matrix of the estimator in (22.4) and compare this to the asymptotic results in (22.5).

(d) Simplify the panel GMM estimator in the case that $r = K$.

22-2 Consider the panel data model $y_{it} = \alpha + \beta x_{it} + \gamma w_{it} + u_{it}$, $i = 1, \ldots, N$, $t = 1, \ldots, T$, where for simplicity there is no individual-specific effect. Suppose the scalar regressor $x_{it}$ is correlated with $u_{is}$ for all $t$ and $s$. For each of the following parts state whether consistent IV estimation of $\beta$ and $\gamma$ is possible, and if so give all the suitable instruments, based on the discussion in Section 22.2. Assume that three periods of data are available, so $T = 3$, and note that a variable may not be available as an instrument in all years, and that in different years different instruments may be available.

(a) The regressor $w_{it}$ satisfies the summation assumption $E[\sum_t w_{it}u_{it}] = 0$.

(b) The regressor $w_{it}$ satisfies the contemporaneous exogeneity assumption $E[w_{it}u_{it}] = 0$, $t = 1, \ldots, 3$.

(c) The regressor $w_{it}$ satisfies the weak exogeneity assumption $E[w_{is}u_{it}] = 0$, $s \le t$, $t = 1, \ldots, 3$.

(d) The regressor $w_{it}$ satisfies the strong exogeneity assumption $E[w_{is}u_{it}] = 0$, $s, t = 1, \ldots, 3$.

22-3 Repeat Exercise 22-2, again with three periods of data, but now consider the panel data model $y_{it} = \alpha_i + \beta x_{it} + \gamma w_{it} + u_{it}$, where $\alpha_i$ is a fixed effect, and consider IV estimation based on the first differences model, $y_{it} - y_{i,t-1} = \beta(x_{it} - x_{i,t-1}) + \gamma(w_{it} - w_{i,t-1}) + (u_{it} - u_{i,t-1})$.

22-4 Consider the differences in differences (DID) estimator presented in Section 22.6. Suppose the time trend term $(\delta_t - \delta_{t-1})$ differs across the treated and untreated groups.

(a) Will the DID estimator of the treatment effect based on repeated cross-section data be consistent? Explain your answer.

(b) Is consistent estimation of the treatment effect possible if panel data are available? Explain your answer.

22-5 Using the hours and wages data of Ziliak (1997) reproduce as much of Table 22.2 as you can, with appropriate discussion, when the instrument set is expanded to include the third lags of lnwg, kids, age, agesq, and disab and the seven years 1982-88 are used to estimate (22.22).

Nonlinear Panel Models


23.1.  Introduction 

This  chapter  extends  the  linear  model  panel  data  methods  of  Chapters  21  and  22  to  the 
nonlinear  regression  models  presented  in  Chapters  14-20.  We  focus  on  short  panels 
and  models  with  a  time-invariant  individual-specific  effect  that  may  be  fixed  or  may 
be  random.  Both  static  and  dynamic  models  are  considered. 

There is no one-size-fits-all prescription for nonlinear models with individual-specific effects. If individual-specific effects are fixed and the panel is short then consistent estimation of the slope parameters is possible for only a subset of nonlinear models. If individual-specific effects are instead purely random then consistent estimation is possible for a wide range of models.

Section 23.2 presents general approaches that may or may not be implementable for particular models. Section 23.3 provides an application to a nonlinear model with multiplicative individual-specific effects. Specializations to the leading classes of nonlinear models (discrete data, selection models, transition data, and count data) are presented in Sections 23.4-23.7. Semiparametric estimation is surveyed in Section 23.8.


23.2.  General  Results 

General  approaches  to  extending  the  methods  for  linear  models  are  presented  in  this 
section.  We  first  present  the  various  models  -  fixed  effects,  random  effects,  and  pooled 
models,  distinguishing  parametric  from  conditional  mean  models.  Methods  to  estimate 
these  models  and  obtain  panel-robust  standard  errors  are  then  presented.  Further  details 
for  specific  nonlinear  panel  models  are  provided  in  subsequent  sections. 

23.2.1. Individual-Specific Effects Models

The linear individual-specific effects model (see Section 21.2.1) specifies that the dependent variable $y_{it}$ depends on a time-invariant individual-specific effect $\alpha_i$, as well as the usual regressors $x_{it}$ and regression parameters $\beta$. The model is written as $y_{it} = \alpha_i + x'_{it}\beta + u_{it}$, where $u_{it}$ is an error term.

For nonlinear models such as logit and Poisson models there is less motivation for introducing an additive error $u_{it}$. Instead, it is more natural to directly model the conditional density, or the conditional mean, which in the linear case can be expressed as $E[y_{it}|\alpha_i, x_{it}] = \alpha_i + x'_{it}\beta$.


Parametric  Models 

A  fully  parametric  approach  is  common  for  many  nonlinear  models,  most  notably 
models  for  binary,  multinomial,  and  censored  outcomes  given  in  Chapters  14-16. 

The standard cross-section models are single-index models, or single-index models with additional scale parameter(s). The parametric individual-specific effects models presented in subsequent sections specify conditional density

$f(y_{it}|\alpha_i, x_{it}) = f(y_{it}, \alpha_i + x'_{it}\beta, \gamma),$ (23.1)

where $\gamma$ denotes additional parameters such as variance parameters. The model is a single-index model in the regressors $x_{it}$ and the individual effects $\alpha_i$.

The usual assumption is that $y_{it}|x_{it}, \alpha_i$ is independent over both $i$ and $t$. This can be relaxed to permit dependence over $t$ for given $i$ (see Section 23.2.6).


Conditional  Mean  Models 

A quite general nonlinear model for the conditional mean, with unobserved time-invariant individual-specific effect, is

$E[y_{it}|\alpha_i, x_{it}] = g(\alpha_i, x_{it}, \beta), \quad i = 1, \ldots, N, \quad t = 1, \ldots, T,$ (23.2)

for given function $g(\cdot)$. Three common specifications are an additive individual-specific effects model

$g(\alpha_i, x_{it}, \beta) = \alpha_i + g(x_{it}, \beta),$ (23.3)

a multiplicative individual-specific effects model,

$g(\alpha_i, x_{it}, \beta) = \alpha_i g(x_{it}, \beta),$ (23.4)

and a single-index individual-specific effects model

$g(\alpha_i, x_{it}, \beta) = g(\alpha_i + x'_{it}\beta).$ (23.5)

In each case the function $g(\cdot)$ is specified. The regressors $x_{it}$ may be time varying or time-invariant and may include a time dummy.

The additive effects model is suited to applications where the range of $y_{it}$ is unbounded, as implicitly assumed with linear regression. The multiplicative effects model is suited to applications where $y_{it}$ is nonnegative unbounded, such as count data, in which case $\alpha_i > 0$ and $g(\cdot) > 0$. The single-index model is a natural starting point for the probit model, for example, with $g(\alpha_i + x'_{it}\beta) = \Phi(\alpha_i + x'_{it}\beta)$, where $\Phi(\cdot)$ is the standard normal cdf. The single-index model reduces to the additive model if $g(\cdot)$ is the identity function. It reduces to the multiplicative model if $g(\cdot)$ is the exponential function, since then $\exp(\alpha_i + x'_{it}\beta) = \exp(\alpha_i)\exp(x'_{it}\beta)$.

The moment condition (23.2) conditions only on current period $x_{it}$ and assumes that regressors are contemporaneously exogenous (see Section 22.2.4). Elimination of the individual-specific effects $\alpha_i$ can require stronger exogeneity assumptions. Regressors are weakly exogenous if

$E[y_{it}|\alpha_i, x_{i1}, \ldots, x_{it}] = g(\alpha_i, x_{it}, \beta)$ (23.6)

and strongly exogenous or strictly exogenous if

$E[y_{it}|\alpha_i, x_{i1}, \ldots, x_{iT}] = g(\alpha_i, x_{it}, \beta).$ (23.7)

A nonlinear model with additive effects adds relatively few complications. In particular, if the panel model is $y_{it} = \alpha_i + g(x_{it}, \beta) + u_{it}$, then the approaches of Chapters 21 and 22 will carry through with some modification, including estimation by nonlinear LS and IV rather than linear LS and IV.

This  chapter  focuses  on  models  with  nonadditive  individual-specific  effects,  such  as 
in  (23.4)  and  (23.5).  These  effects  can  be  treated  as  fixed  effects  or  as  random  effects. 

23.2.2.  Fixed  Effects  Models 

A fixed effects model treats the individual-specific effect $\alpha_i$ as an unobserved random variable that may be correlated with the regressors $x_{it}$. In short panels joint estimation of the fixed effects $\alpha_1, \ldots, \alpha_N$ and the other model parameters, $\beta$ and possibly $\gamma$, generally leads to inconsistent estimation of all parameters. Instead, a variety of methods have been proposed that eliminate the fixed effects in some special settings, permitting consistent estimation of the other model parameters.


The  Incidental  Parameters  Problem 

Neyman  and  Scott  (1948)  considered  inference  when  some  parameters  are  common 
to  all  observations  but  there  are  additionally  an  infinity  of  parameters,  each  of  which 
depends  on  only  a  finite  number  of  observations.  The  common  parameters  are  of 
intrinsic  interest,  whereas  the  latter  parameters  are  called  incidental  parameters. 

Here $\beta$ and $\gamma$ are common parameters, but $\alpha_1, \ldots, \alpha_N$ are incidental parameters if the panel is short, as then each $\alpha_i$ depends on fixed $T$ observations and there are infinitely many $\alpha_i$ since $N \to \infty$. The incidental parameters are inconsistently estimated as $N \to \infty$, since only $T$ observations are used to estimate each parameter. The incidental parameters problem is that this contaminates the estimation of the common parameters. In general the common parameters are also inconsistently estimated, even though they are finite in number and are estimated using $NT \to \infty$ observations.

A simple illustration of contamination by incidental parameters is to suppose that $y_{it} \sim \mathcal{N}[\alpha_i, \sigma^2]$. Maximum likelihood estimation yields $\hat\alpha_i = \bar y_i$, $i = 1, \ldots, N$, and $\hat\sigma^2 = (NT)^{-1}\sum_i\sum_t (y_{it} - \bar y_i)^2$. Then $E[\hat\sigma^2] = \sigma^2(T - 1)/T$, so $\hat\sigma^2$ is inconsistent for $\sigma^2$ as $N \to \infty$ in the short panel setting of fixed $T$. This inconsistency can be very large, with $\hat\sigma^2 \to 0.5\sigma^2$ when $T = 2$.
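
The variance example can be reproduced by simulation. The following sketch, with hypothetical function name and arbitrarily chosen sample sizes, computes the ML variance estimate for a normal model with individual-specific means and compares it with the true variance scaled by $(T - 1)/T$.

```python
import numpy as np

def mle_variance_with_fixed_effects(N, T, sigma2=1.0, seed=0):
    """ML estimate of sigma^2 when each individual mean alpha_i is also estimated."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(N, 1))                 # arbitrary individual effects
    y = alpha + np.sqrt(sigma2) * rng.normal(size=(N, T))
    ybar = y.mean(axis=1, keepdims=True)            # MLE of each alpha_i
    return ((y - ybar) ** 2).sum() / (N * T)

# Large N, small T: the estimate settles near sigma^2 (T - 1) / T, not sigma^2.
s2_T2 = mle_variance_with_fixed_effects(N=50_000, T=2)
s2_T10 = mle_variance_with_fixed_effects(N=50_000, T=10)
print(s2_T2, s2_T10)
```

Increasing $N$ does not help; only a larger $T$ moves the estimate toward the true variance.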

In general if there is an incidental parameters problem, alternative estimation methods are needed that first eliminate the incidental parameters. For some popular models, most notably the panel probit model, there is no solution to the incidental parameters problem. Even where methods exist to consistently estimate $\beta$ these methods tend to be model specific, as emphasized by Lancaster (2000). No unified solution to the incidental parameters problem exists.


Conditional  Likelihood 

A statistic $t$ is called sufficient for a parameter $\theta$ if the distribution of the sample given $t$ does not depend on $\theta$. For individual-specific effects panel models, if a sufficient statistic exists for the nuisance parameter $\alpha_i$ then by conditioning on this sufficient statistic the nuisance parameter is eliminated. The resulting conditional density depends only on the common parameters, permitting consistent estimation.

Let $y_i = [y_{i1}, \ldots, y_{iT}]'$ be a $T \times 1$ vector dependent variable for individual $i$ over all $T$ time periods, and let $X_i = [x_{i1}, \ldots, x_{iT}]'$ denote the corresponding $T \times K$ matrix of regressors. For a static model $y_i$ has density

$f(y_i|X_i, \alpha_i, \beta, \gamma) = \prod_{t=1}^T f(y_{it}|x_{it}, \alpha_i, \beta, \gamma).$ (23.8)

Maximum likelihood estimation based on this density generally leads to inconsistent estimation of $\beta$ in short panels owing to the incidental parameters problem.

Suppose there exists a sufficient statistic $s_i$ for $\alpha_i$. Then conditioning on the sufficient statistic $s_i$, in addition to the usual conditioning on regressors, leads to conditional density

$f(y_i|X_i, \alpha_i, \beta, \gamma, s_i) = f(y_i|X_i, \beta, \gamma, s_i),$ (23.9)

so that $\alpha_i$ has dropped out. For example, for the linear regression model under normality $s_i = \bar y_i$ (see Section 21.6.3). Then the conditional MLE maximizes the conditional log-likelihood

$\ln L_{COND}(\beta, \gamma) = \sum_{i=1}^N \ln f(y_i|X_i, \beta, \gamma, s_i).$ (23.10)

The adjective conditional is added here to indicate conditioning on $s_i$ and not just $X_i$.

Andersen (1970) provided a detailed analysis of the conditional MLE. He showed that the conditional MLE is consistent if the density $f(y_i|X_i, \alpha_i, \beta)$ is correctly specified and that the information matrix equality holds for the conditional log-likelihood, but that in general there is a loss of efficiency as the conditional MLE need not attain the Cramer-Rao lower bound. For normal and Poisson distributions, however, there is no efficiency loss.

The approach requires that a suitable sufficient statistic exists. This is the case for only a few models, essentially those of the linear exponential family. Andersen focused on models without regressors and gave as examples the normal, Poisson, binomial, and gamma. Once regressors are introduced it becomes even more difficult to find a suitable sufficient statistic. McCullagh and Nelder (1989) provide a quite general discussion and Diggle et al. (2002) restrict their attention to specialized GLMs with canonical link functions.

The leading examples when a sufficient statistic is available are linear models under normality (see Section 21.6.2), logit models (though not probit models) for binary data (see Section 23.4.3), one-parameter gamma (including exponential), and particular parameterizations of the Poisson and negative binomial models for count data (see Section 23.7.3).


Mean-Differenced  Transformation 

For some models of the conditional mean with additive or multiplicative effects, the individual effects $\alpha_i$ can instead be eliminated by use of an appropriate differencing transformation. This leads to moment conditions that can be used for method of moments or GMM estimation as detailed in Section 23.2.6.

The mean-differenced transformation generalizes the within transformation for the linear model given in Section 21.2.2 that eliminates $\alpha_i$ by subtracting individual-specific means. It requires strongly exogenous regressors; see (23.7).

For the additive effects model defined in (23.3) with strongly exogenous regressors

$E[(y_{it} - \bar y_i) - (g(x_{it}, \beta) - \bar g_i(\beta))|x_{i1}, \ldots, x_{iT}] = 0,$ (23.11)

where $\bar g_i(\beta) = T^{-1}\sum_{t=1}^T g(x_{it}, \beta)$ and the result uses $E[\bar y_i|x_{i1}, \ldots, x_{iT}] = \alpha_i + \bar g_i(\beta)$. For linear models (23.11) simplifies considerably as then $g(x_{it}, \beta) - \bar g_i(\beta) = (x_{it} - \bar x_i)'\beta$.

For the multiplicative effects model defined in (23.4), some algebra leads to

$E\left[\left(y_{it} - \frac{g(x_{it}, \beta)}{\bar g_i(\beta)}\,\bar y_i\right)\Big|x_{i1}, \ldots, x_{iT}\right] = 0,$ (23.12)

using $E[\bar y_i|x_{i1}, \ldots, x_{iT}] = \alpha_i\bar g_i(\beta)$. For simplicity we call this a mean-differenced transformation, though strictly speaking it is a quasi-difference. It is also called a (conditional) mean-scaling transformation, as equivalently

$E\left[\left(y_{it} - \frac{\bar y_i}{\bar g_i(\beta)}\,g(x_{it}, \beta)\right)\Big|x_{i1}, \ldots, x_{iT}\right] = 0.$
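
The mean-scaling moment condition for multiplicative effects can be checked on simulated data. The sketch below assumes an exponential conditional mean and Poisson outcomes with an individual effect that is correlated with the regressor; these choices, the function name `mean_scaled_residuals`, and all parameter values are assumptions made for illustration. The sample analog of the moment, interacted with the regressor, should be close to zero at the true parameter and not at a wrong value.

```python
import numpy as np

def mean_scaled_residuals(y, x, beta):
    """y_it - (g_it / gbar_i) * ybar_i with g_it = exp(x_it' beta).

    y has shape (N, T); x has shape (N, T, K); beta has shape (K,).
    """
    g = np.exp(x @ beta)                        # (N, T)
    gbar = g.mean(axis=1, keepdims=True)
    ybar = y.mean(axis=1, keepdims=True)
    return y - (g / gbar) * ybar

# Hypothetical design: alpha_i = exp(a_i) is correlated with the regressor.
rng = np.random.default_rng(3)
N, T = 20_000, 4
beta_true = np.array([0.5])
a = 0.5 * rng.normal(size=(N, 1))
x = (a + rng.normal(size=(N, T)))[..., None]    # shape (N, T, 1)
lam = np.exp(a) * np.exp(x @ beta_true)         # alpha_i * g(x_it, beta)
y = rng.poisson(lam)

def sample_moment(beta):
    # Sample average of x_it times the mean-scaled residual.
    return (x[..., 0] * mean_scaled_residuals(y, x, beta)).mean()

m_true = sample_moment(beta_true)
m_wrong = sample_moment(np.array([0.0]))
print(m_true, m_wrong)
```

A method of moments estimator would choose the parameter that sets this sample moment to zero.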


First-Differences  Transformation 

The first-differences transformation generalizes the first-difference transformation for the linear model given in Section 21.2.2 that eliminates $\alpha_i$ by subtracting the model lagged one period. We assume regressors are weakly exogenous (see (23.6)).

For the additive effects model,

$E[(y_{it} - y_{i,t-1}) - (g(x_{it}, \beta) - g(x_{i,t-1}, \beta))|x_{i1}, \ldots, x_{i,t-1}] = 0,$ (23.13)

where we have used $E[y_{i,t-1}|x_{i1}, \ldots, x_{i,t-1}] = \alpha_i + g(x_{i,t-1}, \beta)$.

For the multiplicative effects model defined in (23.4),

$E\left[\left(y_{it} - \frac{g(x_{it}, \beta)}{g(x_{i,t-1}, \beta)}\,y_{i,t-1}\right)\Big|x_{i1}, \ldots, x_{i,t-1}\right] = 0,$ (23.14)

where we have used $E[y_{i,t-1}|x_{i1}, \ldots, x_{i,t-1}] = \alpha_i g(x_{i,t-1}, \beta)$. For simplicity we call it a first-differences transformation, though strictly speaking it is a quasi-difference.

The first-differences transformation relies on weaker assumptions, conditioning only up to period t. It permits estimation of dynamic models, extending Section 22.5 to nonlinear models. For dynamic multiplicative effects Wooldridge (1997) and Chamberlain (1992) actually proposed use of a variant of (23.14),

$$\mathrm{E}\left[\frac{g(\mathbf{x}_{i,t-1}'\boldsymbol{\beta})}{g(\mathbf{x}_{it}'\boldsymbol{\beta})}\,y_{it} - y_{i,t-1} \,\Big|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1}\right] = 0. \tag{23.15}$$


Dummy  Variable  Model  Estimation 

If the incidental parameters problem is ignored, one can attempt to estimate all parameters, including the individual-specific effects. Introduce a set of N dummy variables $d_{j,it}$, equal to 1 if $i = j$ and equal to 0 otherwise, and then jointly estimate the individual-specific parameters $\alpha_1, \ldots, \alpha_N$ along with the other model parameters.

This estimator is computationally feasible, despite the very large number of parameters owing to large N, but the resulting estimates of $\boldsymbol{\beta}$ and possibly $\boldsymbol{\gamma}$ are in general inconsistent. Here we consider just parametric models, though similar points hold for conditional mean models.

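Thus consider the parametric individual-specific effects model defined in (23.1). Then the method is to obtain ML estimates of $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, and $\boldsymbol{\alpha} = [\alpha_1 \ldots \alpha_N]'$ that maximize the full log-likelihood

$$\ln L_{\mathrm{FE}}(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\alpha}) = \sum_{i=1}^{N}\sum_{t=1}^{T} \ln f(y_{it}, \mathbf{d}_{it}'\boldsymbol{\alpha} + \mathbf{x}_{it}'\boldsymbol{\beta}, \boldsymbol{\gamma}), \tag{23.16}$$

where $\mathbf{d}_{it} = [d_{1,it} \ldots d_{N,it}]'$. The first-order conditions with respect to $\boldsymbol{\theta} = [\boldsymbol{\beta}' \; \boldsymbol{\gamma}']'$ and $\boldsymbol{\alpha}$ are

$$\sum_{i=1}^{N}\sum_{t=1}^{T} \partial \ln f(y_{it}, \mathbf{d}_{it}'\boldsymbol{\alpha} + \mathbf{x}_{it}'\boldsymbol{\beta}, \boldsymbol{\gamma})/\partial\boldsymbol{\theta} = \mathbf{0},$$

$$\sum_{t=1}^{T} \partial \ln f(y_{it}, \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}, \boldsymbol{\gamma})/\partial\alpha_i = 0, \quad i = 1, \ldots, N.$$

This estimator can be simple to compute despite the large number of parameters, N plus the dimension of $\boldsymbol{\theta}$. As detailed in Greene (2004b), the inverse of the Hessian is easily obtained by partitioning into $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$ and applying the standard partitioned inverse formula, using the simplification that $\partial^2 \ln L(\boldsymbol{\theta}, \boldsymbol{\alpha})/\partial\alpha_i\partial\alpha_j = 0$ for $j \neq i$ so that the inverse of the large N x N block corresponding to $(\boldsymbol{\alpha}, \boldsymbol{\alpha})$ is trivially obtained.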
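The computational point can be illustrated with a small numerical sketch. It is not Greene's implementation; the function name `theta_block_of_inverse` and the simulated Hessian blocks are hypothetical, and the only property used is that the $(\boldsymbol{\alpha}, \boldsymbol{\alpha})$ block is diagonal, so the partitioned inverse needs only a q x q matrix inversion.

```python
import numpy as np

def theta_block_of_inverse(H_tt, H_ta, h_aa):
    """Return the (theta, theta) block of the inverse of the full Hessian.

    H_tt : (q, q) block for theta
    H_ta : (q, N) cross block between theta and alpha
    h_aa : (N,) diagonal of the (alpha, alpha) block, which is diagonal
           because the cross-derivatives in alpha_i, alpha_j vanish for j != i
    """
    # Inverting the diagonal N x N block is elementwise division.
    scaled = H_ta / h_aa              # equals H_ta @ diag(1 / h_aa)
    schur = H_tt - scaled @ H_ta.T    # q x q Schur complement
    return np.linalg.inv(schur)

# Hypothetical, simulated Hessian blocks (negative definite, as at a maximum).
rng = np.random.default_rng(0)
q, N = 2, 500
H_ta = rng.normal(size=(q, N))
h_aa = -(1.0 + rng.uniform(size=N))
H_tt = -(H_ta @ H_ta.T) - 10.0 * np.eye(q)

fast = theta_block_of_inverse(H_tt, H_ta, h_aa)

# Brute-force comparison: invert the full (q + N) x (q + N) matrix.
full = np.block([[H_tt, H_ta], [H_ta.T, np.diag(h_aa)]])
slow = np.linalg.inv(full)[:q, :q]
```

The shortcut never forms the N x N block, which is what keeps the dummy variable estimator feasible when N is large.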

In two special cases there is no incidental parameters problem. First, if $y_{it} \sim \mathcal{N}[\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}, \sigma^2]$ then, from Section 21.6.4, the MLE for $\boldsymbol{\beta}$ is the within estimator, which is consistent for $\boldsymbol{\beta}$ even for finite T. Here the incidental parameters problem arises for estimation of $\sigma^2$ but not for $\boldsymbol{\beta}$. Second, for $y_{it} \sim \mathcal{P}[\exp(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})]$ there is similarly no incidental parameters problem in estimating $\boldsymbol{\beta}$ (see Section 23.7.3).

In general, however, there is an incidental parameters problem. The derivative with respect to $\alpha_i$ involves only T observations, rather than all NT observations. This usually spills over to inconsistency of $\hat{\boldsymbol{\beta}}_{\mathrm{ML}}$ and $\hat{\boldsymbol{\gamma}}_{\mathrm{ML}}$ in short panels. It is possible that this inconsistency is moderate in panels that are not too short, such as T = 10 or T = 20. The simulation study of Greene (2004a) indicates that the nature and extent of bias vary considerably with the particular nonlinear model being studied. The development of methods that are reasonably robust in the presence of fixed effects, though still inconsistent in short panels, is an active area of research.
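The first special case can be checked by simulation. The following is a minimal sketch under assumed values (N = 20000, T = 2, $\beta = 1$, $\sigma^2 = 4$, all hypothetical): the within estimator recovers $\beta$, whereas the dummy variable MLE of $\sigma^2$, which divides by NT, settles near $\sigma^2 (T-1)/T$ rather than $\sigma^2$. The rescaling by $T/(T-1)$ in the last line is the usual degrees-of-freedom correction, added here for comparison.

```python
import numpy as np

rng = np.random.default_rng(12345)
N, T = 20000, 2                  # many individuals, very short panel
beta, sigma2 = 1.0, 4.0          # assumed true values

alpha = rng.normal(size=(N, 1))              # individual-specific effects
x = alpha + rng.normal(size=(N, T))          # regressor correlated with alpha_i
y = alpha + beta * x + np.sqrt(sigma2) * rng.normal(size=(N, T))

# Within transformation: subtract individual-specific means.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_hat = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

# MLE of sigma^2 divides by NT even though N means have been estimated.
resid = y_dm - beta_hat * x_dm
sigma2_mle = (resid ** 2).sum() / (N * T)
sigma2_adj = sigma2_mle * T / (T - 1)        # degrees-of-freedom correction
```

With T = 2 the uncorrected variance estimate is roughly half the true value no matter how large N is, which is the incidental parameters problem in its simplest form.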


23.2.3.  Random  Effects  Models 

A random effects model treats the individual-specific effect $\alpha_i$ as a random variable with specified distribution and eliminates $\alpha_i$ by integrating over this distribution. Random effects are usually applied to parametric models.


Parametric  Models 


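Suppose the ith observation $\mathbf{y}_i$ has conditional joint density $f(\mathbf{y}_i \mid \mathbf{X}_i, \alpha_i, \boldsymbol{\beta}, \boldsymbol{\gamma})$ given in (23.8), and the random effect has density

$$\alpha_i \sim g(\alpha_i \mid \boldsymbol{\eta}), \tag{23.17}$$

where $g(\alpha_i \mid \boldsymbol{\eta})$ does not depend on observables. Then the unconditional joint density for the ith observation is

$$f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\eta}) = \int \left[\prod_{t=1}^{T} f(y_{it} \mid \mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta}, \boldsymbol{\gamma})\right] g(\alpha_i \mid \boldsymbol{\eta})\, d\alpha_i, \tag{23.18}$$

where by unconditional we mean we no longer condition on $\alpha_i$. The random effects MLE of $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, and $\boldsymbol{\eta}$ maximizes the log-likelihood

$$\ln L_{\mathrm{RE}}(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\eta}) = \sum_{i=1}^{N} \ln\left(\int \left[\prod_{t=1}^{T} f(y_{it} \mid \mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta}, \boldsymbol{\gamma})\right] g(\alpha_i \mid \boldsymbol{\eta})\, d\alpha_i\right). \tag{23.19}$$

In some cases an analytical expression for this integral is possible, basically if $\prod_t f(y_{it} \mid \alpha_i)$ and $g(\alpha_i)$ are conjugate pairs (see Table 13.2). Examples include normal-normal for linear regression, which yields normal, and Poisson-gamma for count data regression, which yields negative binomial.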

In most cases analytical results are not available, but numerical methods or simulation-based methods are likely to work well because the integral is only one dimensional. The usual approach is to choose $f(y_{it})$ to be the density that is thought to best fit the data in the absence of individual effects, and to let $g(\alpha_i)$ be the normal density. The integral is then a univariate integral with respect to a normal random variable. For small T the integral can be well approximated by Gauss-Hermite quadrature (see Section 12.3.1), which approximates the integral with respect to a normal density by a weighted sum. Butler and Moffitt (1982) provide a detailed exposition for the random effects probit model. Skrondal and Rabe-Hesketh (2004) use quadrature. Alternatively, repeated draws from $g(\alpha_i)$ can be the basis for maximum simulated likelihood estimation (see Section 12.4.2).
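As an illustration of the quadrature approach, the sketch below approximates the integral in (23.18) for one individual in a random effects probit model with $\alpha_i \sim \mathcal{N}[0, \sigma_\alpha^2]$. It is not the Butler and Moffitt code; the function names `norm_cdf` and `re_probit_loglik_i`, the number of nodes, and the example data are assumptions made for the example. The change of variable $\alpha = \sqrt{2}\,\sigma_\alpha z$ maps the normal integral to the $e^{-z^2}$ weight used by `numpy.polynomial.hermite.hermgauss`.

```python
import numpy as np
from math import erf, sqrt, pi

def norm_cdf(z):
    """Standard normal cdf, vectorized through math.erf."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def re_probit_loglik_i(y_i, xb_i, sigma_alpha, n_nodes=20):
    """Log of the quadrature approximation to (23.18) for one individual.

    y_i         : (T,) array of 0/1 outcomes
    xb_i        : (T,) array of index values x_it' beta
    sigma_alpha : standard deviation of the normal random effect
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    alpha = sqrt(2.0) * sigma_alpha * nodes          # change of variable
    index = xb_i[:, None] + alpha[None, :]           # (T, n_nodes)
    p = norm_cdf(index)
    f_it = np.where(y_i[:, None] == 1, p, 1.0 - p)   # probit density per node
    integrand = f_it.prod(axis=0)                    # product over t
    return np.log((weights * integrand).sum() / sqrt(pi))

# Hypothetical data for one individual observed over T = 3 periods.
y_i = np.array([1, 0, 1])
xb_i = np.array([0.2, -0.1, 0.4])
ll = re_probit_loglik_i(y_i, xb_i, sigma_alpha=1.0)
```

Summing `re_probit_loglik_i` over individuals gives an approximation to (23.19) that can be passed to a numerical optimizer; the number of nodes would normally be increased until the estimates stabilize.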

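The preceding discussion assumed independence over t for given i. If instead $y_{it}$ and $y_{is}$ are correlated over t then it is more efficient to replace $\prod_t f(y_{it} \mid \mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta}, \boldsymbol{\gamma})$ by $f(\mathbf{y}_i \mid \mathbf{X}_i, \alpha_i, \boldsymbol{\beta}, \boldsymbol{\gamma})$ in (23.18) and (23.19).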

Random  Coefficients  Model 

The  random  effects  approach  can  clearly  be  generalized  to  a  random  coefficients 
model,  with  random  slopes  as  well  as  random  intercepts,  similar  to  the  linear  case  in 
Section  22.8. 

The natural model is a single-index model with conditional density $f(y_{it}, \mathbf{x}_{it}'(\boldsymbol{\beta} + \boldsymbol{\alpha}_i), \boldsymbol{\gamma})$ or conditional mean $g(\mathbf{x}_{it}'(\boldsymbol{\beta} + \boldsymbol{\alpha}_i))$. The univariate integral with respect to scalar $\alpha_i$ then becomes a multivariate integral with respect to the vector $\boldsymbol{\alpha}_i$, usually assumed to be normally distributed.


Correlated  Random  Effects  Model 

The key weakness of the random effects model is that it makes the strong assumption that the random effects are independent of regressors. To overcome this limitation Chamberlain (1980, 1982) proposed a correlated random effects model (for background discussion see Section 21.4.4) that specifies

$$\alpha_i = \mathbf{x}_{i1}'\boldsymbol{\pi}_1 + \cdots + \mathbf{x}_{iT}'\boldsymbol{\pi}_T + \xi_i. \tag{23.20}$$

The likelihood above is then maximized with respect to $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, $\boldsymbol{\pi}$, and the parameters of the density of $\xi_i$. Unlike in linear models, this model leads to a different estimator than that obtained using the simpler specification of Mundlak (1978) that

$$\alpha_i = \bar{\mathbf{x}}_i'\boldsymbol{\pi} + \xi_i. \tag{23.21}$$

The equation (23.20) can be viewed as an example of a hierarchical model. More general hierarchical models also permit random slopes, with estimation by classical or Bayesian methods. Section 22.8 presented details for the linear model.


Finite  Mixture  Model 


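The finite mixture model (see Section 18.5.1) provides an alternative model for the unobserved individual-specific effect. If there are m latent classes or types of individuals and for the jth type $\alpha_i = \alpha_j$ then (23.18) becomes

$$f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\pi}) = \sum_{j=1}^{m} \left[\prod_{t=1}^{T} f(y_{it} \mid \mathbf{x}_{it}, \alpha_j, \boldsymbol{\beta}, \boldsymbol{\gamma})\right] \pi_j,$$

where $\pi_j$ denotes the probability of the jth type.

This model is most often used for panel duration models (see Section 18.5.2).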



23.2.4.  Pooled  Models 

The pooled model does not explicitly model individual-specific effects. It extends linear pooled regression (see Section 21.5) to nonlinear models.


Conditional Mean Models

For conditional mean models the pooled model is

$$\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = g(\mathbf{x}_{it}, \boldsymbol{\beta}), \tag{23.22}$$

for specified function $g(\cdot)$.

The model (23.22) can be directly estimated by NLS, with inference based on panel-robust standard errors that control for conditional heteroskedasticity and for conditional correlation between $y_{it}$ and $y_{is}$. More efficient estimation is possible by modeling the heteroskedasticity and correlation. Details are provided in Section 23.2.6.


Pooled  versus  Random  Effects  Models 

What is the cost of ignoring individual-specific random effects?

The additive effect model $\mathrm{E}[y_{it} \mid \alpha_i, \mathbf{x}_{it}] = \alpha_i + g(\mathbf{x}_{it}, \boldsymbol{\beta})$ leads to (23.22) if $\mathrm{E}[\alpha_i \mid \mathbf{x}_{it}] = 0$. The multiplicative effect model $\mathrm{E}[y_{it} \mid \alpha_i, \mathbf{x}_{it}] = \alpha_i\, g(\mathbf{x}_{it}, \boldsymbol{\beta})$ implies (23.22) if $\mathrm{E}[\alpha_i \mid \mathbf{x}_{it}] = 1$. So the pooled model will lead to consistent estimation of $\boldsymbol{\beta}$ in a random effects model if the effects are additive or multiplicative and the standard normalizations of the mean of $\alpha_i$ for these models are used.

Otherwise, the pooled model is unlikely to lead to the same parameter estimates as an individual-specific random effects model. For example, consider a probit random effects model with $\mathrm{E}[y_{it} \mid \alpha_i, \mathbf{x}_{it}] = \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$, where $\alpha_i \sim \mathcal{N}[0, \sigma_\alpha^2]$. Then it can be shown that $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \Phi(\mathbf{x}_{it}'\boldsymbol{\beta}/\sqrt{1 + \sigma_\alpha^2})$, which differs from the natural pooled probit model $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \Phi(\mathbf{x}_{it}'\boldsymbol{\beta})$. Unlike the linear model of Chapter 21, if the true model has individual-specific random effects then ignoring these effects can lead to inconsistent parameter estimates of $\boldsymbol{\beta}$.
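The probit result can be seen from a latent-variable argument. The sketch below assumes the usual probit representation $y_{it} = \mathbf{1}[\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it} > 0]$ with $\varepsilon_{it} \sim \mathcal{N}[0, 1]$ independent of $\alpha_i$ and $\mathbf{x}_{it}$; this representation and the symbol $\varepsilon_{it}$ are introduced here for illustration.

```latex
\begin{align*}
\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]
  &= \Pr[\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it} > 0 \mid \mathbf{x}_{it}] \\
  &= \Pr[-(\alpha_i + \varepsilon_{it}) < \mathbf{x}_{it}'\boldsymbol{\beta} \mid \mathbf{x}_{it}],
     \qquad \alpha_i + \varepsilon_{it} \sim \mathcal{N}[0,\, 1 + \sigma_\alpha^2] \\
  &= \Phi\!\left(\frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \sigma_\alpha^2}}\right).
\end{align*}
```

Under these assumptions pooled probit estimates the rescaled parameter $\boldsymbol{\beta}/\sqrt{1 + \sigma_\alpha^2}$, which is smaller in absolute value than $\boldsymbol{\beta}$ whenever $\sigma_\alpha^2 > 0$.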

The statistics literature uses the pooled model approach extensively for panel versions of generalized linear models, such as binary data and count data. The resulting parameter estimates are called population averaged, as the random effects are averaged out. The approach is called marginal analysis, as $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$ is a model that is marginal with respect to the random effects.


Parametric  Models 

For pooled parametric models the starting point is usually

$$f(y_{it} \mid \mathbf{x}_{it}) = f(y_{it}, \mathbf{x}_{it}'\boldsymbol{\beta}, \boldsymbol{\gamma}) \tag{23.23}$$

for specified density $f(\cdot)$. This model can be directly estimated by ML, with inference based on panel-robust standard errors that control for conditional heteroskedasticity and correlation (see Section 23.2.6).

In general the pooled parametric model estimates of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are unlikely to be consistent with those from a random effects parametric model. The arguments are similar to those for the conditional mean.


23.2.5.  Fixed  versus  Random  Effects 

The  essential  result  that  random  effects  and  pooled  model  estimators  are  inconsistent 
if  individual-specific  effects  are  present  and  are  correlated  with  regressors  still  holds 
in  nonlinear  models.  This  favors  use  of  fixed  effects  models  on  grounds  of  robustness, 
though  there  is  a  trade-off  with  loss  of  efficiency  in  estimation.  A  Hausman  test  can 
be  used  (see  Section  21.4.4)  to  test  whether  a  fixed  effects  model  is  needed,  provided 
consistent  estimation  of  the  fixed  effects  model  is  possible. 

Other  comparisons  of  fixed  versus  random  effects  models  for  linear  models  (see 
Section  21.4)  require  some  adaptation  for  nonlinear  models. 

Because  of  the  incidental  parameters  problem,  not  all  nonlinear  models  with  fixed 
effects  admit  consistent  parameter  estimates.  So  fixed  effects  modeling  is  not  always 
feasible. 

If consistent estimation of a nonlinear fixed effects model is possible then, unlike the linear case, the coefficients of time-invariant regressors can be identified. To see this consider the mean-differenced transformation for an additive effects model. For a linear model $\mathrm{E}[(y_{it} - \bar{y}_i) - (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0$, with obvious problems for time-invariant regressors as then, considering the jth regressor, $x_{itj} - \bar{x}_{ij} = x_{ij} - x_{ij} = 0$. More generally, from (23.11)

$$\mathrm{E}\left[(y_{it} - \bar{y}_i) - \left(g(\mathbf{x}_{it}'\boldsymbol{\beta}) - \bar{g}_i(\boldsymbol{\beta})\right) \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}\right] = 0,$$

with no such simplification for nonlinear $g(\cdot)$ unless all K components of $\mathbf{x}_{it}$ are time-invariant.

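In fixed effect models with nonadditive effects it is not possible to predict changes in the dependent variable as regressors change. For the general model (23.2), the marginal effect $\partial \mathrm{E}[y_{it} \mid \mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta}]/\partial \mathbf{x}_{it} = \partial g(\mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta})/\partial \mathbf{x}_{it}$ depends on $\alpha_i$.

The marginal effect can be measured in two special cases. For additive effects (see (23.3)) the marginal effect is $\partial g(\mathbf{x}_{it}, \boldsymbol{\beta})/\partial \mathbf{x}_{it}$, which does not depend on $\alpha_i$. For multiplicative effects models (see (23.4)) the marginal effect is $\alpha_i\, \partial g(\mathbf{x}_{it}, \boldsymbol{\beta})/\partial \mathbf{x}_{it}$. Then it is possible to measure the relative size of marginal effects for changes in different regressors. In particular, if $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}, \alpha_i, \boldsymbol{\beta}] = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$, then $(\partial \mathrm{E}[y_{it}]/\partial x_{itj})/(\partial \mathrm{E}[y_{it}]/\partial x_{itk}) = \beta_j/\beta_k$.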
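The ratio result for the exponential mean follows in one line, since the unknown $\alpha_i$ and the exponential factor are common to both derivatives and cancel:

```latex
\begin{align*}
\frac{\partial \mathrm{E}[y_{it} \mid \mathbf{x}_{it}, \alpha_i]}{\partial x_{itj}}
  &= \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})\, \beta_j, \\
\frac{\partial \mathrm{E}[y_{it} \mid \mathbf{x}_{it}, \alpha_i] / \partial x_{itj}}
     {\partial \mathrm{E}[y_{it} \mid \mathbf{x}_{it}, \alpha_i] / \partial x_{itk}}
  &= \frac{\alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})\, \beta_j}
          {\alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})\, \beta_k}
   = \frac{\beta_j}{\beta_k}.
\end{align*}
```

So relative marginal effects are identified from $\boldsymbol{\beta}$ alone, even though the level of each marginal effect is not.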


23.2.6.  Estimation  and  Panel-Robust  Statistical  Inference 

The preceding analysis has concentrated on elimination of the incidental parameter $\alpha_i$. Now we detail estimation of model parameters, once $\alpha_i$ has been eliminated for models with individual-specific effects.

We assume a short panel and independence of observations over i. The dependent variable $y_{it}$ may be conditionally heteroskedastic and conditionally correlated over t for given i. The situation is similar to that in Section 21.2.3, except that nonlinear estimators are used instead of simpler linear LS estimators. Standard statistical output that ignores this complication will lead to invalid inference. In the following we present expressions for panel-robust estimates of the variance matrix of parameter estimates. Alternatively, a panel bootstrap can be used (see Section 11.6.2).


GMM  Estimation 

Panel GMM estimation is appropriate for models based on the conditional mean. The key is specification of the moment condition that is the basis of GMM estimation. Following Section 22.2.1, a natural starting point is

$$\mathrm{E}[\mathbf{Z}_i'\mathbf{u}_i(\boldsymbol{\theta})] = \mathbf{0}, \quad i = 1, \ldots, N, \tag{23.24}$$

where $\mathbf{Z}_i$ is a T x r matrix that depends on the regressors, $\mathbf{u}_i(\boldsymbol{\theta})$ is a T x 1 residual vector, and $\boldsymbol{\theta}$ is a q x 1 parameter vector. Different panel models lead to different specifications of $\mathbf{u}_i$ and $\mathbf{Z}_i$. An example is given in the following. A key departure from Chapter 22 is that the residual $\mathbf{u}_i(\boldsymbol{\theta})$ will be nonlinear in $\boldsymbol{\theta}$.

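If r = q then there are as many moment conditions as parameters to estimate and we can use the panel method of moments estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MM}}$ that solves

$$\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\boldsymbol{\theta}) = \mathbf{0}. \tag{23.25}$$

Using results in Section 6.10.3 on nonlinear systems estimation, we have that this estimator is asymptotically normal with variance matrix consistently estimated by

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = \left[\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{D}}_i\right]^{-1} \left[\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i\right] \left[\sum_{i=1}^{N} \hat{\mathbf{D}}_i'\mathbf{Z}_i\right]^{-1}, \tag{23.26}$$

where $\hat{\mathbf{D}}_i = \partial \mathbf{u}_i/\partial \boldsymbol{\theta}'|_{\hat{\boldsymbol{\theta}}}$ and $\hat{\mathbf{u}}_i = \mathbf{u}_i(\hat{\boldsymbol{\theta}})$. This yields panel-robust standard errors in short panels.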
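The sandwich form can be sketched in a few lines. The function name `panel_robust_variance`, the array layout, and the simulated linear example (where $\mathbf{Z}_i = \mathbf{X}_i$ and $\mathbf{D}_i = -\mathbf{X}_i$) are assumptions made for illustration; the code computes the inverse of the sum of $\mathbf{Z}_i'\hat{\mathbf{D}}_i$, times the sum of $\mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i$, times the transpose of that inverse, which is one reading of (23.26) for a just-identified estimator.

```python
import numpy as np

def panel_robust_variance(Z, D, u):
    """Panel-robust sandwich variance for a just-identified estimator (r = q).

    Z : (N, T, q) instruments Z_i
    D : (N, T, q) derivatives of u_i(theta) with respect to theta', at theta-hat
    u : (N, T)    residuals u_i(theta-hat)
    """
    A = np.einsum('itj,itk->jk', Z, D)   # sum_i Z_i' D_i
    s = np.einsum('itj,it->ij', Z, u)    # row i holds (Z_i' u_i)'
    B = s.T @ s                          # sum_i Z_i' u_i u_i' Z_i
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T

# Hypothetical linear example: residuals are correlated over t within i.
rng = np.random.default_rng(1)
N, T, q = 200, 5, 2
X = rng.normal(size=(N, T, q))
a = rng.normal(size=(N, 1))                       # common shock within individual
y = X @ np.array([1.0, -0.5]) + a + rng.normal(size=(N, T))

X2, y2 = X.reshape(N * T, q), y.reshape(N * T)
b = np.linalg.solve(X2.T @ X2, X2.T @ y2)         # pooled least squares
u = y - X @ b

V = panel_robust_variance(X, -X, u)               # D_i = -X_i in the linear case
se = np.sqrt(np.diag(V))
```

Because the middle term sums outer products of the individual-level scores $\mathbf{Z}_i'\hat{\mathbf{u}}_i$ rather than observation-level scores, arbitrary correlation over t for given i is allowed; in the linear case the result coincides with the familiar cluster-robust formula.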

If r > q then GMM estimation is necessary, and we use the panel GMM estimator $\hat{\boldsymbol{\theta}}_{\mathrm{GMM}}$ that minimizes

$$Q_N(\boldsymbol{\theta}) = \left[\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\boldsymbol{\theta})\right]' \mathbf{W}_N \left[\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\boldsymbol{\theta})\right], \tag{23.27}$$

where $\mathbf{W}_N$ is an r x r weighting matrix. The asymptotic variance matrix for this estimator can be obtained directly from results for the nonlinear systems IV estimator given in Section 6.10.4. Given the moment condition (23.24), the most efficient estimator uses $\mathbf{W}_N = [N^{-1}\sum_i \mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i]^{-1}$.

More efficient estimators are possible using alternative moment conditions. In particular, if the starting point is a particular conditional moment condition then the optimal unconditional moment condition for GMM estimation is given in Section 6.3.7. The GEE estimator given later follows this approach. A more general treatment is given in Avery, Hansen, and Hotz (1983) and Breitung and Lechner (1999).



GMM  Example 

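As a specific example, consider the first-differences transformation applied to the multiplicative fixed effects model. The starting point is the conditional moment restriction (23.14). This leads to many unconditional moment conditions, one of which is

$$\mathrm{E}\left[\mathbf{x}_{it}\left(y_{it} - \frac{g(\mathbf{x}_{it}'\boldsymbol{\beta})}{g(\mathbf{x}_{i,t-1}'\boldsymbol{\beta})}\,y_{i,t-1}\right)\right] = \mathbf{0}, \quad t = 1, \ldots, T, \quad i = 1, \ldots, N.$$

Assume data on $(y_{it}, \mathbf{x}_{it})$ are available for (T + 1) periods, with the initial period then lost because of first differencing. Stacking over T time periods yields (23.24) with $\mathbf{Z}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]'$ and $\mathbf{u}_i = [u_{i1}, \ldots, u_{iT}]'$, where $u_{it} = y_{it} - [g(\mathbf{x}_{it}'\boldsymbol{\beta})/g(\mathbf{x}_{i,t-1}'\boldsymbol{\beta})]\,y_{i,t-1}$. Here $\mathbf{Z}_i'\mathbf{u}_i = \sum_t \mathbf{x}_{it}u_{it}$, so the method of moments estimator $\hat{\boldsymbol{\beta}}$ solves

$$\sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}\left(y_{it} - \frac{g(\mathbf{x}_{it}'\hat{\boldsymbol{\beta}})}{g(\mathbf{x}_{i,t-1}'\hat{\boldsymbol{\beta}})}\,y_{i,t-1}\right) = \mathbf{0}.$$

Clearly, additional moment conditions can be used, such as $\mathrm{E}[\mathbf{x}_{i,t-1}u_{it}] = \mathbf{0}$, leading to an overidentified model and estimation by GMM. This was discussed extensively for the linear model in Section 22.2.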
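The quasi-differenced residual and its sample moment are easy to code. The sketch below assumes an exponential mean, $g(\cdot) = \exp(\cdot)$, and uses hypothetical function names (`quasi_diff_residuals`, `sample_moment`) and simulated noise-free data, so that the fixed effect can be seen to drop out exactly at the true $\boldsymbol{\beta}$.

```python
import numpy as np

def quasi_diff_residuals(y, x, beta):
    """u_it = y_it - [g(x_it'b) / g(x_i,t-1'b)] y_i,t-1, with g = exp (assumed).

    y : (N, T+1) outcomes; x : (N, T+1, K) regressors.
    The first period is used only as a lag, so T residuals remain.
    """
    mu = np.exp(x @ beta)                                   # (N, T+1)
    return y[:, 1:] - (mu[:, 1:] / mu[:, :-1]) * y[:, :-1]

def sample_moment(y, x, beta):
    """(1/N) sum_i sum_t x_it u_it(beta), the sample analogue of the moment."""
    u = quasi_diff_residuals(y, x, beta)
    return np.einsum('itk,it->k', x[:, 1:, :], u) / y.shape[0]

# Hypothetical noise-free data: y_it = alpha_i * exp(x_it' beta0).
rng = np.random.default_rng(7)
N, T, K = 50, 4, 2
beta0 = np.array([0.5, -0.3])
x = rng.normal(size=(N, T + 1, K))
alpha = np.exp(rng.normal(size=(N, 1)))                     # multiplicative effects
y = alpha * np.exp(x @ beta0)

m_true = sample_moment(y, x, beta0)           # alpha_i cancels, so this is ~0
m_wrong = sample_moment(y, x, beta0 + 0.5)    # away from beta0 it is not
```

A method of moments estimate would be obtained by passing `sample_moment` to a root finder; with more instruments than parameters the same residuals feed into the GMM objective (23.27).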


Generalized  Estimating  Equations  Estimation 

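The pooled model for the conditional mean specifies $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = g(\mathbf{x}_{it}, \boldsymbol{\beta})$ (see Section 23.2.4). This model can be estimated by GMM methods already given. Here we go further and consider efficient GMM estimation.

Stacking over all T observations gives conditional moment condition

$$\mathrm{E}[\mathbf{y}_i - \mathbf{g}_i(\boldsymbol{\beta}) \mid \mathbf{X}_i] = \mathbf{0}, \tag{23.28}$$

where $\mathbf{g}_i(\boldsymbol{\beta}) = [g(\mathbf{x}_{i1}, \boldsymbol{\beta}), \ldots, g(\mathbf{x}_{iT}, \boldsymbol{\beta})]'$ and $\mathbf{X}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]'$. The optimal unconditional moment condition to use in estimation is then

$$\mathrm{E}\left[\frac{\partial \mathbf{g}_i'(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \left\{\mathrm{V}[\mathbf{y}_i \mid \mathbf{X}_i]\right\}^{-1} (\mathbf{y}_i - \mathbf{g}_i(\boldsymbol{\beta}))\right] = \mathbf{0}, \tag{23.29}$$

a result obtained by applying the general result given in Section 6.3.7. This leads to the generalized estimating equations estimator $\hat{\boldsymbol{\beta}}_{\mathrm{GEE}}$ that solves

$$\sum_{i=1}^{N} \frac{\partial \mathbf{g}_i'(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\, \boldsymbol{\Sigma}_i^{-1} (\mathbf{y}_i - \mathbf{g}_i(\boldsymbol{\beta})) = \mathbf{0}, \tag{23.30}$$

where $\boldsymbol{\Sigma}_i$ is a working variance matrix for $\mathrm{V}[\mathbf{y}_i \mid \mathbf{X}_i]$. The asymptotic variance matrix of $\hat{\boldsymbol{\beta}}_{\mathrm{GEE}}$ is given by (23.26) with $\hat{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{g}_i(\hat{\boldsymbol{\beta}})$ and $\mathbf{Z}_i' = \partial \mathbf{g}_i'(\boldsymbol{\beta})/\partial \boldsymbol{\beta}|_{\hat{\boldsymbol{\beta}}} \times \hat{\boldsymbol{\Sigma}}_i^{-1}$. This variance estimate is panel-robust and is also robust to misspecification of $\boldsymbol{\Sigma}_i$.

The GEE estimator, due to Liang and Zeger (1986), is widely used in the statistics literature for panel versions of generalized linear models. Different GLMs correspond to different conditional mean functions $\mathbf{g}_i(\boldsymbol{\beta})$ and working variance matrices $\boldsymbol{\Sigma}_i$.

ML Estimation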

For likelihood-based models the starting point is the joint density for all T observations on the ith individual, $f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\theta})$. For pooled parametric models $\boldsymbol{\theta} = [\boldsymbol{\beta}' \; \boldsymbol{\gamma}']'$ (see (23.23)), and for random effects parametric models $\boldsymbol{\theta} = [\boldsymbol{\beta}' \; \boldsymbol{\gamma}' \; \boldsymbol{\eta}']'$ (see (23.18)).

The standard approach is to let $f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\theta}) = \prod_{t=1}^{T} f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol{\theta})$, where $f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol{\theta})$ is the density for the (i, t)th observation. The implicit assumption of independence over t for given i is usually unwarranted, especially for pooled models that do not include a random effect that permits some correlation over time. Nonetheless, consistent estimates of $\boldsymbol{\theta}$ are obtained even if $f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\theta})$ is misspecified, provided $f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol{\theta})$ is correctly specified. A sandwich form should then be used for the estimator variance matrix to ensure panel-robust standard errors. The MLE is strictly a quasi-MLE, with detailed discussion given in Section 5.7.5. More generally, this approach is an example of inference with clustered data (see Section 24.5).

More efficient estimation is possible using a richer model for $f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol{\theta})$ that accommodates correlation over time. However, nonnormal multivariate distributions for $\mathbf{y}_i$ can be restrictive or difficult to work with. For pooled GLMs the GEE estimator can be used instead.


23.2.7.  Dynamic  Models 

Dynamic models with individual-specific effects are of considerable interest as they allow one to distinguish between true state dependence and spurious dependence caused by unobserved heterogeneity (see Section 22.5.1).

For nonlinear models it is not always obvious how to include lagged dependent variables as regressors, since for some types of data there is not always a standard pure time series model. This is illustrated in Section 23.7.4 for the Poisson model. Once an appropriate specification is determined, the standard fixed effects estimators become inconsistent and random effects estimators need to incorporate initial conditions, as was the case for the linear panel model.


Pooled  Models 

The pooled model ignores random effects and estimates the usual cross-section model where the regressors now include lagged dependent variables. The discussion in Section 23.2.4 is again relevant.


Fixed  Effects  Models 

For  fixed  effects  models  the  issues  are  similar  to  those  presented  in  Section  22.5.  The 
regressors  are  now  weakly  exogenous  rather  than  strongly  exogenous.  The  usual  fixed 
effects  estimators  are  inconsistent. 

For models with additive effects or multiplicative effects consistent estimation is possible using the first-difference transformation (see Section 23.2.2) and higher lags of the lagged dependent variable as an instrument. For additive effects models this leads to a nonlinear version of the Arellano-Bond estimator given in Section 22.5.3. For multiplicative effects the first-difference transformation is detailed in Section 23.7.4. For dynamic logit with fixed effects see Section 23.4.3.


Parametric  Random  Effects  Models 


For parametric random effects models initial conditions on the lagged dependent variable matter. Usually there is no satisfactory treatment, so the estimates are inconsistent in short panels with inconsistency that declines as T gets larger.

Consider the simplest case where only the first-period lag appears in the model, so the regressors $\mathbf{x}_{it}$ become regressors $\mathbf{x}_{it}$ and $y_{i,t-1}$. The random effects density (23.1) becomes $f(y_{it} \mid y_{i,t-1}, \mathbf{x}_{it}, \alpha_i, \boldsymbol{\delta})$ for t = 2, ..., T. However, a similar model for $y_{i1}$ cannot be included because $y_{i0}$ is not observed. One approach treats $y_{i1}$ as exogenous, so that we model the conditional distribution for only the T - 1 observations $y_{iT}, \ldots, y_{i2}$. An alternative approach presumes a static model for $y_{i1}$ that depends on regressors $\mathbf{x}_{i1}$ and possibly on the individual-specific effect $\alpha_i$. Then the joint conditional density of $\mathbf{y}_i$ is

$$f(\mathbf{y}_i \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \boldsymbol{\delta}, \boldsymbol{\delta}_1, \boldsymbol{\eta}) = \int \left[\prod_{t=2}^{T} f(y_{it} \mid y_{i,t-1}, \mathbf{x}_{it}, \alpha_i, \boldsymbol{\delta})\right] f_1(y_{i1} \mid \mathbf{x}_{i1}, \alpha_i, \boldsymbol{\delta}_1)\, g(\alpha_i \mid \boldsymbol{\eta})\, d\alpha_i,$$

rather than (23.18), where $f_1(y_{i1} \mid \mathbf{x}_{i1}, \alpha_i, \boldsymbol{\delta}_1)$ is the assumed density for the first observation.

In pure time series analysis initial conditions become irrelevant asymptotically, since $T \to \infty$. In short panels, however, they become very important as T is small and asymptotics instead use $N \to \infty$.


23.2.8.  Endogenous  Regressors 

The treatment for endogenous variables in nonlinear models is similar to that in the linear case presented in Chapter 22.

Panel GMM is the natural framework. The starting point is a conditional moment restriction $\mathrm{E}[\mathbf{u}_i(\boldsymbol{\theta}) \mid \mathbf{Z}_i] = \mathbf{0}$ for appropriately defined residual $\mathbf{u}_i(\boldsymbol{\theta})$ and instruments $\mathbf{Z}_i$. This leads to unconditional moment condition (23.24) that is the basis for GMM estimation. Possible candidates for instruments can include exogenous regressors from periods other than the current one, as detailed in Sections 22.2 and 22.4 for the linear model.


23.3.  Nonlinear  Panel  Example:  Patents  and  R&D 

We model the relationship between patents and R&D expenditures, using U.S. data on 346 firms for each of the five years 1975-1979 from Hall, Griliches, and Hausman (1986). The dependent variable $y_{it}$ is Patents, defined as the number of patents applied for during the year that were eventually granted. For simplicity we consider just one explanatory variable $x_{it}$, real R&D spending during the year (in 1972 dollars).

Figure 23.1: Patents and R&D spending: pooled (overall) regression. Natural logarithm of patent applications leading to award plotted against the natural logarithm of R&D spending for 346 firms in each of the five years 1975-79. Zero patents recoded to 0.5 patents.


An obvious starting model is a log-log model, with $\mathrm{E}[\ln y_{it} \mid x_{it}] = \alpha_i + \beta \ln x_{it}$, since then $\beta$ equals the Patents-R&D elasticity. This model cannot be applied here, as $y_{it} = 0$ for a considerable number of observations and ln 0 is not defined. An ad hoc adjustment is to recode $y_{it} = 0$ as $y_{it} = 0.5$ before taking logs.

Figure 23.1 provides a plot of the adjusted ln(Patents) against ln(R&D), along with fitted OLS (with an estimated slope coefficient of 0.834) and nonparametric regression curves, using data for all firms in all years. Patents clearly increase with R&D expenditure. Panel data analysis, particularly fixed effects models, can separate this relationship into cross-section and time-series components. Note that Patents vary greatly across observations, particularly across firms, with a mean of 36.3, a standard deviation of 74.5, and a range of 0 to 608 over all years and firms.

We estimate a multiplicative individual-effects model for the conditional mean with

$$\mathrm{E}[y_{it} \mid x_{it}, \alpha_i] = \alpha_i \exp(\beta \ln x_{it}) = \exp(\gamma_i + \beta \ln x_{it}), \tag{23.31}$$

where $\gamma_i = \ln \alpha_i$. Then $\beta$ directly estimates the Patents-R&D elasticity, since (23.31) implies $\partial \ln \mathrm{E}[y_{it} \mid x_{it}]/\partial \ln x_{it} = \beta$. Unlike the log-log model, zero values for $y_{it}$ cause no problems.

A richer parametric model recognizes that the dependent variable is a count. A starting point is a Poisson model

$$y_{it} \mid x_{it}, \gamma_i \sim \mathcal{P}[\exp(\gamma_i + \beta \ln x_{it})]. \tag{23.32}$$

This model, detailed in Section 23.7, has the same conditional mean for $y_{it}$ as that given in (23.31).

Table 23.1 presents a number of estimators for these data. All estimators are consistent under the assumption that the conditional mean is given by (23.31) with $\alpha_i$ a random effect that is independent of $x_{it}$ and has constant mean. All estimators except the last are inconsistent under the assumption that $\alpha_i$ is instead a fixed effect that is correlated with $x_{it}$. Three standard error estimates are provided: program defaults, panel-robust estimates (where available), and bootstrap estimates (without refinement).

Table 23.1. Patents and R&D Spending: Nonlinear Panel Model Estimators(a)

                 NLS       Poisson   GEE       Poisson-RE   Poisson-FE
γ = ln α         2.529     1.712     2.068     2.313        -
β                .509      .693      .560      .349         -0.038
Panel se(b)      (.055)    (.043)    (.033)    (.033)       (.033)
Boot se          [.054]    [.047]    [.107]    [.119]       [.107]
Usual se         {.011}    {.002}    {.004}    {.033}       {.033}
Sum              -         .486      .460      .546         .313
N                1730      1730      1730      1730         1620

(a) Shown are pooled NLS, pooled Poisson, pooled GEE, Poisson Random Effects (RE), and Poisson Fixed Effects estimates for the nonlinear panel regression (23.31) of ln(Patents) on ln(R&D). Standard errors for the slope coefficients are panel robust in parentheses, bootstrap in square brackets, and usual estimates that assume iid errors in curly braces. The second to last row gives the sum of β coefficients in an expanded model with up to five lags of ln(R&D) as regressors.
(b) se, standard error.

The details for each column are as follows:

Pooled NLS: The NLS estimates in the first column estimate (23.31) with $\alpha_i = \alpha$ by NLS (see Section 5.8). The default standard error of 0.011 assuming iid errors is much smaller than the correct panel-robust standard error estimate of 0.054.

Pooled Poisson: The Poisson estimates in the second column are for the Poisson model (23.32) with $\alpha_i = \alpha$ estimated by the Poisson MLE assuming independence over i and t. The estimated elasticity is 0.693 compared to the NLS estimate of 0.509. The default standard error of 0.002 imposes the Poisson restriction of variance-mean equality (see Section 20.2.2). Correcting for overdispersion using the sandwich variance matrix estimate (see also Section 20.2.2) increases the standard error estimate to 0.020 and emphasizes the importance of controlling for any overdispersion in count data. Additionally controlling for correlation over t for given i leads to an even higher panel-robust standard error estimate of 0.043.

Pooled GEE: The pooled GEE estimator solves (23.30), where $g(x_{it}, \beta)$ is given by (23.32) with $\alpha_i = \alpha$. The particular specification of the working matrix $\boldsymbol{\Sigma}_i$ used here is given after (23.55). The estimated elasticity is 0.560 with standard error of 0.033 using the panel-robust estimate discussed after (23.30).

Poisson-RE: The Poisson random effects estimator assumes that $\alpha_i$, where $\gamma_i = \ln \alpha_i$, is gamma distributed (see Section 23.7.2). The estimated elasticity is 0.349 with default standard error of 0.033.

Poisson-FE: The Poisson fixed effects estimator assumes that $\alpha_i$, where $\gamma_i = \ln \alpha_i$, is a fixed effect, and it is estimated as in Section 23.7.3. The estimated elasticity of -0.038 is now negative, with default standard error of 0.033. For the Poisson fixed effect model, firms with $y_{it} = 0$ in every period are dropped, leading here to a loss of 22 x 5 = 110 observations.

There is a big difference between fixed and random effects results, favoring fixed effects estimation. The surprising negative estimated elasticity with FE arises because the model is too simple. In particular, R&D expenditure affects patent activity with a lag. Replacing $\beta \ln x_{it}$ in (23.31) and (23.32) by $\sum_{l=0}^{5} \beta_l \ln x_{i,t-l}$ leads to the estimated elasticity $\sum_l \hat{\beta}_l$ given in the second last row of Table 23.1. The FE estimate of 0.313 is still less than the other estimates, but the difference is now reduced.

23.4.  Binary  Outcome  Data 

We consider a binary outcome in which $y_{it}$ takes only the values 0 and 1. For example,
data  may  be  available  on  whether  or  not  an  individual  is  employed  in  each  of  several 
time  periods.  A  key  result  is  that  fixed  effects  estimation  is  possible  for  the  logit  model 
but  not  the  probit  model. 

23.4.1.  Individual-Specific  Effects  Binary  Models 

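The natural extension of the binary outcome model from cross-section data (see Section
14.3) to panel data with individual-specific effects is to specify that $y_{it}$ takes only
the values 0 and 1, with

$$
\Pr[y_{it} = 1 | \mathbf{x}_{it}, \beta, \alpha_i] =
\begin{cases}
F(\alpha_i + \mathbf{x}_{it}'\beta) & \text{in general}, \\
\Lambda(\alpha_i + \mathbf{x}_{it}'\beta) & \text{for logit model}, \\
\Phi(\alpha_i + \mathbf{x}_{it}'\beta) & \text{for probit model},
\end{cases}
\tag{23.33}
$$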

where $F(\cdot)$ is a cumulative distribution function, $\Lambda(\cdot)$ is the logistic cdf with
$\Lambda(z) = e^z/(1 + e^z)$, and $\Phi(\cdot)$ is the standard normal cdf. Given (23.33) and assuming
conditional independence, the joint density for the ith observation $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$ is

$$
f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta) = \prod_{t=1}^{T} F(\alpha_i + \mathbf{x}_{it}'\beta)^{y_{it}} \left(1 - F(\alpha_i + \mathbf{x}_{it}'\beta)\right)^{1 - y_{it}}.
\tag{23.34}
$$

For  binary  data  the  conditional  probability  is  also  the  conditional  mean,  so 

$$
E[y_{it} | \alpha_i, \mathbf{x}_{it}] = F(\alpha_i + \mathbf{x}_{it}'\beta).
\tag{23.35}
$$

This is a single-index individual-specific effects model (see (23.5)) that does not simplify
to either an additive or multiplicative effects model. Additive and multiplicative
effects  models  are  not  appropriate  as  they  do  not  restrict  the  conditional  mean  and 
conditional  probability  to  lie  between  zero  and  one. 

Binary  panel  models  emphasize  the  parametric  model  (23.34),  since  binary  data 
must  be  Bernoulli  distributed.  The  conditional  mean  model  (23.35)  is  rarely  used, 
though  it  is  natural  to  use  this  if  regressors  are  endogenous. 

23.4.2.  Random  Effects  Binary  Models 

The random effects MLE assumes that the individual effects are normally distributed,
with $\alpha_i \sim \mathcal{N}[0, \sigma_\alpha^2]$. The random effects MLE of $\beta$ and $\sigma_\alpha^2$ maximizes the
log-likelihood $\sum_{i=1}^{N} \ln f(\mathbf{y}_i | \mathbf{X}_i, \beta, \sigma_\alpha^2)$, where

$$
f(\mathbf{y}_i | \mathbf{X}_i, \beta, \sigma_\alpha^2) = \int f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta) \frac{1}{\sqrt{2\pi\sigma_\alpha^2}} \exp\left(\frac{-\alpha_i^2}{2\sigma_\alpha^2}\right) d\alpha_i,
\tag{23.36}
$$

where $f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta)$ is given in (23.34) with $F = \Lambda$ for the logit model and $F = \Phi$
for the probit model. There is no closed-form solution for the integral (23.36) and it is
standard to compute it numerically using quadrature methods.
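
The numerical integration in (23.36) can be sketched in a few lines. The block below is a minimal illustration only, assuming a logit model with a single scalar regressor; it replaces the Gaussian quadrature that is standard in practice with a plain midpoint rule so that it needs nothing beyond the standard library. The function names, the grid settings `n_points` and `width`, and the regressor values are all hypothetical.

```python
import math
from itertools import product

def logistic_cdf(z):
    """Logistic cdf Lambda(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def conditional_density(y, x, alpha, beta):
    """Joint density (23.34) for one individual, with F = Lambda and a scalar regressor."""
    dens = 1.0
    for y_t, x_t in zip(y, x):
        p = logistic_cdf(alpha + beta * x_t)
        dens *= p if y_t == 1 else 1.0 - p
    return dens

def re_density(y, x, beta, sigma_alpha, n_points=400, width=8.0):
    """Approximate (23.36) by a midpoint rule on [-width*sigma_alpha, width*sigma_alpha]."""
    lo = -width * sigma_alpha
    step = 2.0 * width * sigma_alpha / n_points
    total = 0.0
    for k in range(n_points):
        a = lo + (k + 0.5) * step
        normal_pdf = math.exp(-a * a / (2.0 * sigma_alpha ** 2)) / math.sqrt(
            2.0 * math.pi * sigma_alpha ** 2)
        total += conditional_density(y, x, a, beta) * normal_pdf * step
    return total

x_i = [0.5, -1.0, 2.0]  # hypothetical regressor values for T = 3
# The marginal probabilities of all 2^T outcome sequences should sum to one.
total_prob = sum(re_density(y, x_i, beta=0.7, sigma_alpha=1.5)
                 for y in product([0, 1], repeat=3))
```

Summing `re_density` over all eight outcome sequences is a cheap check that the integration grid is wide and fine enough. A production implementation would more likely use Gauss-Hermite nodes and weights.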

If  fixed  effects  are  not  present,  then  an  alternative  to  the  random  effects  model  is 
a pooled binary model that simply specifies that $\Pr[y_{it} = 1 | \mathbf{x}_{it}] = F(\mathbf{x}_{it}'\beta)$. Statistical
inference  should  then  be  based  on  panel-robust  standard  errors  (see  Section  23.2.6). 
More  efficient  estimation  is  possible  using  a  GMM  approach  (see  Avery  et  al.,  1983) 
or  a  GEE  approach  (see  Liang  and  Zeger,  1986). 


23.4.3.  Fixed  Effects  Logit 


Fixed  effects  estimation  is  possible  for  the  panel  logit  model,  using  the  conditional 
MLE,  but  not  for  other  binary  panel  models  such  as  panel  probit. 

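For the logit model, performing some algebra given in Section 23.4.6 yields that the
joint density of $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$ is

$$
f(\mathbf{y}_i | \alpha_i, \mathbf{X}_i, \beta) = \frac{\exp\left(\alpha_i \sum_t y_{it}\right) \exp\left(\left(\sum_t y_{it}\mathbf{x}_{it}'\right)\beta\right)}{\prod_t \left[1 + \exp(\alpha_i + \mathbf{x}_{it}'\beta)\right]}.
\tag{23.37}
$$

This depends on $\alpha_i$, which we need to eliminate. For observation i there are $\sum_t y_{it}$
outcomes of 1 in the T periods. Define the set $B_c = \{\mathbf{d}_i \,|\, \sum_t d_{it} = \sum_t y_{it} = c\}$ to be
the set of all possible sequences of 0s and 1s for which the sum of T binary outcomes
is $\sum_t y_{it} = c$. Then if we condition on $\sum_t y_{it} = c$ it is shown in Section 23.4.6 that $\alpha_i$ is
eliminated and

$$
f\left(\mathbf{y}_i \,\Big|\, \sum_t y_{it} = c, \mathbf{X}_i, \beta\right) = \frac{\exp\left(\left(\sum_t y_{it}\mathbf{x}_{it}'\right)\beta\right)}{\sum_{\mathbf{d} \in B_c} \exp\left(\left(\sum_t d_{it}\mathbf{x}_{it}'\right)\beta\right)},
\tag{23.38}
$$

a result due to Chamberlain (1980). The density (23.38) is the basis for conditional
ML estimation. The only complication is that there are many sets $B_c$ and sequences
within these sets, as we now detail.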

First, it is not possible to condition on $\sum_t y_{it} = 0$, since this can only occur if all
$y_{it} = 0$, and similarly for $\sum_t y_{it} = T$. This can mean considerable loss of observations
if, for example, most people are employed in all periods.

As an example where conditioning works, suppose $T = 2$ and $\sum_t y_{it} = 1$. Then either
the sequence {0, 1} or {1, 0} is possible, and the conditional probability in (23.38)
implies that, for example,

$$
\Pr[y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1] = \frac{\exp(\mathbf{x}_{i2}'\beta)}{\exp(\mathbf{x}_{i1}'\beta) + \exp(\mathbf{x}_{i2}'\beta)} = \frac{\exp\left((\mathbf{x}_{i2} - \mathbf{x}_{i1})'\beta\right)}{1 + \exp\left((\mathbf{x}_{i2} - \mathbf{x}_{i1})'\beta\right)}.
$$

If $T = 3$ then we can condition on $\sum_t y_{it} = 1$, with possible sequences {0, 0, 1},
{0, 1, 0} and {1, 0, 0}, or on $\sum_t y_{it} = 2$, with possible sequences {0, 1, 1}, {1, 0, 1}
and {1, 1, 0}. Clearly for large T there are many sequences and the conditional density
can get complicated.

The conditional density is that of a conditional logit model, where parameters are
invariant but regressors vary over alternatives. The number of alternatives varies across
individuals, where for individual i each alternative is a specific sequence of 0s and 1s
that sum to $\sum_t y_{it}$. It is easiest to use computer code specifically set up for this problem.
Even then there can be a large number of alternatives. For example, if $T = 10$ and
$\sum_t y_{it} = 5$ then there are 252 alternatives. Consistent but less efficient estimation is
possible by dropping some observations, such as for individuals with many alternatives
because of a high $\sum_t y_{it}$, or by reducing the number of time periods.
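
The enumeration of alternatives behind (23.38) can be made concrete with a short sketch. It assumes a single scalar regressor, and the function names and regressor values are hypothetical; it is meant only to show how the set of sequences is built and how its size grows, not to stand in for purpose-built conditional logit software.

```python
import math
from itertools import combinations

def sequences_with_sum(T, c):
    """All 0/1 sequences of length T with exactly c ones (the set B_c)."""
    seqs = []
    for ones in combinations(range(T), c):
        seqs.append(tuple(1 if t in ones else 0 for t in range(T)))
    return seqs

def conditional_prob(y, x, beta):
    """Conditional probability (23.38) of sequence y given sum(y), scalar regressor."""
    def index(d):
        return beta * sum(d_t * x_t for d_t, x_t in zip(d, x))
    denom = sum(math.exp(index(d)) for d in sequences_with_sum(len(y), sum(y)))
    return math.exp(index(y)) / denom

# With T = 10 and five outcomes of 1 there are C(10, 5) alternatives.
n_alternatives = len(sequences_with_sum(10, 5))

# T = 2 example: probability of (0, 1) given exactly one outcome of 1.
p = conditional_prob((0, 1), (1.0, 2.0), beta=0.2)
```

Here `n_alternatives` reproduces the count of 252 quoted in the text, and `p` equals exp(0.2)/(1 + exp(0.2)) because the regressor rises by one unit between the two periods.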

The elimination of the individual effects $\alpha_i$ makes it impossible to interpret regression
coefficients using the original model (23.37). Instead, we use the conditional
model (23.38). For example, suppose we have a single regressor and $\beta = 0.2$. Then if
we consider two time periods and condition on $y_{i1} + y_{i2} = 1$, then

$$
\Pr[y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1] = \frac{\exp(\beta(x_{i2} - x_{i1}))}{1 + \exp(\beta(x_{i2} - x_{i1}))}.
$$

It follows that a one-unit difference in $x_{i2}$ versus $x_{i1}$ leads to a conditional probability
of this sequence being $\exp(\beta)/[1 + \exp(\beta)]$, compared to a probability of one-half if
$x_{i1} = x_{i2}$.
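
As a worked check on this interpretation, the value $\beta = 0.2$ from the example can be plugged in, taking the regressor to be one unit higher in the second period (the one-unit difference is the only assumption added here):

```latex
\Pr[y_{i1}=0,\ y_{i2}=1 \mid y_{i1}+y_{i2}=1]
  = \frac{e^{0.2}}{1+e^{0.2}}
  \approx \frac{1.2214}{2.2214}
  \approx 0.550 .
```

So the sequence {0, 1} has conditional probability of about 0.55 rather than 0.50, and the conditional odds of {0, 1} against {1, 0} are $e^{0.2} \approx 1.22$.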


23.4.4.  Dynamic  Binary  Models 

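Suppose we have a pure time series first-order Markov logit model with no regressors
other than the lagged dependent variable:

$$
\Pr[y_{it} = 1 | \alpha_i, y_{i,t-1}] = \frac{\exp(\alpha_i + \gamma y_{i,t-1})}{1 + \exp(\alpha_i + \gamma y_{i,t-1})}.
\tag{23.39}
$$

Then performing some algebra given in Section 23.4.6 gives

$$
f\left(\mathbf{y}_i \,\Big|\, y_{i1}, y_{iT}, \sum_{t=2}^{T-1} y_{it}, \gamma\right) = \frac{\exp\left(\gamma \sum_{t=2}^{T} y_{it} y_{i,t-1}\right)}{\sum_{\mathbf{d} \in C_i} \exp\left(\gamma \sum_{t=2}^{T} d_{it} d_{i,t-1}\right)},
\tag{23.40}
$$

where the set $C_i = \{\mathbf{d}_i \,|\, d_{i1} = y_{i1}, d_{iT} = y_{iT}, \sum_t d_{it} = \sum_t y_{it}\}$ is the set of all possible sequences
of 0s and 1s for which the sum of T binary outcomes is $\sum_t y_{it}$, the first outcome is
$y_{i1}$, and the last outcome is $y_{iT}$.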

Conditional ML estimation based on (23.40) leads to a consistent estimate of $\gamma$.
The minimum number of time periods needed is four. For example, if $\mathbf{y}_i$ is the sequence
{0, 1, 0, 1} then the set $C_i$ is composed of the sequences {0, 1, 0, 1} and
{0, 0, 1, 1}. The approach is due to Chamberlain (1985), who actually considered
a second-order Markov model. Chay, Hoynes, and Hyslop (2001) apply this method
to California administrative data on welfare spells and find that, after controlling for
unobserved individual heterogeneity, there remains true state dependence in welfare
participation.
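
The construction of the conditioning set in (23.40) is easy to check by brute force. The sketch below enumerates all sequences that share the first outcome, last outcome, and sum of a given sequence, and then evaluates the conditional probability; the function names and the value of `gamma` are illustrative assumptions.

```python
import math
from itertools import product

def set_C(y):
    """Sequences with the same first outcome, last outcome, and sum as y."""
    T = len(y)
    return [d for d in product([0, 1], repeat=T)
            if d[0] == y[0] and d[-1] == y[-1] and sum(d) == sum(y)]

def transitions(d):
    """Sum over t = 2..T of d_t * d_{t-1}."""
    return sum(d[t] * d[t - 1] for t in range(1, len(d)))

def conditional_prob(y, gamma):
    """Conditional probability (23.40) of y given y_1, y_T and sum(y)."""
    denom = sum(math.exp(gamma * transitions(d)) for d in set_C(y))
    return math.exp(gamma * transitions(y)) / denom

C = set_C((0, 1, 0, 1))                        # the two sequences named in the text
p = conditional_prob((0, 1, 0, 1), gamma=0.5)  # equals 1 / (1 + e^{0.5})
only_one = set_C((0, 1, 1))                    # with T = 3 the set has a single element
```

With three periods the first outcome, last outcome, and sum pin down the whole sequence, so the conditional probability is identically one and carries no information about $\gamma$. This is one way to see why four periods are needed.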

The preceding results and discussion apply to pure time series models. Honoré and
Kyriazidou (2000) provided a method that allows regressors other than the lagged
dependent variable. Thus let (23.39) become

$$
\Pr[y_{it} = 1 | \alpha_i, y_{i,t-1}, \mathbf{x}_{it}] = \frac{\exp(\alpha_i + \mathbf{x}_{it}'\beta + \gamma y_{i,t-1})}{1 + \exp(\alpha_i + \mathbf{x}_{it}'\beta + \gamma y_{i,t-1})}.
\tag{23.41}
$$

Consider four time periods and consider sequences with common binary outcomes in
the first and fourth periods, say $d_1$ and $d_4$. Then the probability that the sequence is
$\{d_1, 0, 1, d_4\}$, given that it is either $\{d_1, 0, 1, d_4\}$ or $\{d_1, 1, 0, d_4\}$, now depends on $\alpha_i$.
However, the dependence on $\alpha_i$ disappears if $\mathbf{x}_{i3} = \mathbf{x}_{i4}$. Since few observations have
$\mathbf{x}_{i3} = \mathbf{x}_{i4}$, especially with continuous data, Honoré and Kyriazidou (2000) propose
use of kernel smoothing methods with kernel weights that depend on $(\mathbf{x}_{i3} - \mathbf{x}_{i4})$. Chay
and Hyslop (2000) provide an application that implements this method and many other
methods for dynamic binary data models.


23.4.5.  Multinomial  Models 

The fixed effects estimator can be generalized to the multinomial logit model, since
this model implies a binary logit model for pairwise comparison of alternatives (see
Section 15.4.3). For static models Chamberlain (1980) provides a brief exposition and
M.-J. Lee (2002) provides more details. Magnac (2000) provides a quite detailed empirical
application to individual transitions among six different states in the French
labor market using dynamic fixed effects logit models with no regressors other than
lagged dependent variables. Honoré and Kyriazidou (2000) consider the multinomial
logit model.

For other multinomial models a random effects approach is necessary. These models,
such as mixed logit and multinomial probit, are complicated to estimate even in
the cross-section case. For details see Train (2003).


23.4.6.  Derivations  for  Fixed  Effects  Logit 

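For simplicity suppress the subscript i. For the logit model the joint probability of
$\mathbf{y} = (y_1, \ldots, y_T)$ given in (23.34) becomes

$$
\begin{aligned}
f(\mathbf{y} | \alpha, \mathbf{X}, \beta)
&= \prod_{t=1}^{T} \left(\frac{\exp(\alpha + \mathbf{x}_t'\beta)}{1 + \exp(\alpha + \mathbf{x}_t'\beta)}\right)^{y_t} \left(\frac{1}{1 + \exp(\alpha + \mathbf{x}_t'\beta)}\right)^{1 - y_t} \\
&= \frac{\exp\left(\sum_t y_t(\alpha + \mathbf{x}_t'\beta)\right)}{\prod_t \left[1 + \exp(\alpha + \mathbf{x}_t'\beta)\right]} \\
&= \frac{\exp\left(\alpha \sum_t y_t\right) \exp\left(\left(\sum_t y_t \mathbf{x}_t'\right)\beta\right)}{\prod_t \left[1 + \exp(\alpha + \mathbf{x}_t'\beta)\right]},
\end{aligned}
\tag{23.42}
$$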

which yields (23.37).

The quantity $\sum_t y_t$ can be shown to be a sufficient statistic for $\alpha$ as follows. Suppose
we have an observation for $\mathbf{y}$ such that $\sum_t y_t = c$. Define the set $B_c = \{\mathbf{d} \,|\, \sum_t d_t = c\}$
to be the set of all possible sequences of 0s and 1s for which the sum of T binary
outcomes is c, and condition on $\sum_t y_t = c$. Then

$$
\begin{aligned}
f\left(\mathbf{y} \,\Big|\, \sum_t y_t = c\right)
&= \frac{\Pr[\mathbf{y}, \sum_t y_t = c]}{\Pr[\sum_t y_t = c]} \\
&= \frac{\Pr[\mathbf{y}]}{\Pr[\sum_t y_t = c]} \\
&= \frac{\Pr[\mathbf{y}]}{\sum_{\mathbf{d} \in B_c} \Pr[\mathbf{d}]} \\
&= \frac{\exp\left(\left(\sum_t y_t \mathbf{x}_t'\right)\beta\right)}{\sum_{\mathbf{d} \in B_c} \exp\left(\left(\sum_t d_t \mathbf{x}_t'\right)\beta\right)},
\end{aligned}
\tag{23.43}
$$

where the first equality uses Bayes' rule, the second equality uses the fact that knowledge
of $\sum_t y_t$ does not add anything given knowledge of $\mathbf{y}$, the third equality uses
the fact that $\Pr[\sum_t y_t = c]$ equals the sum of the probabilities of combinations of 0s
and 1s that sum to c, and the fourth uses the previous definition of $f(\mathbf{y})$ and considerable
simplification that in part relies on $\sum_t y_t = \sum_t d_t$ when we restrict attention to
$\mathbf{d} \in B_c$.
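
The claim that conditioning on the sum removes the individual effect can be verified numerically. The following sketch computes the conditional probability directly from the unconditional logit probabilities, as in the third line of (23.43), for several values of the individual effect; the regressor values, `beta`, and the grid of `alpha` values are hypothetical.

```python
import math
from itertools import product

def joint_prob(y, x, alpha, beta):
    """Unconditional logit probability of sequence y, as in (23.42), scalar regressor."""
    prob = 1.0
    for y_t, x_t in zip(y, x):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x_t)))
        prob *= p if y_t == 1 else 1.0 - p
    return prob

def prob_given_sum(y, x, alpha, beta):
    """Pr[y | sum(y) = c], computed as Pr[y] divided by the sum of Pr[d] over B_c."""
    c = sum(y)
    B_c = [d for d in product([0, 1], repeat=len(y)) if sum(d) == c]
    return joint_prob(y, x, alpha, beta) / sum(joint_prob(d, x, alpha, beta) for d in B_c)

x = (0.3, -1.2, 0.8)   # hypothetical regressor values
y = (1, 0, 1)
# The same conditional probability should result whatever the value of alpha.
values = [prob_given_sum(y, x, alpha, beta=0.5) for alpha in (-2.0, 0.0, 3.0)]
```

The three entries of `values` agree to machine precision, and each matches the closed form in the last line of (23.43).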

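Now consider the dynamic model. Replacing $\mathbf{x}_t'\beta$ in (23.42) by $\gamma y_{t-1}$ yields

$$
\begin{aligned}
f(\mathbf{y})
&= \frac{\exp\left(\alpha \sum_{t=2}^{T} y_t\right) \exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\prod_{t=2}^{T} \left[1 + \exp(\alpha + \gamma y_{t-1})\right]} \\
&= \frac{\exp\left(\alpha \sum_{t=2}^{T} y_t\right) \exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\left[1 + \exp(\alpha)\right]^{\sum_{t=2}^{T}(1 - y_{t-1})} \left[1 + \exp(\alpha + \gamma)\right]^{\sum_{t=2}^{T} y_{t-1}}} \\
&= \frac{\exp\left(\alpha \sum_{t=2}^{T} y_t\right) \exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\left[1 + \exp(\alpha)\right]^{T - 1 - y_1 + y_T - \sum_{t=2}^{T} y_t} \left[1 + \exp(\alpha + \gamma)\right]^{y_1 - y_T + \sum_{t=2}^{T} y_t}},
\end{aligned}
$$

where the second equality uses the fact that $y_{t-1}$ is either 0 or 1 and follows after
some algebra, and the last equality uses $\sum_{t=2}^{T} y_{t-1} = y_1 - y_T + \sum_{t=2}^{T} y_t$. The algebra
is then similar to (23.43) except that in addition to conditioning on $\sum_{t=2}^{T} y_t$ we also
need to condition on $y_1$ and $y_T$ that appear in the denominator. Equivalently, we can
condition on $\sum_{t=1}^{T} y_t$ and $y_1$ and $y_T$. This yields

$$
f\left(\mathbf{y} \,\Big|\, y_1, y_T, \sum_{t=1}^{T} y_t\right) = \frac{\exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\sum_{\mathbf{d} \in C} \exp\left(\sum_{t=2}^{T} \gamma d_{t-1} d_t\right)},
$$

where $C = \{\mathbf{d} \,|\, d_1 = y_1, d_T = y_T, \sum_{t=1}^{T} d_t = \sum_{t=1}^{T} y_t\}$ is the set of all possible sequences
of 0s and 1s for which the sum of the T binary outcomes is $\sum_t y_t$, the first
outcome is $y_1$, and the last outcome is $y_T$.


23.5. Tobit and Selection Models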

We  consider  censoring,  truncation,  or  selection  when  panel  data  are  available,  rather 
than  data  on  a  single  cross-section. 

A pooled analysis simply mirrors analysis in the cross-section case, with the adjustment
that panel-robust standard errors should be computed (see Section 23.2.8). For
example, see Grasdal (2001) who considers selection resulting from panel attrition.

Here we focus instead on panel models with individual-specific effects. Then random
effects models can be estimated, if the strong assumption of a purely random effect
is warranted, the only complication being that of numerical computation. There are
no simple consistent estimators for fixed effects models, however, in the usual
microeconometric setting of a short panel. More complicated semiparametric estimators that
permit fixed effects in Tobit and generalized Tobit models are given in Section 23.8.


23.5.1.  Censored  and  Truncated  Models 


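For cross-section data the censored Tobit model is given in Section 16.3.1. A panel
version with additive individual-specific effect specifies

$$
y_{it}^* = \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it},
\tag{23.44}
$$

where $\varepsilon_{it} \sim \mathcal{N}[0, \sigma_\varepsilon^2]$, and we observe $y_{it} = y_{it}^*$ if $y_{it}^* > 0$ and $y_{it} = 0$ or is observed
to be missing if $y_{it}^* \le 0$. The joint density for the ith observation $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$
can be written as

$$
f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \sigma_\varepsilon^2) = \prod_{t=1}^{T} \left[\frac{1}{\sigma_\varepsilon} \phi_{it}\right]^{d_{it}} \left[1 - \Phi_{it}\right]^{1 - d_{it}},
\tag{23.45}
$$

where $d_{it} = 1$ if $y_{it}^* > 0$ and $d_{it} = 0$ otherwise, $\phi_{it} = \phi((y_{it} - \alpha_i - \mathbf{x}_{it}'\beta)/\sigma_\varepsilon)$,
$\Phi_{it} = \Phi((\alpha_i + \mathbf{x}_{it}'\beta)/\sigma_\varepsilon)$, and $\phi(\cdot)$ and $\Phi(\cdot)$ denote,
respectively, the standard normal pdf and cdf.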

The fixed effects MLE maximizes the log-likelihood based on (23.45) with respect
to $\beta$, $\sigma_\varepsilon^2$, and $\alpha_1, \ldots, \alpha_N$. In short panels the resulting estimator of $\beta$ is inconsistent
because of the incidental parameters problem, and there is no simple differencing or
conditioning method that can provide a consistent estimator. Heckman and MaCurdy
(1980)  applied  the  fixed  effects  MLE  to  female  labor  supply.  Although  recognizing  the 
inconsistency  of  the  estimator,  they  argued  that  with  T  =  8  the  inconsistency  may  not 
be  too  great.  Greene  (2004a)  provides  a  recent  Monte  Carlo  study  for  the  fixed  effects 
Tobit  MLE. 

Random effects estimation is more commonly used because of inconsistency of the
fixed effects estimator. Under the assumption that $\alpha_i \sim \mathcal{N}[0, \sigma_\alpha^2]$ the random effects
MLE of $\beta$, $\sigma_\varepsilon^2$, and $\sigma_\alpha^2$ maximizes the log-likelihood $\sum_{i=1}^{N} \ln f(\mathbf{y}_i | \mathbf{X}_i, \beta, \sigma_\varepsilon^2, \sigma_\alpha^2)$,
where

$$
f(\mathbf{y}_i | \mathbf{X}_i, \beta, \sigma_\varepsilon^2, \sigma_\alpha^2) = \int f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \sigma_\varepsilon^2) \frac{1}{\sqrt{2\pi\sigma_\alpha^2}} \exp\left(\frac{-\alpha_i^2}{2\sigma_\alpha^2}\right) d\alpha_i,
\tag{23.46}
$$

for $f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \sigma_\varepsilon^2)$ given in (23.45). This one-dimensional integral can be computed
using Gaussian quadrature.

This approach can be extended to other models with censoring or truncation. For
example, a right-censored version of the Poisson random effects model in Section
23.7.2 may be used if, say, counts above 10 are recorded only as 10 or more.

There are two weaknesses to the fully parametric approach. First, as in the cross-section
case, reliance on distributional assumptions becomes much greater when there
is censoring or truncation. Second, the assumption of purely random effects independent
of regressors may be too strong.


23.5.2.  Selection  Models 

Selection models can arise in panel data for the same reasons as in the cross-section
case (see Section 16.5). A generalization of the Tobit type 2 model in Section 16.5.1
to a linear panel model with individual-specific effects $\alpha_i$ and $\delta_i$ is

$$
\begin{aligned}
y_{it}^* &= \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it}, \\
d_{it}^* &= \delta_i + \mathbf{z}_{it}'\gamma + v_{it},
\end{aligned}
\tag{23.47}
$$

where $y_{it} = y_{it}^*$ is observed if $d_{it}^* > 0$ and $y_{it}$ is not observed otherwise.

For the random effects formulation the four unobservables are assumed to be normally
distributed. Hausman and Wise (1979) proposed ML estimation, which involves
a bivariate integral as $\alpha_i$ may be correlated with $\delta_i$ and $\varepsilon_{it}$ may be correlated with $v_{it}$.

The fixed effects estimator is inconsistent in short panels. Note, however, that if
$d_{it}^* = \delta_i$, so that selection is due only to time-invariant characteristics of the individual,
which may be observed or unobserved, then the fixed effects estimator in the model
$y_{it} = \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it}$ is consistent. A fixed effects panel model controls for sample
selection, to the extent that it depends on time-invariant characteristics.

Verbeek and Nijman (1992) provide a more detailed discussion of the essential assumptions
needed for consistent estimation in these models and propose tests for selectivity
bias. Wooldridge (1995) provides a similar analysis under weaker assumptions
and presents assumptions that may not be too restrictive in some applications that permit
consistent estimation of the fixed effects model. Vella (1998) provides a review
and additional references.

The methods for sample selection can be extended to panel attrition (see Section
21.8.5) that leads to attrition bias if observations on the dependent variable are
lost in a nonrandom manner. Then all data for the itth observation are not observed
if $d_{it}^* \le 0$, so $\mathbf{z}_{it}$ in (23.47) needs to be replaced by variables observed in periods
other than period t. An early example is Hausman and Wise (1979), and a more recent
application is Grasdal (2001). Baltagi (2001) and Hsiao (2003) provide further
references.


23.6.  Transition  Data 

For concreteness consider panel data on welfare spells. Great interest lies in measuring
individual persistence in welfare spells, and determining the extent to which this is due
to true state dependence rather than differences in individual propensities to be on welfare.
Since individual propensities may depend in part on unobservables, models with
individual-specific effects should be used. For duration data there exists an unusually
wide range of modeling approaches, as several types of panel data on transitions are
possible. Here we focus on fixed effects models.

Data  may  be  available  on  whether  or  not  an  individual  is  in  a  state  at  several  points 
in  time,  such  as  on  welfare.  Then  one  can  use  a  binary  panel  model  (see  Section  23.4), 
such  as  the  dynamic  fixed  effects  logit  model. 

Richer data provide information on the durations of several individual spells. A
natural starting point is then a panel proportional hazards model

$$
\lambda(t_{ij} | \mathbf{x}_{ij}) = \lambda_j(t_{ij}, \gamma_j) \exp(\mathbf{x}_{ij}'\beta)\alpha_i,
\tag{23.48}
$$

where $t_{ij}$ is the completed spell duration for the jth spell of the ith individual and $\alpha_i$ is
an individual-specific effect. This is the mixed proportional hazards model, discussed
for single-spell data in Chapter 18. The conditions for nonparametric identification of
the MPH model with only single-spell data (see Section 18.3) include the assumption
that $\alpha_i$ are distributed independently of the regressors. This rules out fixed effects.
Once multiple spells become available, however, Honoré (1992) showed that $\alpha_i$ can
be a fixed effect if $\mathbf{x}_{ij}$ is constant over j (see Section 19.4.1). For further discussion of
the model (23.48), including a dynamic duration model with hazard function for the
second spell dependent on the duration of the first spell, see Section 19.4.1.

Chamberlain (1985) presented several approaches for elimination of $\alpha_i$ in various
panel duration models. For the MPH model, with baseline hazard $\lambda_j(\cdot)$ the same across
spells j, the probability that the second spell is longer than the first spell does not
depend on $\alpha_i$. Conditional ML can be applied to the gamma duration model, since the
gamma is an LEF density. For Weibull, gamma and log-normal models the density of
$t_{i1}/t_{i2}$ does not depend on $\alpha_i$.

For  more  recent  references  and  a  detailed  discussion,  including  sensitivity  of 
multiple-spell  data  to  censoring,  see  Van  den  Berg  (2001). 


23.7.  Count  Data 

Hausman et al. (1984) presented estimable fixed effects and random effects models for
both panel Poisson and panel negative binomial models. More recent work has emphasized
fixed effects in multiplicative effects models, permitting estimation of static and
dynamic models under relatively weak distributional assumptions.


23.7.1.  Individual-Specific  Effects  Count  Models 

We  focus  on  the  Poisson  model,  detailed  for  cross-section  data  in  Section  20.2,  though 
panel  versions  of  negative  binomial  are  also  briefly  considered. 

The Poisson individual-specific effects model specifies that $y_{it} \sim \mathcal{P}[\alpha_i \exp(\mathbf{x}_{it}'\beta)]$.
Then, assuming conditional independence, the joint density for the ith observation
$\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$ is

$$
f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta) = \prod_{t=1}^{T} \exp[-\alpha_i \exp(\mathbf{x}_{it}'\beta)] \, [\alpha_i \exp(\mathbf{x}_{it}'\beta)]^{y_{it}} / y_{it}!.
\tag{23.49}
$$

A less parametric approach simply models the conditional mean as

$$
E[y_{it} | \alpha_i, \mathbf{x}_{it}] = \alpha_i \exp(\mathbf{x}_{it}'\beta) = \exp(\gamma_i + \mathbf{x}_{it}'\beta).
\tag{23.50}
$$

This is both a single-index individual-specific effects model and a multiplicative effects
model. Since it is a multiplicative effects model the individual effects $\alpha_i$ can
be eliminated by mean differencing or first differencing. Note that the Poisson panel
model (23.49) has conditional mean (23.50).


23.7.2.  Random  Effects  Count  Models 

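Assuming gamma-distributed random effects leads to a tractable solution for the
marginal density of the random effects model. Assume $\alpha_i$ is $\mathcal{G}[\eta, \eta]$ distributed with
mean 1, variance $1/\eta$, and density $g(\alpha_i | \eta) = \eta^{\eta} \alpha_i^{\eta - 1} e^{-\alpha_i \eta} / \Gamma(\eta)$. Then (23.18) for the
Poisson model (23.49) becomes

$$
f(\mathbf{y}_i | \mathbf{X}_i, \beta, \eta) = \left[\prod_t \frac{\lambda_{it}^{y_{it}}}{y_{it}!}\right] \times \left(\frac{\eta}{\sum_t \lambda_{it} + \eta}\right)^{\eta} \times \left(\sum_t \lambda_{it} + \eta\right)^{-\sum_t y_{it}} \frac{\Gamma\left(\sum_t y_{it} + \eta\right)}{\Gamma(\eta)},
\tag{23.51}
$$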

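where $\lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$ and derivations are given in Section 23.7.5. The resulting first-order
conditions for the Poisson random effects estimator $\hat{\beta}$ can be expressed as

$$
\sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it} \left(y_{it} - \lambda_{it} \frac{\bar{y}_i + \eta/T}{\bar{\lambda}_i + \eta/T}\right) = \mathbf{0},
\tag{23.52}
$$

where $\bar{\lambda}_i = T^{-1} \sum_t \exp(\mathbf{x}_{it}'\beta)$ and $\bar{y}_i = T^{-1} \sum_t y_{it}$.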

The term on the left-hand side of (23.52) has expected value zero if the mean conditional
on regressors in all periods $E[y_{it} | \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = \alpha_i \exp(\mathbf{x}_{it}'\beta)$. So despite
all the parametric assumptions made, the Poisson random effects estimator is consistent
for $\beta$ under the relatively weak assumption that the conditional mean is that
given in (23.50) and that regressors are strongly exogenous. For the density (23.51),
$E[y_{it} | \mathbf{x}_{it}] = \lambda_{it}$ and $V[y_{it} | \mathbf{x}_{it}] = \lambda_{it} + \lambda_{it}^2/\eta$, so that overdispersion is of the NB2 form.
A sandwich estimate of the variance matrix will permit more flexible models of
overdispersion and conditional correlation. The first-order conditions for $\eta$ (not given)
are quite complicated though the information matrix is block diagonal in $\beta$ and $\eta$.

Several alternative estimators are available given random effects. First, the pooled
Poisson estimator ignores the random effects and assumes $y_{it} | \mathbf{x}_{it} \sim \mathcal{P}[\exp(\mathbf{x}_{it}'\beta)]$. This
has first-order conditions

$$
\sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it} (y_{it} - \lambda_{it}) = \mathbf{0},
\tag{23.53}
$$

where $\lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$. This estimator is consistent if the conditional mean is (23.50)
with $E[\alpha_i | \mathbf{x}_{it}] = 1$. Thus the usual cross-section Poisson MLE is consistent if the true
model is one with multiplicative random effects. However, as illustrated in the Section
23.3 example, panel-robust standard errors should be used. Here (23.26) yields


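$$
\hat{V}[\hat{\beta}_{\text{pool}}] = \left[\sum_{i,t} \hat{\lambda}_{it} \mathbf{x}_{it} \mathbf{x}_{it}'\right]^{-1} \left[\sum_{i,t,s} \hat{u}_{it} \hat{u}_{is} \mathbf{x}_{it} \mathbf{x}_{is}'\right] \left[\sum_{i,t} \hat{\lambda}_{it} \mathbf{x}_{it} \mathbf{x}_{it}'\right]^{-1},
\tag{23.54}
$$

where $\hat{\lambda}_{it} = \exp(\mathbf{x}_{it}'\hat{\beta})$, $\hat{u}_{it} = y_{it} - \hat{\lambda}_{it}$, $\sum_{i,t}$ denotes $\sum_{i=1}^{N} \sum_{t=1}^{T}$, and $\sum_{i,t,s}$ denotes
$\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s=1}^{T}$. An alternative pooled estimator based on (23.50) is NLS, in which
case (23.53) becomes $\sum_i \sum_t \lambda_{it} \mathbf{x}_{it} (y_{it} - \lambda_{it}) = \mathbf{0}$.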
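
The sandwich form in (23.54) is simple to compute once residuals are summed within each individual. The sketch below is a minimal illustration for a single scalar regressor, with a made-up data set and a made-up value of the coefficient standing in for the pooled estimate; the function name and data are assumptions, not part of the original example.

```python
import math

def pooled_poisson_variances(beta, y, x):
    """Default and panel-robust variance of the pooled Poisson estimator, scalar regressor.

    y and x are lists of per-individual lists; beta plays the role of the pooled estimate.
    """
    A = 0.0   # sum over i,t of lam_it * x_it^2
    B = 0.0   # sum over i,t,s of u_it * u_is * x_it * x_is
    for y_i, x_i in zip(y, x):
        lam = [math.exp(beta * x_it) for x_it in x_i]
        A += sum(l * x_it ** 2 for l, x_it in zip(lam, x_i))
        cluster_score = sum((y_it - l) * x_it for y_it, l, x_it in zip(y_i, lam, x_i))
        B += cluster_score ** 2   # equals the double sum over t and s within individual i
    return 1.0 / A, B / (A * A)

# Hypothetical panel with N = 3 individuals and T = 3 periods.
y = [[1, 3, 2], [0, 1, 0], [4, 1, 6]]
x = [[0.0, 1.0, 0.5], [0.2, 0.9, 0.4], [1.0, 0.0, 1.5]]
default_var, robust_var = pooled_poisson_variances(0.3, y, x)
```

Squaring the within-individual sum of scores is what lets the middle term pick up both overdispersion and correlation over t for a given i, which the default variance `1/A` ignores.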

Second, more efficient pooled estimation may be possible using the GEE approach
of Section 23.2.8, which introduces conditional correlation. The general result (23.30)
for $g_{it} = \lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$ becomes

$$
\sum_{i=1}^{N} \mathbf{Z}_i' \Sigma_i^{-1} (\mathbf{y}_i - \lambda_i) = \mathbf{0},
\tag{23.55}
$$

where $\mathbf{Z}_i$ is a $T \times K$ matrix with tth row $\lambda_{it} \mathbf{x}_{it}'$, and $\lambda_i$ is a $T \times 1$ vector
with tth entry $\lambda_{it}$. Several different working variance matrices $\Sigma_i$ for $V[\mathbf{y}_i | \mathbf{X}_i]$ are
possible. The choice $\Sigma_i = \text{Diag}[\lambda_{it}]$ yields the pooled Poisson estimating equations in
(23.53). Letting $\Sigma_{i,tt} = \lambda_{it}$ and $\Sigma_{i,ts} = \phi\sqrt{\lambda_{it}\lambda_{is}}$ for $s \ne t$ permits correlation
over t that is equicorrelated or exchangeable since the correlation is a constant $\phi$
for $s \ne t$.
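
To make the equicorrelated working matrix concrete, the sketch below builds a matrix with $\lambda_{it}$ on the diagonal and $\phi\sqrt{\lambda_{it}\lambda_{is}}$ off the diagonal, then recovers the implied correlation. The function names, the $\lambda_{it}$ values, and the value of $\phi$ are hypothetical.

```python
import math

def working_matrix(lam, phi):
    """Equicorrelated working matrix: lam[t] on the diagonal,
    phi * sqrt(lam[t] * lam[s]) off the diagonal."""
    T = len(lam)
    return [[lam[t] if t == s else phi * math.sqrt(lam[t] * lam[s]) for s in range(T)]
            for t in range(T)]

def implied_correlation(sigma, t, s):
    """Correlation implied by a variance matrix for periods t and s."""
    return sigma[t][s] / math.sqrt(sigma[t][t] * sigma[s][s])

lam_i = [math.exp(b) for b in (0.1, 0.4, -0.2)]   # hypothetical lambda_it values
Sigma_i = working_matrix(lam_i, phi=0.3)
corr_12 = implied_correlation(Sigma_i, 0, 1)      # equals phi
```

Setting `phi=0` returns the diagonal matrix that reproduces the pooled Poisson estimating equations.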

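Third, more efficient pooled estimation may be possible using ML with the negative
binomial rather than the Poisson as the starting point. Suppose $y_{it}$ is iid negative
binomial with NB2 variance function with parameters $\alpha_i \lambda_{it}$ and $\phi_i$ (see Section
20.4.1), implying $y_{it}$ has mean $\alpha_i \lambda_{it}/\phi_i$ and variance $(\alpha_i \lambda_{it}/\phi_i) \times (1 + \alpha_i/\phi_i)$.
If $(1 + \alpha_i/\phi_i)^{-1}$ is a beta-distributed random variable with parameters $(\eta_1, \eta_2)$,
then after some considerable algebra (23.18) reduces to

$$
f(\mathbf{y}_i | \mathbf{X}_i, \beta, \eta_1, \eta_2) = \left(\prod_t \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it}) \Gamma(y_{it} + 1)}\right) \times \frac{\Gamma(\eta_1 + \eta_2) \, \Gamma\left(\eta_1 + \sum_t \lambda_{it}\right) \Gamma\left(\eta_2 + \sum_t y_{it}\right)}{\Gamma(\eta_1) \, \Gamma(\eta_2) \, \Gamma\left(\eta_1 + \eta_2 + \sum_t \lambda_{it} + \sum_t y_{it}\right)},
\tag{23.56}
$$

where $\lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$. This is the basis for ML estimation of $\beta$, $\eta_1$, and $\eta_2$. This
model relies on stronger assumptions than does the Poisson random effects model.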

Fourth, analysis need not be restricted to parametric models with closed-form solutions
for $f(\mathbf{y}_i | \mathbf{X}_i, \beta, \eta)$. Crépon and Duguet (1997a) use maximum simulated likelihood
methods to estimate hurdle and zero-inflated panel count models with joint
normal random effects.


23.7.3. Fixed Effects Count Models

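The fixed effects estimator for the Poisson panel model (23.50) can be derived in several
ways.

First, the Poisson MLE simultaneously estimates $\beta$ and $\alpha_1, \ldots, \alpha_N$. The log-likelihood
based on (23.49) is

$$
\begin{aligned}
\ln L(\beta, \alpha)
&= \ln\left[\prod_i \prod_t \left\{\exp(-\alpha_i \lambda_{it}) (\alpha_i \lambda_{it})^{y_{it}} / y_{it}!\right\}\right] \\
&= \sum_i \left[-\alpha_i \sum_t \lambda_{it} + \ln \alpha_i \sum_t y_{it} + \sum_t y_{it} \ln \lambda_{it} - \sum_t \ln y_{it}!\right],
\end{aligned}
\tag{23.57}
$$

where $\lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$. Differentiating with respect to $\alpha_i$ and setting to zero yields
$\hat{\alpha}_i = \sum_t y_{it} / \sum_t \lambda_{it}$. Substituting this back into (23.57) yields the concentrated likelihood
function. Dropping terms not involving $\beta$, we get

$$
\ln L_{\text{conc}}(\beta) \propto \sum_i \sum_t \left[y_{it} \ln \lambda_{it} - y_{it} \ln\left(\sum_s \lambda_{is}\right)\right].
\tag{23.58}
$$

It follows that for the Poisson fixed effects model there is no incidental parameters
problem. Consistent estimates of $\beta$ for fixed T and $N \to \infty$ can be obtained by maximization
of $\ln L_{\text{conc}}(\beta)$ in (23.58). Differentiation of (23.58) with respect to $\beta$ yields
first-order conditions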


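$$
\sum_i \sum_t \left[y_{it} \mathbf{x}_{it} - y_{it} \frac{\sum_s \lambda_{is} \mathbf{x}_{is}}{\sum_s \lambda_{is}}\right] = \mathbf{0},
$$

which can be reexpressed as

$$
\sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it} \left(y_{it} - \frac{\lambda_{it}}{\bar{\lambda}_i} \bar{y}_i\right) = \mathbf{0},
\tag{23.59}
$$

where $\lambda_{it} = \exp(\mathbf{x}_{it}'\beta)$ and $\bar{\lambda}_i = T^{-1} \sum_t \exp(\mathbf{x}_{it}'\beta)$; see Blundell, Griffith, and
Windmeijer (1995). The Poisson panel model (23.49) and the linear panel model of
Section 21.6 are unusual in that simultaneous estimation of $\beta$ and $\alpha$ provides consistent
estimates of $\beta$ in short panels, so there is no incidental parameters problem.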

Second, the conditional MLE eliminates the fixed effects by conditioning on a sufficient
statistic for $\alpha_i$. For the Poisson panel model this is $\sum_t y_{it}$. Some algebra given in
Section 23.7.5 shows that this leads to a conditional log-likelihood function that is proportional
to the concentrated log-likelihood function given in (23.58). It follows that
the conditional ML estimator for $\beta$ in the fixed effects Poisson model solves (23.59).
This was the original derivation of the Poisson fixed effects estimator of $\beta$ by Palmgren
(1981) and Hausman et al. (1984).

Third, the mean-differenced transformation (23.14) for the multiplicative effects
model (23.50) implies that $E[y_{it} - (\lambda_{it}/\bar{\lambda}_i)\bar{y}_i \,|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0$, and hence

$$
E[\mathbf{x}_{it}(y_{it} - (\lambda_{it}/\bar{\lambda}_i)\bar{y}_i)] = \mathbf{0}.
\tag{23.60}
$$

Using the corresponding sample moment conditions leads to an estimator $\hat{\beta}$ that solves
(23.59).

The same estimator has been obtained in three different ways. The third derivation
makes it clear that the essential assumption for the consistency of the Poisson
fixed effects estimator is that regressors are strongly exogenous and (23.50) is
correctly specified. Inference should be based on panel-robust standard errors. In
particular, if the usual default ML or conditional ML output is used, following the first
two derivations, standard errors may be considerably understated owing to failure
to control for overdispersion in the count data. The fixed effects estimator leads to
some loss of data, as observations $i$ with $\bar{y}_i = 0$ do not contribute to the sum
in (23.59).

Consistent estimation of $\beta$ in the presence of fixed effects is also possible for a
particular parameterization of the negative binomial model. Hausman et al. (1984)
assumed that $y_{it}$ is iid NB1 with parameters $\alpha_i \lambda_{it}$ and $\phi_i$, where $\lambda_{it} = \exp(x_{it}'\beta)$, so
$y_{it}$ has mean $\alpha_i \lambda_{it}/\phi_i$ and variance $(\alpha_i \lambda_{it}/\phi_i) \times (1 + \alpha_i/\phi_i)$. The parameters $\alpha_i$ and
$\phi_i$ can only be identified up to the ratio $\alpha_i/\phi_i$, and this ratio drops out of the
conditional joint density for the $i$th observation, which after considerable algebra can be
shown to be


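$$f\Big(y_{i1}, \ldots, y_{iT} \,\Big|\, \sum_t y_{it}\Big) = \left( \prod_t \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it} + 1)} \right) \times \frac{\Gamma(\sum_t \lambda_{it})\,\Gamma(\sum_t y_{it} + 1)}{\Gamma(\sum_t \lambda_{it} + \sum_t y_{it})}. \tag{23.61}$$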


This distribution for integer $\lambda_{it}$ is the negative hypergeometric distribution. The
conditional ML negative binomial fixed effects estimator of $\beta$ maximizes the
log-likelihood function based on (23.61). The Poisson fixed effects estimator is more
commonly used since it is consistent under much weaker distributional assumptions.
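
A quick numerical check can make the conditional density (23.61) more concrete. The sketch below is hypothetical code, not taken from Hausman et al. (1984). It evaluates the log of (23.61) with `math.lgamma` and sums the density over every count vector with a given total. The function names and the illustrative values of $\lambda_{it}$ are assumptions.

```python
import math
from itertools import product

def nb_fe_conditional_logpmf(y, lam):
    """Log of (23.61): density of y_1..y_T given their sum, with lambda_t = lam[t]."""
    n = sum(y)
    out = 0.0
    for y_t, lam_t in zip(y, lam):
        out += math.lgamma(lam_t + y_t) - math.lgamma(lam_t) - math.lgamma(y_t + 1)
    out += math.lgamma(sum(lam)) + math.lgamma(n + 1) - math.lgamma(sum(lam) + n)
    return out

def total_probability(n, lam):
    """Sum the conditional density over all count vectors whose total is n."""
    periods = len(lam)
    return sum(
        math.exp(nb_fe_conditional_logpmf(y, lam))
        for y in product(range(n + 1), repeat=periods)
        if sum(y) == n
    )
```

The density involves only the $\lambda_{it}$, which is the sense in which the individual-specific parameters have been conditioned out. Summing the log-density over individuals would give the objective function for the conditional ML estimator.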


23.7.4.  Dynamic  Count  Models 

There are several ways to bring dynamics into a count data model. Pure time series
models are surveyed in Cameron and Trivedi (1998). For simplicity consider
inclusion of one lagged dependent variable. The obvious model is
$E[y_t \mid y_{t-1}, x_t] = \exp(\gamma y_{t-1} + x_t'\beta)$, but this can lead to explosive behavior as a
result of exponentiation of $y_{t-1}$. A more stable model may be obtained by instead using
$\exp(\gamma \ln y_{t-1} + x_t'\beta)$, but this then runs into problems when $y_{t-1} = 0$. For this reason an
appealing model is the linear feedback model $E[y_t \mid y_{t-1}, x_t] = \gamma y_{t-1} + \exp(x_t'\beta)$. The
Poisson integer-valued AR(1) model has this property and in the pure time series case has
correlation function $\mathrm{Cor}[y_t, y_{t-k}] = \gamma^k$, similar to that for the AR(1) model (see Al-Osh and
Alzaid, 1987).
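
The contrast between exponential and linear feedback can be seen in a rough deterministic sketch. The code below replaces $y_{t-1}$ by its previous conditional mean. This is a simplifying assumption that ignores the randomness in the counts, so it only illustrates the direction of the effect. The values of $\gamma$ and $x_t'\beta$ are illustrative and `mean_path` is a hypothetical helper.

```python
import math

def mean_path(step, m0, periods, cap=1e3):
    """Iterate m_t = step(m_{t-1}), stopping early once the path exceeds cap."""
    path = [m0]
    for _ in range(periods):
        nxt = step(path[-1])
        path.append(nxt)
        if nxt > cap:
            break   # treat the path as having exploded
    return path

gamma, xb = 0.5, 0.0   # illustrative values, not from the text

# Exponential feedback: m_t = exp(gamma * m_{t-1} + xb)
exp_feedback = mean_path(lambda m: math.exp(gamma * m + xb), 1.0, 50)

# Linear feedback: m_t = gamma * m_{t-1} + exp(xb)
lin_feedback = mean_path(lambda m: gamma * m + math.exp(xb), 1.0, 50)
```

With these values the exponential-feedback recursion has no fixed point and passes the cap within a few periods, whereas the linear-feedback recursion settles at $\exp(x'\beta)/(1 - \gamma)$.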

Thus Blundell, Griffith, and Windmeijer (1995, 2002) considered the dynamic
fixed effects panel data model with

$$E[y_{it} \mid \alpha_i, y_{i,t-1}, x_{it}] = \gamma y_{i,t-1} + \alpha_i \exp(x_{it}'\beta).$$

Applying the first-difference transformation (23.15) leads to conditional moment
restrictions

$$E\left[ \frac{\exp(x_{i,t-1}'\beta)}{\exp(x_{it}'\beta)} (y_{it} - \gamma y_{i,t-1}) - (y_{i,t-1} - \gamma y_{i,t-2}) \,\Big|\, y_{i1}, \ldots, y_{i,t-2}, x_{i1}, \ldots, x_{i,t-1} \right] = 0.$$


These lead to many unconditional moment conditions (see Section 22.5.3 for a similar
discussion for the linear model) that can supply the basis for GMM estimation as in
Section 23.2.6. Crépon and Duguet (1997b), Montalvo (1997), and Blundell, Griffith,
and Van Reenen (1995, 1999) use similar quasi-differencing methods, with application
to the Patents-R&D relationship.
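
To see why quasi-differencing removes the fixed effect, a minimal sketch for one individual and a scalar regressor follows. The function name and the list-based data layout are hypothetical. The function computes the quantity inside the conditional expectation of the moment restriction, which is exactly zero when the data follow the conditional mean without noise, whatever the value of $\alpha_i$.

```python
import math

def quasi_diff_residual(y, x, beta, gamma, t):
    """Quasi-differenced residual for period t >= 2 of one individual.

    y and x are lists indexed by period; the regressor is a scalar.
    """
    lam_t = math.exp(beta * x[t])
    lam_prev = math.exp(beta * x[t - 1])
    return (lam_prev / lam_t) * (y[t] - gamma * y[t - 1]) - (y[t - 1] - gamma * y[t - 2])
```

In a GMM application these residuals would be interacted with instruments dated $t-2$ and earlier for $y$ and $t-1$ and earlier for $x$. That step is not shown here.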

Böckenholt (1999) uses a more parametric model, estimating a Poisson integer-valued
AR(1) model with unobserved heterogeneity modeled using a finite mixture
distribution (see Section 18.5).


23.7.5.  Derivations  for  Random  and  Fixed  Effects  Poisson 


First, consider a random effects Poisson model with gamma-distributed random effects.
For simplicity suppress the subscript $i$ and let $\lambda_t = \exp(x_t'\beta)$. The general
formula (23.18) for the Poisson model (23.49) and random effects density $g(\alpha \mid \eta)$ yields

$$f(y_1, \ldots, y_T \mid x_1, \ldots, x_T) = \int_0^\infty \left[ \prod_t \frac{e^{-\alpha\lambda_t} (\alpha\lambda_t)^{y_t}}{y_t!} \right] g(\alpha \mid \eta)\, d\alpha.$$

For $g(\alpha \mid \eta) = \eta^{\eta} \alpha^{\eta-1} e^{-\alpha\eta}/\Gamma(\eta)$ similar algebra to that in Section 20.4.1 yields the
density given in (23.51).

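Second, derive the conditional density for the Poisson fixed effects model for
observations in all time periods for a given individual, where for simplicity the individual
subscript $i$ is dropped. In general the density of $y_1, \ldots, y_T$ given $\sum_t y_t$ is

$$\begin{aligned}
f\Big(y_1, \ldots, y_T \,\Big|\, \sum_t y_t\Big) &= f\Big(y_1, \ldots, y_T, \sum_t y_t\Big) \Big/ f\Big(\sum_t y_t\Big) \\
&= f(y_1, \ldots, y_T) \Big/ f\Big(\sum_t y_t\Big) \\
&= \prod_t \frac{e^{-\mu_t} \mu_t^{y_t}}{y_t!} \Bigg/ \frac{e^{-\sum_t \mu_t} (\sum_t \mu_t)^{\sum_t y_t}}{(\sum_t y_t)!} \\
&= \frac{\prod_t \mu_t^{y_t}}{\prod_t y_t!} \Bigg/ \frac{(\sum_t \mu_t)^{\sum_t y_t}}{(\sum_t y_t)!} \\
&= \frac{(\sum_t y_t)!}{\prod_t y_t!} \prod_t \left( \frac{\mu_t}{\sum_s \mu_s} \right)^{y_t},
\end{aligned}$$

where the second equality arises because knowledge of $\sum_t y_t$ adds nothing given
knowledge of $y_1, \ldots, y_T$, the third equality specializes to $y_t$ independent Poisson with
mean $\mu_t$ and hence $\sum_t y_t$ is Poisson with mean $\sum_t \mu_t$, and the fourth and fifth
equalities simplify. The conditional density is that of the multinomial for $\sum_t y_t$ trials
where the $t$th of $T$ distinct outcomes occurs in any trial with probability $\mu_t/\sum_s \mu_s$.
Setting $\mu_{it} = \alpha_i \exp(x_{it}'\beta)$ and taking logarithms yields a conditional log-likelihood
that is proportional to the concentrated log-likelihood given in (23.58).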
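
The multinomial result can be checked numerically. The following sketch uses hypothetical helper names and illustrative values. It computes the conditional probability as a ratio of Poisson probabilities and compares it with the multinomial formula. Scaling every mean by a common factor, which plays the role of $\alpha_i$, leaves the answer unchanged.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def conditional_via_ratio(y, mu):
    """f(y_1..y_T) / f(sum_t y_t) for independent Poisson counts with means mu[t]."""
    joint = math.prod(poisson_pmf(k, m) for k, m in zip(y, mu))
    return joint / poisson_pmf(sum(y), sum(mu))

def multinomial_pmf(y, mu):
    """Multinomial with sum(y) trials and cell probabilities mu[t] / sum(mu)."""
    coef = math.factorial(sum(y))
    for k in y:
        coef //= math.factorial(k)   # exact integer division at each step
    total = sum(mu)
    return coef * math.prod((m / total) ** k for k, m in zip(y, mu))
```

Because only the ratios $\mu_t/\sum_s \mu_s$ enter, the multiplicative fixed effect cancels, which is why conditioning on $\sum_t y_t$ eliminates it.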


23.8.  Semiparametric  Estimation 

The semiparametric literature for panel data has emphasized models for limited
dependent variables since, as for cross-section data, parametric assumptions become
much  more  important  when  truncation,  censoring,  or  selection  are  present.  Attention 
focuses  on  models  with  fixed  effects.  We  provide  a  brief  summary. 

For binary data Manski (1987) extended his maximum score estimator from
cross-section models to the panel model with fixed effects given in (23.33) where now the
function $F(\cdot)$ is no longer specified. Although this estimator is consistent it converges
at rate less than $\sqrt{N}$ and is not asymptotically normal.

For the Tobit model Honoré (1992) extended the censored LAD approach of Powell
(1986a) to the panel fixed effects model (23.45) where the distribution of the error
term $\varepsilon_{it}$ is unspecified. The data are artificially trimmed so that the fixed effect is
subsequently eliminated by appropriate differencing. The estimator is $\sqrt{N}$-consistent
and asymptotically normal.

For  panel  data  with  sample  selection  Kyriazidou  (1997)  considered  the  fixed  effects 
version of the type 2 Tobit model (23.47) where the distribution of the errors $\varepsilon_{it}$ and
$v_{it}$ is unspecified. She presented a Heckman-type two-step estimator. A smoothed
version of the maximum score estimator of Manski (1987) eliminates the fixed effect in
the  selection  equation,  although  a  quite  complicated  differencing  procedure  is  used  in 
the  second  stage  to  eliminate  the  fixed  effect  in  the  outcome  equation.  This  approach 
can  be  generalized  to  other  generalized  Tobit  models.  Charlier,  Melenberg,  and  van 
Soest  (2001)  provide  an  application  to  a  panel  version  of  the  Roy  model  or  type  5 
Tobit  model. 

Censoring  is  common  in  duration  models.  Section  23.6  focused  on  panel  models 
with  completed  spells.  When  both  complete  and  incomplete  spells  are  observed  for 
an individual, partial likelihood methods are inappropriate, since censoring is not
independent given presence of the time-invariant fixed effect. Horowitz and Lee (2004)
propose  a  consistent  estimator  for  the  MPH  model  (23.43)  with  incomplete  spells  that 
does  not  require  specification  of  the  baseline  hazard. 


23.9.  Practical  Considerations 

As was the case for linear models, if panel data are used then at a minimum
inference needs to be based on panel-robust standard errors. These are not provided by a
computer program for cross-section data unless it has an option for clustered standard
errors, in which case clustering is specified to be by the individual.
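
As a minimal sketch of what clustering by the individual does, consider the simplest possible case, the variance of a sample mean. The code below is an illustration under stated assumptions, not a description of any package's implementation. It uses no degrees-of-freedom correction, and the function name is hypothetical.

```python
def mean_variance_estimates(z, cluster):
    """Return (iid, cluster-robust) variance estimates for the sample mean of z.

    cluster[j] identifies the individual to which observation j belongs.
    Neither estimate applies a degrees-of-freedom correction.
    """
    n = len(z)
    z_bar = sum(z) / n
    resid = [v - z_bar for v in z]
    v_iid = sum(r * r for r in resid) / n ** 2
    sums = {}
    for r, g in zip(resid, cluster):
        sums[g] = sums.get(g, 0.0) + r   # within-cluster sum of residuals
    v_cluster = sum(s * s for s in sums.values()) / n ** 2
    return v_iid, v_cluster
```

When observations for the same individual are positively correlated, the within-cluster sums of residuals are large and the cluster-robust estimate exceeds the iid estimate. The same idea carries over to regression estimators through the sandwich form of the variance matrix.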

More efficient estimation is available using models that incorporate serial
correlation. Econometricians emphasize random effects. Several packages fit models with
normally  distributed  random  effects,  using  Gaussian  quadrature  to  integrate  out  the 
effect,  as  well  as  the  more  specialized  analytically  tractable  random  effects  count  data 
models.  Statisticians  instead  emphasize  the  GEE  approach  for  GLMs,  available  in 
many  statistical  packages  and  some  econometrics  packages. 

The preceding methods lead to inconsistent estimation if the random effect is
correlated with regressors. Econometricians therefore emphasize the fixed effects
approach. Because of the incidental parameters problem, this yields consistent estimates
in short panels for only a subset of nonlinear models. Econometrics packages are
available for conditional ML estimation of these models, the fixed effects logit and fixed
effects  count  models.  If  a  fixed  effects  model  is  infeasible  then  random  effects  models 
richer  than  the  simplest  iid  random  effects  model  might  be  used. 

Dynamic  panel  models  can  also  be  estimated.  These  permit  distinction  between 
persistence  caused  by  unobserved  heterogeneity  and  persistence  caused  by  true  state 
dependence.  Implementation  may  require  writing  one’s  own  programs. 


23.10.  Bibliographic  Notes 

This  chapter  provides  an  overview  of  a  vast  and  divergent  literature  and  of  necessity  skips  many 
details.  The  monographs  on  panel  data  by  Arellano  (2004),  Baltagi  (2001),  Hsiao  (2003),  and 
M.-J.  Lee  (2002)  provide  considerable  treatment  of  panel  models  for  binary  data  and  censored 
and  selected  models.  Panel  models  for  counts  are  presented  in  Cameron  and  Trivedi  (1998)  and 
M.-J.  Lee  (2002).  Wooldridge  (2002)  presents  panel  methods  for  binary,  censored,  and  count 
data. The statistical literature for various generalized linear models is summarized in Fahrmeir
and Tutz (1994) and Diggle et al. (1994, 2002). Various papers in Mátyás and Sevestre (1995)
consider nonlinear panel models. M.-J. Lee (2002) emphasizes GMM estimation. Arellano and
Honoré (2001) emphasize semiparametric methods for nonlinear panel models. Bayesian
estimation with panel data is presented in Koop (2003).

23.2 For general discussion of the incidental parameters problem see Lancaster (2002). Key
references are Andersen (1970) for conditional ML and Chamberlain (1992) and Wooldridge
(1997a) for differencing methods. For random effects models Butler and Moffitt (1982)
detail use of Gaussian quadrature to eliminate normally distributed random effects,
whereas the statistics literature emphasizes the GEE approach of Liang and Zeger (1986).

23.4  For  fixed  effects  logit  models  key  references  are  Chamberlain  (1980)  for  static  models, 
Chamberlain (1985) for pure time series dynamic models, and Honoré and Kyriazidou
(2000) for dynamic models with additional regressors. See also Hsiao (1995).

23.5  For  selection  in  panel  data  see  the  survey  by  Vella  (1998)  and  the  texts  by  Baltagi  (2001) 
and  Wooldridge  (2002). 

23.6  Chamberlain  (1985)  presents  several  ways  to  eliminate  fixed  effects  in  various  duration 
models.  Van  den  Berg  (2001,  section  6)  provides  a  good  discussion  and  many  references. 
Event  history  analysis  using  multiple-spells  data  on  individuals  is  more  complicated  than 
most  panel  analysis  as  the  models  are  intrinsically  dynamic. 



23.7  The  classic  reference  for  panel  count  data  models  is  Hausman  et  al.  (1984).  For  dynamic 
models  see  Blundell  et  al.  (2002). 

23.8 For a survey of panel semiparametric methods see Arellano and Honoré (2001) and also
L.-F. Lee (2001).


- Exercises - 

23-1 Consider the nonlinear panel data model $y_{it} = \alpha_i + \exp(x_{it}'\beta) + u_{it}$, where $\beta$ are
parameters to be estimated, $\alpha_i$, $i = 1, \ldots, N$, are individual-specific effects, $u_{it}$
are iid $[0, \sigma_u^2]$ errors, and the panel is short.

(a) Suppose that all $\alpha_i = 0$. Can $\beta$ be consistently estimated? If yes, provide
the formula or objective function for a consistent estimator. If no, give a brief
explanation of why $\beta$ cannot be consistently estimated.

(b) Suppose the individual-specific effects $\alpha_i$ are random and are iid $[0, \sigma_\alpha^2]$
distributed independently of the regressors. Can $\beta$ be consistently estimated?
If yes, provide the formula or objective function for a consistent estimator. If
no, give a brief explanation of why $\beta$ cannot be consistently estimated.

(c) Suppose the individual-specific effects $\alpha_i$ are random but are correlated with
the regressors. Can $\beta$ be consistently estimated? If yes, provide the formula
or objective function for a consistent estimator. If no, give a brief explanation
of why $\beta$ cannot be consistently estimated.

23-2 (Adapted from Chamberlain, 1980) Show that the MLE in a binary logit panel
model is inconsistent, with plim equal to $2\beta$ in a simple $T = 2$ model.

23-3  Use  the  same  model  for  the  Patents-R&D  data  as  in  Section  23.3,  except  vary 
the  dependent  variable  and  model  as  suggested  in  the  following.  In  each  case 
estimate  random  effects  models  and,  if  theoretically  feasible,  a  fixed  effects 
model. 

(a)  Use  a  logit  model  of  whether  or  not  the  firm  has  a  patent. 

(b)  Use  a  truncated  tobit  model  of  number  of  log(Patents)  with  observations  of 
firms  with  zero  patents  dropped. 

(c) Use a Poisson model for number of patents.

In empirical work data frequently present not one but multiple complications that need
to  be  dealt  with  simultaneously.  Examples  of  such  complications  include  departures 
from  simple  random  sampling,  clustering  of  observations,  measurement  errors,  and 
missing  data.  When  they  occur,  individually  or  jointly,  and  in  the  context  of  any  of 
the  models  developed  in  Parts  4  and  5,  identification  of  parameters  of  interest  will 
be  compromised.  Three  chapters  in  Part  6  -  Chapters  24,  26,  and  27  -  analyze  the 
consequences  of  such  complications  and  then  present  methods  that  control  for  these 
complications.  The  methods  are  illustrated  using  examples  taken  from  the  earlier  parts 
of  the  book.  This  feature  gives  points  of  connection  between  Part  6  and  the  rest  of  the 
book. 

Chapter 24, which deals with several features of data from complex surveys, notably
stratified sampling and clustering, complements various topics covered in Chapters 3,
5, and 16. Chapter 26 deals with measurement errors in models studied in
Chapters 4, 14, and 20. Chapter 27 is a stand-alone chapter on missing data and multiple
imputation, but its use of the EM algorithm and Gibbs sampler also gives it points of
contact with Chapters 10 and 13, respectively.

Chapter 25 presents treatment evaluation. Treatment is a broad term that refers to
the impact of one variable, e.g. schooling, on some outcome variable, e.g. earnings.
Treatment variables may be exogenously assigned, or may be endogenously chosen.
The topic of treatment evaluation concerns the identifiability of the impact of
treatment on outcome, as measured by either the marginal effects or certain functions of
the marginal effect. A variety of methods are used including instrumental variables
regression and propensity score matching. The problem of treatment evaluation can
arise in the context of any model considered in Parts 4 and 5. This chapter emphasizes
the linear regression model, so may be read early on. However, it does presume
familiarity with many other topics covered in the book, including instrumental variables
and selection models. For this reason this topic of growing importance is placed in the
last part of the book.

Stratified and Clustered Samples


24.1.  Introduction 

Microeconometrics  research  is  usually  performed  on  data  collected  by  survey  of  a 
sample  of  the  population  of  interest.  The  simplest  statistical  assumption  for  survey 
data  is  simple  random  sampling  (SRS),  under  which  each  member  of  the  population 
has  equal  probability  of  being  included  in  the  sample.  Then  it  is  reasonable  to  base 
statistical inference on the assumption that the data $(y_i, x_i)$ are independent over $i$ and
identically  distributed.  This  assumption  underlies  the  small-sample  and  asymptotic 
properties  of  estimators  presented  in  this  book,  with  the  notable  exception  of  sample 
selection  models  in  Chapter  16. 

In  practice,  however,  SRS  is  almost  never  the  right  assumption  for  survey  data. 
Alternative  sampling  schemes  are  instead  used  to  reduce  survey  costs  and  to  increase 
precision  of  estimation  for  subgroups  of  the  population  that  are  of  particular  interest. 

For  example,  a  household  survey  may  first  partition  the  population  geographically 
into  subgroups,  such  as  villages  or  suburbs,  with  differing  sampling  rates  for  different 
subgroups.  Interviews  may  be  conducted  on  households  that  are  clustered  in  small 
geographic areas, such as city blocks. The data $(y_i, x_i)$ are clearly no longer iid. First,
the distribution of $(y_i, x_i)$ may vary across subgroups, so the identical distribution
assumption may be inappropriate. Second, since data may be correlated for households
in the same cluster, the assumption that $(y_i, x_i)$ are independent within the cluster
breaks  down. 

The  usual  methods  employed  to  obtain  the  distribution  of  estimators  therefore  need 
to  be  adapted,  and  the  properties  of  estimators  may  depart  from  results  obtained  under 
SRS.  This  is  the  subject  of  this  chapter. 

The consequences for regression modeling are the following. First, weighted
estimators that adjust for differences in sampling rates may be necessary if the goal of
analysis is prediction of population behavior. Second, such weighting is unnecessary if
interest lies in regression of $y$ on $x$, provided the conditional model for $y$ given $x$ is
correctly specified and stratification is not on the dependent variable. Third, if samples
are determined in part by the value of the dependent variable, such as an oversample
of low-income people when income is the dependent variable, weighted estimation
is necessary. A range of estimation procedures are possible, with some presented
in Chapter 16 in the context of sample selection bias. Fourth, clustering at a
minimum leads to standard error estimates that appreciably understate the true standard
errors and can even lead to inconsistent parameter estimates unless adjustment is made
for clustering using methods similar to those presented in Chapter 21 for panel data
analysis.

The most important implication for most microeconometrics applications using
survey data is the need to control for clustering. Clustering of observations is often found
in both cross-section and panel data, as a consequence of (1) sampling design, (2)
design of a social experiment, or (3) the nature of the observation method. An example
of (1) is a complex large-scale household survey in which spatial clusters of
households are sampled to reduce the cost of surveys. An example of (2) is a randomized
social experiment in which a common treatment is assigned to individuals in a
particular location such as an industrial plant or a school. Examples of (3) are regressions
with  individual  cross-section  data  when  regressors  also  include  group  averages  such 
as  unemployment  or  tax  rates  at  the  state  level,  use  of  panel  data,  and  use  of  siblings 
data  even  if  there  is  no  clustering  of  households. 

Section  24.2  introduces  some  of  the  concepts  and  terminology  of  survey  sampling. 
Sections 24.3-24.5 consider the three key features of survey data, respectively,
sample weights, stratification, and clustering. Section 24.6 considers hierarchical linear
models  where  both  stratification  and  clustering  are  present.  An  application  to  data  is 
presented  in  Section  24.7.  Complex  surveys  are  considered  further  in  Section  24.8. 


24.2.  Survey  Sampling 

Survey sampling has been well researched in the statistics literature, since data
collection must be done before any analysis, and surveying can be very expensive. The goal
of the survey literature is usually to obtain with minimal cost a sample that can
provide unbiased and reasonably precise estimates of population parameters, especially
the  population  mean. 

The  structure  of  a  multistage  survey  was  described  in  Section  3.2.  The  U.S.  CPS  is 
a  leading  example  of  such  a  sample  design. 


24.2.1.  Current  Population  Survey 

The  CPS  is  a  monthly  survey  of  approximately  56,000  households  that  is  intended 
to  be  representative  of  the  civilian  noninstitutional  population  16  years  and  older. 
Households  in  smaller  states  are  oversampled  to  provide  more  reliable  state-level 
data.  Within  states  the  surveyed  households  are  clustered  to  reduce  interview  costs. 
Specifically, households are interviewed in four consecutive months, rested for eight
months, and then interviewed for another four months. Reinterviewing reduces
survey costs and the 4-8-4 schedule permits some longitudinal analysis, including
one-year differences. There are eight rotation groups of similar size, with a new rotation
group being introduced each month. We consider the sampling design for one rotation
group.

Specifically,  there  are  792  strata,  where  each  stratum  is  a  subregion  of  a  state  or 
in  some  cases  an  entire  state.  The  792  strata  are  split  into  2,007  PSUs,  where  a  PSU 
may  be  a  metropolitan  statistical  area  (MSA),  a  state-MSA  intersection  if  the  MSA 
covers  more  than  one  state,  a  single  county,  or  two  or  more  contiguous  counties,  with 
departures from this scheme when a PSU has low population or large area. On average
there are 2.5 PSUs per stratum. Of the 792 strata, 432 contain only one PSU, in which
case the PSU is called self-representing and is always included in the CPS survey. The
other 360 strata have more than one PSU, and exactly one PSU is randomly chosen
from the stratum with probability proportional to the 1990 population.

Within the PSUs there are no intermediate SSUs. The survey directly samples
USUs, each a geographically compact group of approximately four addresses. The
sampling probability increases if there was low probability of drawing the PSU from its
stratum and usually increases if the PSU is in a small state, to permit oversampling
in low-population states. (In this calculation New York and Los Angeles are treated
as states.) All households in the USU are surveyed, unless the USU has an
unusually high number of households, in which case a subset of households is randomly
chosen.

The CPS is designed to be self-weighting by state so that, despite the use of
nonrandom sampling, the CPS should provide a representative sample for each state.
However, the unweighted sample is not nationally representative because of the
oversampling of low-population states and because not all PSUs are sampled.


24.2.2.  Sampling 

Before  moving  to  a  more  detailed  analysis  of  survey  sampling,  we  provide  a  brief 
overview  of  sampling  basics  in  the  absence  of  complications  such  as  stratification. 

Let $z$ denote a vector of variables, where there is no need to distinguish between
dependent and regressor variables. We assume that in the population the variable $z$ is
iid with density $f(z)$. The population is of size $N^*$ and the sample is of size $N$. The
sample is $\{z_i, i = 1, \ldots, N\}$, where $i$ denotes the $i$th sampling unit. The usual notation
in the sampling literature is $n$ for sample size and $N$ for population size. We instead
continue to use $N$ for sample size as there is only occasional need to introduce the
population size $N^*$.


Exhaustive  Sampling 

Under  exhaustive  sampling  every  element  of  the  population  is  sampled,  so  the  sample 
is  the  population.  Such  sampling  is  rare  with  individual-level  data.  It  does  happen 
in  a  population  census  such  as  the  U.S.  decennial  census.  Yet  even  for  the  census, 
subsampling  is  used  for  the  lengthier  questionnaires,  researchers  may  prefer  to  work 
with  a  more  manageable  census  subsample,  and  in  practice  the  coverage  of  the  census 
is incomplete. Exhaustive sampling is more common with firm-level data, where, for
example, all firms in an industry may be studied.

Exhaustive sampling can lead to debate about whether the usual inferential methods
are  warranted,  as  the  sample  moments  then  equal  the  population  moments.  The  usual 
procedure  is  to  still  use  the  usual  inferential  methods.  This  is  done  by  viewing  the  finite 
population  to  be  a  sample  from  an  infinite  superpopulation. 

For  example,  suppose  interest  lies  in  gender  differences  in  salary  at  a  work  site  that 
has  a  population  total  of  20  men  and  12  women  performing  similar  tasks.  Salaries 
are  obtained  for  all  men  and  women  at  the  work  site,  so  the  sample  is  the  population, 
and  mean  salary  is  found  to  be  higher  for  men  than  women.  It  is  customary  to  perform 
conventional  hypothesis  tests  on  the  differences  in  mean  salary,  rather  than  to  conclude 
that  since  the  sample  mean  equals  the  population  mean  there  is  100%  certainty  that 
male  salaries  are  higher.  The  rationale  is  that  the  population  at  this  particular  work  site 
is  viewed  as  a  sample  from  a  superpopulation  of  work  sites,  or  from  a  superpopulation 
of  the  particular  work  site  at  many  points  in  time. 

Exhaustive sampling is expensive and is generally unnecessary for large
populations unless the actual population size needs to be determined. Instead, a subset of the
population  is  usually  sampled. 


Simple  Random  Sampling 

A  simple  random  sample  is  one  where  observations  are  drawn  from  the  population 
at random and with equal probability. Each observation appears in the sample with
probability equal to the sample size divided by the population size, and has the same
marginal density $f(z)$. The prefix “simple” is added because more systematic sampling
methods  still  usually  have  a  random  element. 


Finite-Sample  Correction 

Most  econometric  analysis  presumes  that  SRS  leads  to  draws  of  z  that  are  independent, 
so  the  joint  density  of  the  sample  under  SRS  is  the  product  of  the  individual  densities 
$f(z_i)$. This is reasonable if the SRS is obtained from an infinite population, as is the
case  if  sampling  is  viewed  to  be  from  a  superpopulation,  or  if  it  is  obtained  from  a 
finite  population  and  sampling  is  with  replacement. 

In practice for finite populations an SRS is obtained without replacement, to
ensure that the same observation does not appear in the sample twice. Then observations
are no longer independent, even under SRS. To see this, note that under SRS the
probability of any particular member of the population appearing in the sample is $N/N^*$.
Given knowledge that this member appears in the sample, however, the probability
of any other member appearing in the sample falls to $(N - 1)/(N^* - 1)$. Clearly, the
conditional probability differs from the unconditional probability. More formally, one
introduces indicator variables for whether each case in the population appears in the
sample. These indicator variables are joint multinomial distributed with means $\pi$,
variances $\pi(1 - \pi)$, and covariances $-\pi(1 - \pi)/(N^* - 1)$, where $\pi = N/N^*$.

The correlation between sample observations is $\rho = -1/(N^* - 1)$, where $\rho$ is
called the intraclass correlation. Letting $z$ be a scalar, we have that the sample
mean $\bar{z} = N^{-1} \sum_i z_i$ has variance $V[\bar{z}] = N^{-2} V[\sum_i z_i]$, which does not simplify to
$N^{-2} \sum_i V[z_i]$ owing to the correlation of the $z_i$. Some algebra given, for example, in
Cochran (1977, pp. 23-24) yields

$$V[\bar{z}] = (1 - f) \frac{S^2}{N},$$

where $f = N/N^*$ is the sampling fraction, and results in this literature are usually
simpler to express in terms of $S^2 = (N^* - 1)^{-1} \sum_i (z_i - \bar{z})^2$ rather than the usual finite
population variance $\sigma^2 = N^{*-1} \sum_i (z_i - \bar{z})^2$, with both sums and the mean $\bar{z}$ here
taken over the $N^*$ members of the population.

Thus for sampling without replacement from a finite population, the variance of the
sample mean equals the usual $S^2/N$ multiplied by the finite-sample correction term
$1 - f$. This correction term appears in statistical packages for survey data. Failure to
allow for the finite-sample correction term leads to conservative statistical inference
as $V[\bar{z}]$ will be overestimated. For regression using data from SRS without replacement,
a finite-sample correction is similarly relevant, though the extent and direction of bias
in the variance of the OLS estimator now additionally depends on the design matrix.
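
The finite-sample correction can be verified by brute force on a tiny population. The sketch below uses hypothetical function names and an illustrative population. It enumerates every simple random sample drawn without replacement, computes the exact variance of the sample mean, and compares it with $(1 - f)S^2/N$.

```python
from itertools import combinations

def variance_of_sample_mean(population, n):
    """Exact variance of the sample mean over all size-n samples drawn without replacement."""
    means = [sum(s) / n for s in combinations(population, n)]
    center = sum(means) / len(means)
    return sum((m - center) ** 2 for m in means) / len(means)

def corrected_formula(population, n):
    """(1 - f) * S^2 / N, with f = N / N* and S^2 computed with divisor N* - 1."""
    n_pop = len(population)
    mu = sum(population) / n_pop
    s2 = sum((v - mu) ** 2 for v in population) / (n_pop - 1)
    return (1 - n / n_pop) * s2 / n
```

For the population 1, 2, 3, 4, 5 and samples of size 2, both functions give 0.75, whereas the uncorrected $S^2/N$ is 1.25. This shows how ignoring the correction overstates the variance when the sampling fraction is large.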

The finite-sample correction term is usually ignored in microeconometrics. This
is often reasonable. For example, for household survey data the sample size is small
relative to population size so that $f = N/N^* \simeq 0$.
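
The variance formula with the finite-sample correction can be checked by brute force on a
small population. The following is a minimal sketch, not part of the original text: the
population values and the function names are hypothetical, and the code simply enumerates
every possible sample drawn without replacement and compares the exact variance of the
sample mean with $(1 - f)S^2/N$.

```python
from itertools import combinations
from statistics import mean


def exact_var_of_sample_mean(population, n):
    """Exact variance of the sample mean over all SRS-without-replacement samples."""
    means = [mean(sample) for sample in combinations(population, n)]
    center = mean(means)
    return sum((m - center) ** 2 for m in means) / len(means)


def corrected_variance(population, n):
    """(1 - f) * S^2 / N, with S^2 computed using the divisor N* - 1."""
    n_star = len(population)
    z_bar = mean(population)
    s2 = sum((z - z_bar) ** 2 for z in population) / (n_star - 1)
    f = n / n_star  # sampling fraction
    return (1 - f) * s2 / n


# Hypothetical finite population of N* = 6 values and samples of size N = 3.
population = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0]
n = 3
v_exact = exact_var_of_sample_mean(population, n)
v_formula = corrected_variance(population, n)
```

Because the sampling fraction here is one half, ignoring the correction would double the
reported variance; with $f$ close to zero the two calculations are nearly identical.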


24.3.  Weighting 

Household surveys such as the CPS are usually constructed in a way that leads to
different households having different probabilities of inclusion in the sample. Sample
weights are assigned to each observation to correct for this.

As  explained  in  the  following,  provided  stratification  is  exogenous,  weights  should 
be  used  if  regression  is  viewed  as  a  tool  to  describe  population  responses  but  need  not 
be  used  if  the  regression  model  is  assumed  to  be  the  correct  structural  model. 


24.3.1.  Sample  Weights 

Suppose each household in the population has a probability $\pi_i$ of appearing in the
sample and assume that, unlike SRS, this probability varies across households.

Statistics  such  as  overall  sample  means  that  give  equal  weight  to  all  observations 
will  then  tend  to  give  too  much  weight  to  households  that  appear  with  high  probability 
in  the  sample.  This  can  be  corrected  by  weighting,  using  sample  weights  that  are 
inversely  proportional  to  the  probability  of  inclusion  in  the  sample: 


$$w_i \propto 1/\pi_i. \tag{24.1}$$

For example, instead of $\bar{z} = N^{-1}\sum_i z_i$ we may use the weighted mean

$$\bar{z}_W = \sum_i w_i z_i \Big/ \sum_i w_i.$$

Note that all that matters in (24.1) is proportionality. The weights need not sum to one,
provided we divide by the sum of the weights. A common scaling is $\sum_i w_i = N^*$,
in which case a weight of $w_i$ means that the observation represents $w_i$ households
in the population. Note that care is needed in using weights. Some references instead
define $w_i \propto \pi_i$, and some computer packages compute the weighted mean as
$\sum_i (z_i/w_i) / \sum_i (1/w_i)$. It is easy to incorrectly weight by the reciprocal
of the sample weights.

For an SRS of size $N$ from a finite population of size $N^*$, $\pi_i = 1/N^*$, so $w_i$
is constant and $\bar{z}_W = \bar{z}$.

For simple stratified sampling with SRS within strata, suppose it is known that a
fraction $H_s$ of the population size $N^*$ is in stratum $s$ and that $N_s$ observations
are from the $s$th stratum. Then $\pi_i = N_s/(H_s N^*)$. It follows that the sample
weights $w_i \propto H_s/N_s$.
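
The stratified weights can be illustrated with a few lines of code. This is a minimal
sketch under assumed inputs: the sample, the stratum labels, the population shares, and
the function names are all hypothetical. Here `pop_share[s]` plays the role of $H_s$, the
known population fraction in stratum $s$, and the weights are $w_i \propto H_s/N_s$.

```python
def weighted_mean(z, w):
    """Weighted mean sum(w*z)/sum(w); only the proportions of the weights matter."""
    return sum(wi * zi for wi, zi in zip(w, z)) / sum(w)


def stratum_weights(strata, pop_share):
    """Return w_i = H_s / N_s, where s is the stratum of observation i."""
    counts = {}
    for s in strata:
        counts[s] = counts.get(s, 0) + 1
    return [pop_share[s] / counts[s] for s in strata]


# Hypothetical sample: stratum "a" holds 80% of the population but supplies
# only 2 of the 6 sampled households, so stratum "b" is oversampled.
strata = ["a", "a", "b", "b", "b", "b"]
z = [10.0, 12.0, 20.0, 22.0, 24.0, 26.0]
pop_share = {"a": 0.8, "b": 0.2}

w = stratum_weights(strata, pop_share)
z_bar_w = weighted_mean(z, w)    # 0.8 * 11 + 0.2 * 23 = 13.4
z_bar = sum(z) / len(z)          # unweighted mean, 19.0
```

The unweighted mean is pulled toward the oversampled stratum, whereas the weighted mean
equals the population-share-weighted average of the stratum sample means.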

For two-stage sampling without stratification let $\pi_c$ be the probability that the
$c$th PSU is selected and $\pi_{jc}$ be the probability that household $j$ is selected in
PSU $c$. Then the sample weights $w_{jc} \propto 1/(\pi_c N_c \pi_{jc} N)$, where $N_c$ is
the number of survey households in the $c$th PSU and $N = \sum_c N_c$. A two-stage sample
is self-weighting if the sampling probabilities at each stage are proportional to
population numbers, so $\pi_c = N_c^*/N^*$ and $\pi_{jc} = 1/N_c^*$, where $N_c^*$ is the
population size for the $c$th PSU. Then the weights $w_{jc}$ are equal as in SRS, though
estimator standard errors may still have to be adjusted for the two-stage sampling as
shown in Section 24.8.

For the CPS, which oversamples households in small states, it would appear sufficient
to use $w_i \propto H_s/N_s$, where $s$ denotes state. The CPS uses this as a baseweight,
but then adjusts for subsampling within the USU if the USU has too many households. A
further complication is that not all PSUs in a stratum are surveyed; consequently, the
surveyed households in a stratum may not be representative of the stratum if the sampled
PSUs differ considerably from stratum norms. This leads to two additional adjustments.
First, adjust for unrepresentative racial (black/nonblack) composition at the stratum
level. Second, adjust weights to ensure that sample estimates for key subgroups (formed
by state, race, sex, and age) match independent population data. For details see U.S.
Bureau of Census (2002). The CPS sample weights are constructed to permit the CPS to
provide nationally representative statistics, controlling for the composition of the CPS
differing from that of the U.S. civilian population on the dimensions of state, race, sex,
and age.

The actual computation of sample weights for multistage surveys involves estimation
procedures that can be quite complicated. The weights can be misestimated; even
if they are correctly estimated they may take into account only some of the dimensions
of sample nonrepresentativeness.


24.3.2.  Weighted  Regression 

Should one perform weighted regression when sample weights are provided? We consider
this issue in detail when the stratification is not on the dependent variable.
Stratification on the dependent variable is examined in Section 24.4.

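Consider estimation of the linear regression

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \tag{24.2}$$

given survey data with sampling weights $w_i$. Two possible estimators are OLS,

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{24.3}$$

and WLS using the sampling weights,

$$\hat{\boldsymbol{\beta}}_{WLS} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}, \tag{24.4}$$

where $\mathbf{W} = \mathrm{Diag}[w_i]$.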
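
The two estimators translate directly into matrix code. The sketch below uses NumPy with
made-up data; the function names `ols` and `wls` and the numbers are assumptions for
illustration only. The WLS function avoids forming the $N \times N$ matrix $\mathbf{W}$
by scaling the rows of $\mathbf{X}$.

```python
import numpy as np


def ols(X, y):
    """OLS estimator (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def wls(X, y, w):
    """WLS estimator (X'WX)^{-1} X'Wy with W = Diag[w]."""
    Xw = X * w[:, None]  # each row of X multiplied by its weight, i.e. W X
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


# Hypothetical data: six observations, an intercept and one regressor.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1, 11.0])
w = np.array([1.0, 1.0, 2.0, 2.0, 4.0, 4.0])

b_ols = ols(X, y)
b_wls = wls(X, y, w)
```

With integer weights, WLS gives the same coefficients as OLS applied to a data set in
which observation $i$ is repeated $w_i$ times, which is one way to read the statement
that a weight of $w_i$ means the observation represents $w_i$ households.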


Correctly  Specified  Conditional  Mean 

The OLS estimator is appropriate if it is assumed that $E[u|\mathbf{x}] = 0$, so that
the conditional mean is linear in $\mathbf{x}$:

$$E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}. \tag{24.5}$$

Then OLS is consistent for $\boldsymbol{\beta}$. Furthermore, it is second-moment
efficient by the Gauss-Markov Theorem if the errors $u_i$ are homoskedastic. The WLS
estimator is also consistent for $\boldsymbol{\beta}$ under these assumptions but will be
inefficient if errors are homoskedastic (since the weighting in (24.4) controls for
unrepresentativeness of the sample rather than heteroskedasticity).


Incorrectly  Specified  Conditional  Mean 

In many applications (24.5) does not hold. Examples include cases with omitted
regressors or situations when $E[y|\mathbf{x}]$ is nonlinear in $\mathbf{x}$ or
$E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}_i$, where some components of
$\boldsymbol{\beta}_i$ are correlated with $\mathbf{x}_i$. Linear regression can still be
interpreted as the best linear prediction of $y$ given $\mathbf{x}$ under squared error
loss, though this needs to be adapted to allow for unrepresentative sampling.

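In the population, $(y_i, \mathbf{x}_i)$ are iid, and from Section 4.2 we can always write

$$y_i = \mathbf{x}_i'\boldsymbol{\beta}^* + u_i,$$

where $E[u] = 0$ and $\mathrm{Cov}[\mathbf{x}, u] = \mathbf{0}$ and

$$\boldsymbol{\beta}^* = (E[\mathbf{x}\mathbf{x}'])^{-1} E[\mathbf{x}y].$$

Note that it is no longer assumed that $E[u|\mathbf{x}] = 0$, so it is possible that
$E[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}$.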

The parameter $\boldsymbol{\beta}^*$ is called the census coefficient by DuMouchel and
Duncan (1983). It is the probability limit of the regression coefficient that would be
obtained by regression of $y$ on $\mathbf{x}$ using the entire population rather than an
unrepresentative sample.

If the conditional mean is nonlinear in $\mathbf{x}$ and the sample is unrepresentative
of the population, then the OLS estimator generally does not converge to
$\boldsymbol{\beta}^*$, since with unrepresentative samples $N^{-1}\mathbf{X}'\mathbf{X}$
does not converge to the population moment $E[\mathbf{x}\mathbf{x}']$ and similarly for
$N^{-1}\mathbf{X}'\mathbf{y}$. Intuitively, if the conditional mean is nonlinear in
$\mathbf{x}$ then there is no reason to believe that linear regressions using quite
different surveys of the same population will yield the same OLS estimates.

However, WLS using sample weights may consistently estimate $\boldsymbol{\beta}^*$.
Specifically, if the weighting matrix $\mathbf{W}$ is such that

$$N^{-1}\mathbf{X}'\mathbf{W}\mathbf{X} \xrightarrow{p} E[\mathbf{x}\mathbf{x}'], \tag{24.6}$$

$$N^{-1}\mathbf{X}'\mathbf{W}\mathbf{y} \xrightarrow{p} E[\mathbf{x}y],$$

then $\hat{\boldsymbol{\beta}}_{WLS}$ defined in (24.4) converges to $\boldsymbol{\beta}^*$.
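
A stylized numerical example may help fix ideas. The sketch below is an illustration
under strong assumptions, not a simulation of any real survey: the population, the
sample, and the helper names are hypothetical. The population is built so that the
sample contains each distinct regressor value exactly once, which makes the weighted
sample moments equal the population moments exactly, so condition (24.6) holds with
equality rather than only in the limit. The conditional mean is $y = x^2$, which is
nonlinear in $x$.

```python
import numpy as np


def wls(X, y, w):
    """WLS estimator (X'WX)^{-1} X'Wy; constant weights give OLS."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


def design(x):
    """Regressor matrix with an intercept and x."""
    return np.column_stack([np.ones_like(x), x])


# Hypothetical population of 50 households. Stratum 1: x = 0,...,9 once each.
# Stratum 2: x = 10,...,19, each value held by four households.
x_pop = np.concatenate([np.arange(10.0), np.repeat(np.arange(10.0, 20.0), 4)])
y_pop = x_pop ** 2
beta_census = wls(design(x_pop), y_pop, np.ones(x_pop.size))

# Stylized sample: 10 households per stratum, so stratum 1 is oversampled.
x_s = np.arange(20.0)
y_s = x_s ** 2
pop_share = np.where(x_s < 10, 10 / 50, 40 / 50)  # H_s for each observation
n_per_stratum = 10                                # N_s
w = pop_share / n_per_stratum                     # w_i proportional to H_s / N_s

beta_ols = wls(design(x_s), y_s, np.ones(x_s.size))
beta_wls = wls(design(x_s), y_s, w)
```

In this construction the unweighted slope is 19, whereas the census slope is
$510.25/24.25 \approx 21.04$, and the weighted regression reproduces the census
coefficient. In an actual random sample the agreement would hold only approximately.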


Simple  Stratified  Samples 

Much of the analysis of weighted LS estimation is presented for simple stratified
sampling with SRS within strata. Then it is clear that (24.6) is satisfied with
$w_i \propto H_s/N_s$ if the $i$th interviewed household is in the $s$th stratum.

This literature also considers the possibility of different regression parameters
within strata. It is assumed that $E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}_s$
for household $i$ in stratum $s$. The goal may be to estimate the population-weighted
parameter $\boldsymbol{\beta}_W = N^{*-1}\sum_s N_s^*\boldsymbol{\beta}_s$. Then in
general neither OLS nor WLS converges to $\boldsymbol{\beta}_W$, unless the
$\boldsymbol{\beta}_s$ are equal across strata or are iid with constant mean across
strata. A notable exception to this result is estimation of the mean of $y$ (regression
with $x = 1$), in which case the weighted average of the strata sample means is unbiased
for the population mean. For details see Section 24.4.1 and DuMouchel and Duncan (1983),
Deaton (1997), or Ullah and Breunig (1998).


Should  One  Use  Sample  Weights? 

The preceding analysis can be used to answer the question of whether to use sample
weights in estimation, assuming there is no endogenous stratification. The discussion
considers estimation of (possibly nonlinear) models of $E[y|\mathbf{x}]$, but it also
applies to models of any other feature of the conditional distribution of $y$ given
$\mathbf{x}$ such as the median or the density.

If one takes a structural or analytical approach and assumes that the model of
$E[y|\mathbf{x}]$ is correctly specified, there is no need to use sample weights. Results
can be used to analyze effects of changes in $\mathbf{x}$ on $E[y|\mathbf{x}]$.

If  one  instead  takes  a  descriptive  or  data  summary  approach  then  weights  should 
be  used.  Regression  is  then  interpreted  as  estimating  census  coefficients.  A  major 
caveat,  however,  is  that  in  complicated  surveys  it  is  not  possible  to  compute  weights 
that  so  clearly  satisfy  (24.6)  as  was  the  case  for  stratified  sampling  with  SRS  within 
strata.  In  practice  sampling  weights  are  constructed  to  match  population  proportions 
for  some  subgroups  based  on  age,  sex,  and  race.  There  is  no  guarantee  that  such 
weights  will  satisfy  (24.6). 

Some  data  sets,  such  as  relatively  small  longitudinal  surveys  of  a  few  thousand 
households,  are  developed  with  a  structural  modeling  approach  in  mind.  Nonetheless, 
they  usually  attempt  to  provide  a  reasonably  representative  sample  of  the  population 
while  using  clustered  sampling  to  keep  down  survey  costs.  Other  data  sets,  such  as 
the CPS, are designed to provide accurate descriptive measures such as national and
regional estimates of unemployment rates. Here designers of the survey are taking a
census approach and indeed would prefer a monthly census if it were not so expensive
to conduct.

For  either  sort  of  data  set  microeconometricians  usually  strive  to  take  the  structural 
modeling  approach.  As  an  example,  consider  regression  of  earnings  on  schooling 
level  and  socioeconomic  characteristics  such  as  age,  sex,  and  race,  but  not  measures 
of  innate  ability. 

Most econometricians would only give a descriptive interpretation to the coefficient
of schooling in an OLS regression because of the endogeneity of schooling. The
interpretation then is that if we hold certain key regressors constant, one more year of
schooling is associated with, but does not necessarily cause, a 6% increase, say, in
earnings. Here using sample weights in OLS regression is appropriate to permit
estimates to be interpreted as measuring associations in the population, rather than
merely those in a possibly unrepresentative sample. Even though no causal interpretation
is possible, this estimate can be useful as it does measure how income varies across
educational groups after controlling for some other key socioeconomic variables. After
all, a major goal of statistics is data summary.

A consistent estimate of the schooling coefficient may be obtained using more advanced
estimation methods, such as instrumental variables or panel data methods.
Then the coefficient can be given a causal interpretation. Weighting by sample weights
is no longer necessary, though the usual weighting to improve efficiency if, for example,
errors are heteroskedastic, may be appropriate.

Whether a model can be interpreted as correctly specified is a judgement call. If it
is correctly specified then sample-weighted and unweighted estimates should have the
same probability limit, since both are consistent. This suggests testing correct model
specification by a Hausman test of the difference between sample-weighted and
unweighted regression estimates, a test proposed by DuMouchel and Duncan (1983) in the
case of linear regression.
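
One convenient way to carry out such a comparison in the linear case, usually attributed
to DuMouchel and Duncan (1983), is an auxiliary regression: add the weight-interacted
regressors $\mathbf{W}\mathbf{X}$ to the model and test whether their coefficients are
jointly zero. The code below is a minimal sketch of that idea under the assumption of
homoskedastic errors; the function names and the data are hypothetical, and a $p$-value
would come from comparing the statistic with the $F(k, N - 2k)$ distribution.

```python
import numpy as np


def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def weight_interaction_f(X, y, w):
    """F statistic for gamma = 0 in y = X beta + (W X) gamma + error."""
    n, k = X.shape
    X_aug = np.column_stack([X, X * w[:, None]])
    ssr_restricted = ssr(X, y)
    ssr_unrestricted = ssr(X_aug, y)
    return ((ssr_restricted - ssr_unrestricted) / k) / (ssr_unrestricted / (n - 2 * k))


# Hypothetical stratified sample: weights are 1 in one stratum and 4 in the other.
x = np.arange(20.0)
X = np.column_stack([np.ones(20), x])
w = np.where(x < 10, 1.0, 4.0)

# Misspecified linear model: the true conditional mean is quadratic.
f_misspecified = weight_interaction_f(X, x ** 2, w)

# Correctly specified linear model with a fixed, deterministic disturbance pattern.
e = 0.5 * (-1.0) ** np.arange(20)
f_linear = weight_interaction_f(X, 1.0 + 2.0 * x + e, w)
```

A large statistic, as in the quadratic case, signals that weighted and unweighted
estimates differ by more than sampling error would suggest; a small one is consistent
with the linear model being adequate for both.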


24.3.3.  Prediction 


Consider nonlinear regression with correctly specified conditional mean,
$g(\mathbf{x}, \boldsymbol{\beta})$, and no endogeneity. The unweighted NLS estimator
consistently estimates $\boldsymbol{\beta}$ and can be given a causal interpretation. In
particular, we can use $\partial g(\mathbf{x}, \boldsymbol{\beta})/\partial\mathbf{x}$ to
calculate the causal effect of a one-unit change in $\mathbf{x}$ on the conditional mean.

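This predicted effect varies with the evaluation point $\mathbf{x}$, since $g(\cdot)$ is
nonlinear. An estimate of the average response in the population is

$$\hat{E}\left[\frac{\partial y}{\partial \mathbf{x}}\right] = \sum_{i=1}^{N} w_i\, \frac{\partial g(\mathbf{x}_i, \hat{\boldsymbol{\beta}})}{\partial \mathbf{x}_i},$$

where $w_i$ are the sample weights (scaled to sum to one). Similarly, if one instead
evaluates the response at the mean of the regressors it may be better to use the weighted
sample mean of $\mathbf{x}$, an estimate of the population mean of $\mathbf{x}$, rather
than the unweighted sample mean of $\mathbf{x}$.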
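
As an illustration, take an exponential conditional mean
$g(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, for which
$\partial g/\partial x_j = \exp(\mathbf{x}'\boldsymbol{\beta})\beta_j$. The sketch below
computes the weighted average response; the functional form, the coefficient values, and
the data are assumptions chosen only for the example, and the weights are normalized
inside the function so that any proportional scaling gives the same answer.

```python
import numpy as np


def weighted_average_response(X, beta, w, j):
    """Weighted average of d exp(x'beta) / d x_j = exp(x'beta) * beta_j."""
    effects = np.exp(X @ beta) * beta[j]
    return float(np.sum(w * effects) / np.sum(w))


# Hypothetical estimates and sample: intercept plus one regressor.
beta_hat = np.array([0.1, 0.5])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
w = np.array([3.0, 1.0, 1.0])   # the first household represents more of the population

sample_response = weighted_average_response(X, beta_hat, np.ones(3), j=1)
population_response = weighted_average_response(X, beta_hat, w, j=1)
```

Because the response is larger at high values of the regressor and those observations
carry less weight here, the population estimate is below the unweighted sample average.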

Even if the parameters are consistently estimated using unweighted estimation,
weighting must be used in subsequent impact calculations if one wishes to predict
population impacts, rather than sample impacts.


24.4.  Endogenous Stratification

Stratification is widely used as it can increase precision of estimation, or equivalently
reduce survey costs for a given level of precision. For example, more precise estimation
of the mean unemployment rate in low-population states may be obtained by
oversampling poor states. For similar reasons minority groups may be oversampled.

One  complication,  already  considered  in  Section  24.3,  is  that  parameters  may  vary 
across  strata.  For  example,  the  mean  unemployment  rate  may  vary  across  strata.  Then 
a  descriptive  approach  is  taken  and  weighted  estimators  are  used. 

Microeconometricians  often  prefer  a  structural  approach  and  assume  parameters  are 
unchanging  across  strata.  Then  from  Section  24.3  stratification  apparently  causes  no 
complications  and  unweighted  regression  is  used.  A  major  proviso  is  that  problems 
still  arise  if  stratification  is  based  on  the  value  of  the  dependent  variable.  For  example, 
if  low-income  people  are  purposely  oversampled  and  income  is  the  dependent  variable 
then  the  usual  regression  estimators  are  inconsistent.  Note  that  there  is  no  problem  if 
stratification  is  on  regressors  such  as  race  and  this  leads  indirectly  to  oversampling  of 
low-income  people.  Problems  only  arise  if  stratification  is  directly  on  income. 

In this section we define endogenous stratification and analyze the resulting
complications. We then present several estimators that are consistent. The simplest is a
weighted estimator that can be used if both the sample and population strata
probabilities are known. The method is given in Section 24.4.5, which is self-contained.


24.4.1.  Stratification  Schemes 


For general data $\mathbf{z} \in \mathcal{Z}$ the strata are subsets of $\mathcal{Z}$.
Econometric analysis usually partitions the data into dependent variable
$\mathbf{y} \in \mathcal{Y}$, where for generality we allow $\mathbf{y}$ to be a vector,
and regressor or independent variable $\mathbf{x} \in \mathcal{X}$. The strata $C_s$, for
$s = 1, \ldots, S$, are then defined to be subsets of the sample space
$\mathcal{Y} \times \mathcal{X}$. The notation is that used by Imbens and Lancaster
(1996), who present some leading examples that are reproduced in Table 24.1.

Sampling within strata is assumed to be random but some strata may be oversampled.
From Table 24.1 it is clear that the strata may sum to less than the sample space
or more than the sample space. For the fourth and fifth schemes the stratification may
be solely on endogenous variables, solely on exogenous variables, or on a mixture of
the two.

The  econometrics  literature  has  focused  on  sampling  schemes  with  an  endogenous 
component,  since  in  that  case  the  usual  conditional  MLE  is  inconsistent. 

Endogenous stratification has already been considered in Chapter 16. As an example,
consider truncated regression, where we observe $y$ only if $y > 0$, so stratification
is purely on $y$. Then for sampled data the conditional density of $y$ given $\mathbf{x}$
is a zero-truncated density that divides the untruncated density by
$\Pr[y > 0|\mathbf{x}]$ and so

$$f^s(y|\mathbf{x}, \boldsymbol{\theta}) = \frac{f(y|\mathbf{x}, \boldsymbol{\theta})}{1 - F(0|\mathbf{x}, \boldsymbol{\theta})},$$

where the superscript $s$ is used to distinguish the sample density from the population
density $f(y|\mathbf{x}, \boldsymbol{\theta})$. As discussed in Chapter 16, this sampling
scheme tends to drop observations with low realizations of $y$, given $\mathbf{x}$.
Suppose $E[y|x] = \beta_1 + \beta_2 x$ and $\beta_2 > 0$. Then for low values of $x$
there will be too many relatively high values of $y$. The regression will accordingly
overpredict $E[y|x]$ for low values of $x$, leading to upward bias in the intercept
$\beta_1$ and downward bias in the slope $\beta_2$.

Table 24.1.  Stratification Schemes with Random Sampling within Strata

Stratification Scheme | Definition | Stratum Description
--------------------- | ---------- | -------------------
Simple random | $S = 1$, $C_1 = \mathcal{Y} \times \mathcal{X}$ | One stratum covers entire sample space.
Pure exogenous | $C_s = \mathcal{Y} \times \mathcal{X}_s$, with $\mathcal{X}_s \subset \mathcal{X}$ | Stratify on regressors only, not on dependent variable.
Pure endogenous | $C_s = \mathcal{Y}_s \times \mathcal{X}$, with $\mathcal{Y}_s \subset \mathcal{Y}$ | Stratify on dependent variable only, not on regressors.
Augmented sample | $S = 2$, $C_1 = \mathcal{Y} \times \mathcal{X}$, and $C_2 \subset \mathcal{Y} \times \mathcal{X}$ | Random sample augmented by extra observations from part of the sample space.
Partitioned | $C_s \subset \mathcal{Y} \times \mathcal{X}$, $C_s \cap C_t = \emptyset$, and $\bigcup_{s=1}^{S} C_s = \mathcal{Y} \times \mathcal{X}$ | Sample space split into mutually exclusive strata that fill the entire sample space.

A  second  example  is  choice-based  sampling  for  binary  or  multinomial  data  where 
samples  are  chosen  based  on  the  discrete  outcome  y.  For  example,  if  choice  is  between 
travel  to  work  by  bus  or  travel  by  car  we  may  oversample  bus  riders  if  relatively  few 
people  commute  by  bus.  This  example  is  pursued  in  the  following.  It  is  similar  to 
case-control  studies  in  the  medical  literature  where,  for  example,  a  complete  sample 
of  people  who  died  from  a  disease  (y  =  1)  is  contrasted  with  a  similar-sized  subsample 
of  the  universe  of  people  who  did  not  die  of  the  disease  (y  =  0).  The  goal  is  to  find 
whether one or more regressors are able to predict y = 1.

A related example is count data on number of visits collected by on-site sampling
of users, such as sampling at recreational sites or shopping centers or doctor's offices.
Then data are truncated, since those with $y = 0$ are not sampled, and additionally
high-frequency visitors are oversampled. Shaw (1988) shows that the sampling
distribution of the data, $f^s(y|\mathbf{x}, \boldsymbol{\theta})$, is related to the
population distribution through the equation

$$f^s(y|\mathbf{x}, \boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta}) \times \frac{y}{E[y|\mathbf{x}, \boldsymbol{\theta}]}.$$

In  this  case  the  sampling  scheme  is  clearly  endogenous  though  it  is  not  a  stratified 
sampling  scheme. 
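
For a concrete case of on-site sampling, suppose the population count is Poisson with
mean $\mu$. The Poisson choice, the value of $\mu$, and the function names below are
assumptions made for illustration. Multiplying the Poisson density by $y/\mu$ gives
$e^{-\mu}\mu^{y-1}/(y-1)!$, so the on-site count minus one is again Poisson with mean
$\mu$; the sketch verifies this numerically.

```python
import math


def poisson_pmf(y, mu):
    """Population density of a Poisson count with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def onsite_pmf(y, mu):
    """On-site sampling density f(y) * y / E[y] for a Poisson population."""
    return poisson_pmf(y, mu) * y / mu


mu = 2.5                   # hypothetical population mean number of visits
support = range(0, 60)     # the tail beyond 60 is negligible for this mu

total_mass = sum(onsite_pmf(y, mu) for y in support)
onsite_mean = sum(y * onsite_pmf(y, mu) for y in support)
```

The on-site density puts no mass at zero, integrates to one, and has mean $\mu + 1$, so a
naive sample average of visits overstates the population mean by one visit in this case.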


24.4.2.  Endogeneity Induced by Stratification

Sampling  schemes  such  as  stratification  schemes  lead  to  the  density  in  the  sample 
differing  from  that  in  the  population.  If  stratification  is  purely  exogenous,  then  despite 
this  difference  the  usual  MLE  is  still  consistent  because  the  conditional  density  of  y 
given  x  in  the  sample  is  the  same  as  that  in  the  population.  However,  if  any  aspect  of 
stratification  is  endogenous,  then  these  conditional  densities  differ,  as  illustrated  by  the 
preceding  examples.  We  now  provide  a  detailed  discussion  of  this  point. 

The goal of ML estimation lies in consistently estimating the parameters
$\boldsymbol{\theta}$ of $f(y|\mathbf{x}, \boldsymbol{\theta})$. In general the MLE
should be based on the likelihood formed from the joint distribution of the data
$(y, \mathbf{x})$. In practice it is often sufficient to simply form a conditional
likelihood from the conditional distribution of $y$ given $\mathbf{x}$. This simpler
approach can lead to consistent estimation under the assumption that $\mathbf{x}$ is
exogenous with respect to $y$, in which case the joint density factorizes as

$$g(y, \mathbf{x}|\boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta}) \times h(\mathbf{x}), \tag{24.7}$$

where the parameters of the density of $\mathbf{x}$ are suppressed as there is no desire
to estimate these parameters.

It is always the case that we can write
$g(y, \mathbf{x}) = f(y|\mathbf{x}) \times h(\mathbf{x})$. The assumption made in (24.7)
is that, upon introduction of parameters, $\boldsymbol{\theta}$ appears in
$f(y|\mathbf{x}, \boldsymbol{\theta})$ but does not appear in $h(\mathbf{x})$. In
general, rather than (24.7) we may have

$$g(y, \mathbf{x}|\boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta}) \times h(\mathbf{x}|\boldsymbol{\theta}). \tag{24.8}$$


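Then one or more components of $\mathbf{x}$ are endogenous with respect to $y$ since
there is now feedback: $y$ depends on $\mathbf{x}$ but $\mathbf{x}$ in turn depends on
$y$ via the presence of $\boldsymbol{\theta}$ in $h(\mathbf{x}|\boldsymbol{\theta})$. A
classic example of this is linear simultaneous equations. In such cases ML estimation
should be based on the joint likelihood

$$\ln L_{joint}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta}) + \sum_{i=1}^{N} \ln h(\mathbf{x}_i|\boldsymbol{\theta}). \tag{24.9}$$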


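This yields a consistent estimate of $\boldsymbol{\theta}$ if, from Chapter 5,

$$E\left[\frac{\partial \ln g(y, \mathbf{x}|\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] = E\left[\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] + E\left[\frac{\partial \ln h(\mathbf{x}|\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] = \mathbf{0}. \tag{24.10}$$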


Condition (24.10) is satisfied if the density $g(y, \mathbf{x}|\boldsymbol{\theta})$ is
correctly specified and the range of the data does not depend on $\boldsymbol{\theta}$.
The conditional MLE instead maximizes the conditional likelihood

$$\ln L_{cond}(\boldsymbol{\theta}) = \sum_i \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta}).$$

The conditional MLE is consistent if
$E[\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}] = \mathbf{0}$.
This necessary condition is implied by (24.10) if $\mathbf{x}$ is exogenous, since (24.10)
simplifies because then $\partial \ln h(\mathbf{x})/\partial\boldsymbol{\theta} = \mathbf{0}$.
If instead $\mathbf{x}$ is endogenous this simplification does not occur, as the second
term on the right-hand side of (24.10) does not disappear. So the conditional MLE is
inconsistent if $\mathbf{x}$ is endogenous.

The problem that arises with stratification and similar sampling schemes is that
even if the population joint density satisfies (24.7) and is the same across strata, the
sampling schemes can lead to a joint density for $(y, \mathbf{x})$ in the sample that
takes the more general form

$$g^s(y, \mathbf{x}|\boldsymbol{\theta}) = f^s(y|\mathbf{x}, \boldsymbol{\theta}) \times h^s(\mathbf{x}|\boldsymbol{\theta}), \tag{24.11}$$

where the superscript $s$ is used to denote dependence on the particular sampling
scheme employed. Then the conditional MLE may be inconsistent, even though it
would be consistent if the sample were instead an SRS.

Under pure exogenous sampling the only difference between sample and population
distribution occurs for the marginal distribution of $\mathbf{x}$. Assuming (24.7) holds
in the population, then in the sample

$$g^s(y, \mathbf{x}|\boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta}) \times h^s(\mathbf{x}).$$

Clearly, the conditional MLE will be consistent as the conditional density is still
$f(y|\mathbf{x}, \boldsymbol{\theta})$ and $\boldsymbol{\theta}$ does not appear in
$h^s(\mathbf{x})$.

Under endogenous sampling the more general result (24.11) holds in the sample
even if (24.7) holds in the population. The sample and population conditional
distributions of $y$ given $\mathbf{x}$ may differ, with
$f^s(y|\mathbf{x}, \boldsymbol{\theta}) \neq f(y|\mathbf{x}, \boldsymbol{\theta})$, and
$h^s(\mathbf{x}|\boldsymbol{\theta})$ may possibly depend on $\boldsymbol{\theta}$.


24.4.3.  Endogenous  Sampling 

Under pure endogenous sampling the marginal distribution of $y$ in the sample differs
from that in the population. Let $h(y)$ denote the population density of $y$ and $h^s(y)$
denote the sampling density of $y$. (We are using the convention that $g$, $f$, and $h$
denote, respectively, joint, conditional, and marginal distributions. It should be clear
to the reader that $h(y)$ differs from $h(\mathbf{x})$.)

The joint distribution of $y$ and $\mathbf{x}$ under pure endogenous sampling is best
obtained by first conditioning on $y$, rather than $\mathbf{x}$. Then

$$g^s(y, \mathbf{x}) = f(\mathbf{x}|y)\, h^s(y), \tag{24.12}$$

where simplification has occurred because the conditional distribution of $\mathbf{x}$
given $y$ is unaffected under pure endogenous sampling and so
$f^s(\mathbf{x}|y) = f(\mathbf{x}|y)$. We now need to reexpress $f(\mathbf{x}|y)$ in terms
of $f(y|\mathbf{x})$. Now

$$f(\mathbf{x}|y) = \frac{g(y, \mathbf{x})}{h(y)} = \frac{f(y|\mathbf{x})\, h(\mathbf{x})}{h(y)}. \tag{24.13}$$

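Substituting (24.13) into (24.12) and rearranging yields

$$g^s(y, \mathbf{x}|\boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta}) \times \frac{h^s(y)}{h(y|\boldsymbol{\theta})} \times h(\mathbf{x}),$$

where

$$h(y|\boldsymbol{\theta}) = \int g(y, \mathbf{x}|\boldsymbol{\theta})\, d\mathbf{x} = \int f(y|\mathbf{x}, \boldsymbol{\theta})\, h(\mathbf{x})\, d\mathbf{x}.$$

The conditional MLE using just $f(y|\mathbf{x}, \boldsymbol{\theta})$ will be
inconsistent because the term $h(y|\boldsymbol{\theta})$ has been neglected. One instead
needs to maximize a joint likelihood that additionally includes
$h(y|\boldsymbol{\theta})$.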
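
The effect of the neglected ratio $h^s(y)/h(y|\boldsymbol{\theta})$ is easy to see for a
binary outcome. Normalizing the sample joint density over $y$ at a given $x$ gives the
sample conditional probability computed below. The logit form, the parameter values, and
the shares `q1` (population share with $y = 1$) and `h1` (sample share with $y = 1$) are
assumptions for illustration. For a logit model the log-odds in the sample differ from
those in the population by the constant $\ln[h_1(1 - q_1)/((1 - h_1)q_1)]$, whatever the
value of $x$.

```python
import math


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def sample_prob(x, a, b, q1, h1):
    """Pr^s[y = 1 | x] when f(y|x) is logit and each y is reweighted by h^s(y)/h(y)."""
    p = logistic(a + b * x)
    num = p * h1 / q1
    den = num + (1.0 - p) * (1.0 - h1) / (1.0 - q1)
    return num / den


def log_odds(p):
    return math.log(p / (1.0 - p))


# Hypothetical values: 10% of the population chooses y = 1, but the sample is
# designed so that half of the observations have y = 1.
a, b, q1, h1 = -2.0, 1.0, 0.10, 0.50
shift = math.log(h1 * (1.0 - q1) / ((1.0 - h1) * q1))
gaps = [log_odds(sample_prob(x, a, b, q1, h1)) - (a + b * x) for x in (-1.0, 0.0, 2.0)]
```

Each entry of `gaps` equals `shift`, so in this special case only the intercept is
distorted. For other models the distortion generally depends on $x$.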


24.4.4.  Endogenously  Stratified  Samples 


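We now consider the stratification schemes introduced in Section 24.4.1. The
population density is

$$g(y, \mathbf{x}|\boldsymbol{\theta}) = f(y|\mathbf{x}, \boldsymbol{\theta})\, h(\mathbf{x}).$$

There are $S$ strata where the $s$th stratum is the subset $C_s$ of
$\mathcal{Y} \times \mathcal{X}$.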

An important distinction is made between the population probability of an observation
being in $C_s$ and the probability of sampling from $C_s$, as the two differ in a
stratified sampling scheme. We define

$$H_s = \Pr[\text{Draw an observation from } C_s], \tag{24.14}$$

$$Q_s(\boldsymbol{\theta}) = \Pr[\text{A randomly drawn observation from the population is in } C_s].$$

Here $H_s$ is set by the sample design, whereas

$$Q_s(\boldsymbol{\theta}) = \int_{C_s} f(y|\mathbf{x}, \boldsymbol{\theta})\, h(\mathbf{x})\, dy\, d\mathbf{x}. \tag{24.15}$$

The strata probabilities may or may not be known. A stratum is oversampled if
$H_s > Q_s$.

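We begin by obtaining the joint density of $s$, $y$, and $\mathbf{x}$, where $s$ is an
indicator for the stratum from which the observation was obtained. In the population

$$g(s, y, \mathbf{x}|\boldsymbol{\theta}) = Q_s(\boldsymbol{\theta})\, g(y, \mathbf{x}|s, \boldsymbol{\theta}).$$

In the sample, the marginal distribution of the strata indicator differs from $Q_s$, and

$$g^s(s, y, \mathbf{x}|\boldsymbol{\theta}) = H_s\, g(y, \mathbf{x}|s, \boldsymbol{\theta}) = H_s\, \frac{f(y|\mathbf{x}, \boldsymbol{\theta})\, h(\mathbf{x})}{Q_s(\boldsymbol{\theta})},$$

where the second equality holds as $g(y, \mathbf{x}|s)$ equals the density
$g(y, \mathbf{x}) = f(y|\mathbf{x})h(\mathbf{x})$ divided by the population probability
of being in stratum $s$ so that the density integrates over $C_s$ to one.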

It follows that the joint density is

$$g^s(s, y, \mathbf{x}|\boldsymbol{\theta}) = \frac{H_s}{Q_s(\boldsymbol{\theta})}\, f(y|\mathbf{x}, \boldsymbol{\theta})\, h(\mathbf{x}), \tag{24.16}$$

where $Q_s(\boldsymbol{\theta})$ is defined in (24.15). The conditional MLE based on the
population conditional density $f(y|\mathbf{x}, \boldsymbol{\theta})$ will be
inconsistent for $\boldsymbol{\theta}$ since it ignores the term
$Q_s(\boldsymbol{\theta})$, which depends on $\boldsymbol{\theta}$.

A variety of consistent estimators have been proposed. Here we consider maximum
likelihood estimation, GMM estimation, and a much simpler weighted estimator that
can be implemented provided both strata sampling probabilities $H_s$ and population
probabilities $Q_s(\boldsymbol{\theta})$ are known.

Maximum  Likelihood  Estimation 

Performing an ML estimation based on the joint density
$g^s(s, y, \mathbf{x}|\boldsymbol{\theta})$ in (24.16) is complicated because from
(24.15) the distribution of $Q_s(\boldsymbol{\theta})$ depends on $h(\mathbf{x})$. One
possible solution is to specify the density $h(\mathbf{x})$. This approach is not taken
because econometricians shy away from specifying the distribution of regressors, even if
there is a willingness to specify the conditional distribution of the dependent variable.

Instead, a semiparametric approach is taken, with the goal of estimating the parameters
of the specified density $f(y|\mathbf{x}, \boldsymbol{\theta})$, for an unspecified
density $h(\mathbf{x})$. For simplicity assume the sample strata probabilities $H_s$ are
known. Cosslett (1981a) obtained the MLE with endogenous stratification by first letting
$\mathbf{x}$ be discrete with $\mathbf{x}_i$ occurring with probability $w_i$, and
maximizing the joint likelihood with respect to $\boldsymbol{\theta}$ and $w_i$,
$i = 1, \ldots, N$. The first-order conditions can be collapsed to yield a concentrated
likelihood that involves only $(q + S - 1)$ parameters $\boldsymbol{\theta}$ and
functions $\lambda_s(\boldsymbol{\theta})$, $s = 1, \ldots, S - 1$. Second, maximizing
this concentrated likelihood with respect to $\boldsymbol{\theta}$ and $\lambda_s$ yields
the same estimates as maximization with respect to $\boldsymbol{\theta}$ and
$\lambda_s(\boldsymbol{\theta})$. Third, since it is valid to treat $\lambda_s$ as a
parameter the same procedure can be used for the case of continuous regressors. A
problem of dimension $q$ plus infinite-dimensional unknown density $h(\mathbf{x})$ has
been reduced to $q + S - 1$ dimensions.

GMM  Estimation 

The remarkable results of Cosslett (1981a) are difficult to implement.

Imbens (1992) devised a simpler GMM estimator with endogenous stratification
that has the same efficiency as Cosslett's MLE. A quite general framework and
presentation of this estimator is given by Imbens and Lancaster (1996), for stratified
samples obtained by multinomial sampling, standard stratified sampling, or variable
probability sampling. The joint density is again
$g^s(s, y, \mathbf{x}|\boldsymbol{\theta})$ in (24.16) and the sample strata
probabilities $H_s$ are permitted to be possibly unknown. The GMM analysis is based on
$S - 1$ equations for the score of $H_s$, $q$ equations for $\boldsymbol{\theta}$ based on
the conditional likelihood function of $y$ given $s$ and $\mathbf{x}$, $S - 1$ equations
for the restrictions on the population strata probabilities $Q_s(\boldsymbol{\theta})$,
and a final restriction that is not necessary if there is a linear restriction on the
$Q_s(\boldsymbol{\theta})$, which happens, for example, if the strata are mutually
exclusive and cover the sample space.

24.4.5.  Weighted  Estimation 

Endogenous stratification is easily dealt with when the sample and population strata
probabilities, $H_s$ and $Q_s(\boldsymbol{\theta})$ defined in (24.14), are known, though the estimator is
not fully efficient. We begin with ML estimation before considering more general estimators.


Weighted ML Estimation

Manski and Lerman (1977) proposed the weighted maximum likelihood (WML) estimator. This
maximizes

$$Q_{\mathrm{WML}}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \frac{Q_i}{H_i} \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}), \tag{24.17}$$

where $H_i = H_s$ and $Q_i = Q_s$ if the $i$th observation is in stratum $s$.

Manski and Lerman (1977) called this estimator the weighted exogenous sampling maximum
likelihood (WESML) estimator, since (24.17) multiplies the usual term
$\ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$ in the conditional likelihood under exogenous sampling by the
weight $Q_i / H_i$. However, the designation WESML can lead to confusion as the problem here is one
of endogeneity; it just turns out that appropriately weighting the usual exogenous estimator
leads to consistent estimation.

Along similar lines, the objective function $Q_{\mathrm{WML}}(\boldsymbol{\theta})$ is not formally a
likelihood, since (24.16) does not imply that the sample conditional density of $y$ given
$\mathbf{x}$ and $s$ is given by
$f^s(y \mid \mathbf{x}, \boldsymbol{\theta}) = f(y \mid \mathbf{x}, \boldsymbol{\theta})^{Q_s / H_s}$. Nonetheless, the WML
estimator is consistent. The WML estimator solves the first-order conditions

$$\sum_{i=1}^{N} \frac{Q_i}{H_i} \frac{\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathbf{0}. \tag{24.18}$$


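This estimator is consistent if the terms in the sum have zero expected value, where
expectation is with respect to the sampling density $g^s(s, y, \mathbf{x} \mid \boldsymbol{\theta})$ in (24.16). Now

$$
\begin{aligned}
\mathrm{E}_s\!\left[ \frac{Q_s}{H_s} \frac{\partial \ln f(y \mid \mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right]
&= \int\!\!\int \frac{Q_s}{H_s} \frac{\partial \ln f(y \mid \mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
   \, \frac{H_s}{Q_s(\boldsymbol{\theta})} f(y \mid \mathbf{x}, \boldsymbol{\theta}) h(\mathbf{x}) \, dy \, d\mathbf{x} \\
&= \int\!\!\int \frac{\partial \ln f(y \mid \mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
   f(y \mid \mathbf{x}, \boldsymbol{\theta}) h(\mathbf{x}) \, dy \, d\mathbf{x} \\
&= \int \mathrm{E}\!\left[ \frac{\partial \ln f(y \mid \mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right] h(\mathbf{x}) \, d\mathbf{x} \\
&= \mathbf{0},
\end{aligned}
\tag{24.19}
$$

under the usual regularity condition that in the population the specified density satisfies
$\mathrm{E}[\partial \ln f(y \mid \mathbf{x}, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}] = \mathbf{0}$. So the WML estimator is
consistent in the presence of endogenous stratification.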
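
The consistency argument in (24.18) and (24.19) can be illustrated by simulation. The following is a minimal sketch, not a method taken from the text: it assumes a logit model for $f(y \mid \mathbf{x}, \boldsymbol{\theta})$, strata defined by the outcome $y$ (choice-based sampling), and a hypothetical population with intercept $-2$ and slope $1$. The function name `fit_weighted_logit` and all numerical settings are illustrative assumptions.

```python
import numpy as np

def fit_weighted_logit(X, y, w, n_iter=50):
    """Maximize sum_i w_i * ln f(y_i | x_i, beta) for a logit model by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - p))                       # weighted first-order conditions, cf. (24.18)
        neg_hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(neg_hess, score)
    return beta

rng = np.random.default_rng(0)

# Hypothetical population: logit with intercept -2 and slope 1 (assumed values).
n_pop = 400_000
x_pop = rng.normal(size=n_pop)
p_pop = 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * x_pop)))
y_pop = (rng.uniform(size=n_pop) < p_pop).astype(float)

# Population strata shares Q_s, with strata defined by y and treated as known.
Q1 = y_pop.mean()
Q0 = 1.0 - Q1

# Endogenously stratified sample: equal numbers drawn from each stratum, so H_1 = H_0 = 0.5.
n_s = 10_000
idx1 = rng.choice(np.flatnonzero(y_pop == 1.0), size=n_s, replace=False)
idx0 = rng.choice(np.flatnonzero(y_pop == 0.0), size=n_s, replace=False)
idx = np.concatenate([idx1, idx0])
y = y_pop[idx]
X = np.column_stack([np.ones(idx.size), x_pop[idx]])
H1 = H0 = 0.5

w = np.where(y == 1.0, Q1 / H1, Q0 / H0)                  # weights Q_i / H_i
beta_wml = fit_weighted_logit(X, y, w)
beta_unweighted = fit_weighted_logit(X, y, np.ones_like(y))
print("WML:       ", beta_wml)
print("Unweighted:", beta_unweighted)
```

In this design the unweighted intercept is far from the population value while the weighted estimates are close to $(-2, 1)$. That the unweighted slope is also close is a special feature of the logit model with outcome-based strata and should not be expected in general.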

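The information matrix equality does not hold for objective function
$Q_{\mathrm{WML}}(\boldsymbol{\theta})$ in (24.17), so we need to use the sandwich form
$N^{-1} \mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}$ for the asymptotic variance of
$\hat{\boldsymbol{\theta}}_{\mathrm{WML}}$, where

$$\mathbf{A}(\boldsymbol{\theta}_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \frac{Q_i}{H_i}
\left. \frac{\partial^2 \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta} \, \partial \boldsymbol{\theta}'} \right|_{\boldsymbol{\theta}_0} \tag{24.20}$$

and

$$\mathbf{B}(\boldsymbol{\theta}_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \left( \frac{Q_i}{H_i} \right)^{2}
\left. \frac{\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
\frac{\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'} \right|_{\boldsymbol{\theta}_0}. \tag{24.21}$$

This estimator is less efficient than the MLE of Cosslett or the GMM estimator of Imbens, but
it is relatively straightforward to implement. It does, of course, presume knowledge of the
strata probabilities.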
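
A sample analogue of the sandwich form in (24.20) and (24.21) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: it again takes $f$ to be a logit density, and the helper name `wml_logit_sandwich` is hypothetical. The matrix `A` below is the weighted average of the negative Hessian; the sign is immaterial because $\mathbf{A}$ enters as $\mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}$.

```python
import numpy as np

def wml_logit_sandwich(X, y, w, beta):
    """Estimate N^{-1} A^{-1} B A^{-1} for a weighted logit, following (24.20)-(24.21)."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    # A: weighted average of the negative Hessian of ln f (sign cancels in the sandwich).
    A = (X * (w * p * (1.0 - p))[:, None]).T @ X / n
    # B: average outer product of the scores, with the weights squared.
    scores = X * (w * (y - p))[:, None]
    B = scores.T @ scores / n
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv / n

# Tiny intercept-only check: beta is the logit of the sample mean 0.75.
X = np.ones((4, 1))
y = np.array([1.0, 0.0, 1.0, 1.0])
beta = np.array([np.log(3.0)])
V_unit = wml_logit_sandwich(X, y, np.ones(4), beta)
V_scaled = wml_logit_sandwich(X, y, 2.0 * np.ones(4), beta)
print(V_unit, V_scaled)
```

Rescaling all weights by a common constant leaves the sandwich estimate unchanged, since $\mathbf{A}$ scales linearly and $\mathbf{B}$ quadratically in the weights.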


Weighted  m-Estimation 

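The weighting approach of the WML estimator can be applied to estimators other than
conditional ML. For example, Hausman and Wise (1979) consider similar weighted estimation for
least-squares regression.

Thus suppose with SRS we would minimize $\sum_i q(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$, with first-order
conditions $\sum_i \partial q(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) / \partial \boldsymbol{\theta} = \mathbf{0}$, and suppose in
the population that

$$\mathrm{E}[\partial q(y \mid \mathbf{x}, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}] = \mathbf{0},$$

a necessary condition for consistency. If sampling is instead endogenously stratified as in
Section 24.2 and the sample and population strata probabilities $H_s$ and $Q_s$ are known, then
$\boldsymbol{\theta}$ is consistently estimated by the weighted m-estimator $\hat{\boldsymbol{\theta}}_W$ that
minimizes

$$Q_W(\boldsymbol{\theta}) = \sum_{i=1}^{N} \frac{Q_i}{H_i} q(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}). \tag{24.22}$$

The proof of consistency follows (24.18) and (24.19) for the WML estimator and the variance
matrix is of the form $N^{-1} \mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}$, where $\mathbf{A}$ and $\mathbf{B}$ are given
in (24.20) and (24.21) with the sole change being replacement of
$\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ by
$\partial q(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$. Wooldridge (2001) provides a formal proof.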

Similarly, for estimation based on the $q$ population moment conditions

$$\mathrm{E}[\mathbf{h}(y, \mathbf{x}, \boldsymbol{\theta})] = \mathbf{0},$$

under endogenous stratification, use the weighted estimating equations estimator that solves

$$\sum_{i=1}^{N} \frac{Q_i}{H_i} \mathbf{h}(y_i, \mathbf{x}_i, \boldsymbol{\theta}) = \mathbf{0}.$$

The weighted MLE results apply with
$\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ replaced by
$\mathbf{h}(y_i, \mathbf{x}_i, \boldsymbol{\theta})$.

Note that the weights $Q_i / H_i$ are the same as those proposed in Section 24.3.2 for
estimation of the census parameter under simple exogenous stratified sampling. The motivation,
however, is quite different. In the current section it is assumed that conditional moments are
correctly specified so that with exogenous stratified sampling it would be consistent and
efficient to do unweighted estimation. The weights become necessary if stratification is
endogenous.
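
For least squares, the weighted m-estimator of (24.22) uses $q$ equal to the squared residual. The sketch below is a simulated illustration under assumed values (a hypothetical population $y = 1 + 2x + \varepsilon$, two strata defined by whether $y > 1$, and the upper stratum oversampled at 90%). The function name `weighted_ls` is an assumption, not a routine named in the text.

```python
import numpy as np

def weighted_ls(X, y, w):
    """Minimize sum_i w_i (y_i - x_i'b)^2, i.e. (24.22) with q equal to the squared residual."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(1)

# Hypothetical population regression with intercept 1 and slope 2.
n_pop = 200_000
x_pop = rng.normal(size=n_pop)
y_pop = 1.0 + 2.0 * x_pop + rng.normal(size=n_pop)

# Strata depend on the outcome, so stratification is endogenous.
high = y_pop > 1.0
Q_hi = high.mean()
Q_lo = 1.0 - Q_hi

# Sample 9,000 observations from the upper stratum and 1,000 from the lower one.
n_hi, n_lo = 9_000, 1_000
H_hi = n_hi / (n_hi + n_lo)
H_lo = n_lo / (n_hi + n_lo)
idx = np.concatenate([
    rng.choice(np.flatnonzero(high), size=n_hi, replace=False),
    rng.choice(np.flatnonzero(~high), size=n_lo, replace=False),
])
y = y_pop[idx]
X = np.column_stack([np.ones(idx.size), x_pop[idx]])

w = np.where(high[idx], Q_hi / H_hi, Q_lo / H_lo)         # weights Q_i / H_i
b_weighted = weighted_ls(X, y, w)
b_unweighted = weighted_ls(X, y, np.ones(idx.size))
print("weighted:  ", b_weighted)
print("unweighted:", b_unweighted)
```

In this design the unweighted intercept is pushed upward by the oversampling of large outcomes, whereas the weighted estimates stay close to the population values.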


24.5.  Clustering 

Sections 24.3 and 24.4 on weighting and stratification covered methods to control for a
survey design that leads to a sample distribution that differs from the population distribution.
The assumption of independence of sampled observations was maintained.

In fact survey data are usually dependent. This may be due to use of clustered samples to
reduce survey costs, such as interviewing several households on the same block. In such cases
the data may be correlated within a cluster owing to the presence of a common unobserved
cluster-specific term. Such dependence may also arise, however, even with SRS. For example, it
may be felt that there is an unobservable effect common to all households in the same state.

There are several different methods for controlling for dependence on unobservables within a
cluster. If the within-cluster unobservables are uncorrelated with regressors then only the
variances of the regression parameter estimates need to be adjusted. If instead the
within-cluster unobservables are correlated with regressors then the regression parameter
estimates are inconsistent and suitable alternative estimators are needed. The analysis is
further complicated because methods may also vary according to whether there are many small
clusters or few large clusters. Additional complex survey complications such as weighting and
stratification are deferred to Section 24.6.

The  notation  and  models  are  presented  next,  with  the  key  distinction  being  between 
random  cluster  effects  and  fixed  cluster  effects,  similar  to  panel  data  analysis.  The 
various  estimators  are  presented  in  subsequent  sections. 


24.5.1.  Cluster-Specific  Effects  Models 

Interest lies in estimation of a linear regression model given data $(y_i, \mathbf{x}_i)$,
$i = 1, \ldots, N$, where $i$ denotes the $i$th sample observation, such as a household.

The concern is that some aspects of the population regression model vary by cluster $c$,
$c = 1, \ldots, C$. Suppose the $i$th household in the overall sample is the $j$th household in the
$c$th sampled cluster. A quite general model for clustered data is

$$y_{jc} = \mathbf{x}_{jc}' \boldsymbol{\beta}_c + u_{jc}, \quad j = 1, \ldots, N_c, \quad c = 1, \ldots, C, \tag{24.23}$$

where $\mathrm{Cov}[u_{jc}, u_{kc}] \neq 0$ though $\mathrm{Cov}[u_{jc}, u_{kd}] = 0$ for $c \neq d$. This model
incorporates cluster dependence through both regression parameters that vary across clusters
and errors that are correlated within a cluster.

Here we focus on a special case, the cluster-specific effects model

$$y_{jc} = \mathbf{x}_{jc}' \boldsymbol{\beta} + \alpha_c + \varepsilon_{jc}. \tag{24.24}$$

Here just the regression intercept $\alpha_c$ varies across clusters, whereas the slope
coefficients are assumed to be constant across clusters. In the simplest model
$\varepsilon_{jc}$ is assumed to be homoskedastic,

$$\varepsilon_{jc} \sim [0, \sigma_\varepsilon^2], \tag{24.25}$$

an assumption that can be relaxed to permit heteroskedasticity and correlation within a
cluster. More substantively, different assumptions on $\alpha_c$ lead to two quite different
models, which we now present.


Cluster-Specific Random Effects

In the cluster-specific random effects (CSRE) model the intercepts $\alpha_c$ in (24.24) are
purely random with distribution that does not depend on any observables. In the simplest case it
is assumed that

$$\alpha_c \sim [0, \sigma_\alpha^2]. \tag{24.26}$$

This model is directly analogous to the random effects model for panel data. The model is
just a linear regression of $y_{jc}$ on $\mathbf{x}_{jc}$, with the complication that the error term
$\alpha_c + \varepsilon_{jc}$ is correlated for observations in the same cluster. OLS estimation is
consistent but inefficient. Importantly, the correlation of errors makes it necessary to adjust
the usual standard errors of the OLS estimator. GLS estimation is more efficient.

Given assumptions (24.25) and (24.26) on $\varepsilon_{jc}$ and $\alpha_c$,
$\mathrm{V}[\alpha_c + \varepsilon_{jc}] = \sigma_\alpha^2 + \sigma_\varepsilon^2$ and
$\mathrm{Cov}[\alpha_c + \varepsilon_{jc}, \alpha_c + \varepsilon_{kc}] = \sigma_\alpha^2$, for $k \neq j$. We define the
intraclass correlation coefficient

$$\rho = \mathrm{Cor}[\alpha_c + \varepsilon_{jc}, \alpha_c + \varepsilon_{kc}] = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}. \tag{24.27}$$

There is a one-to-one correspondence between $(\sigma_\alpha^2, \sigma_\varepsilon^2)$ and $(\sigma^2, \rho)$,
where $\rho$ is defined in (24.27) and $\sigma^2 = \sigma_\alpha^2 + \sigma_\varepsilon^2$. The CSRE model is
equivalent to a model with constant intraclass correlation coefficient. The model can also be
given a Bayesian interpretation, viewing each observation as having its own intercept
$\alpha_{jc}$ that is a draw from a univariate distribution and appealing to the exchangeability
criterion that the subscript in $\alpha_{jc}$ is a purely labeling device and has no substantive
consequences. In all cases clustering has the expected effect of inducing positive correlation
between error terms within a cluster.
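
The correspondence between the two parameterizations of the CSRE error structure can be checked numerically. The sketch below is illustrative only: the function names `csre_error_cov` and `intraclass_corr` and the variance values are assumptions. It builds the within-cluster covariance matrix of $\alpha_c + \varepsilon_{jc}$ from $(\sigma_\alpha^2, \sigma_\varepsilon^2)$ and compares it with the equal-correlation form in terms of $(\sigma^2, \rho)$.

```python
import numpy as np

def csre_error_cov(n_c, sigma_alpha2, sigma_eps2):
    """Covariance matrix of u_jc = alpha_c + eps_jc within one cluster of size n_c."""
    return sigma_eps2 * np.eye(n_c) + sigma_alpha2 * np.ones((n_c, n_c))

def intraclass_corr(sigma_alpha2, sigma_eps2):
    """Intraclass correlation coefficient rho of (24.27)."""
    return sigma_alpha2 / (sigma_alpha2 + sigma_eps2)

# Assumed illustrative values.
sigma_alpha2, sigma_eps2, n_c = 1.0, 3.0, 4
V = csre_error_cov(n_c, sigma_alpha2, sigma_eps2)
rho = intraclass_corr(sigma_alpha2, sigma_eps2)
sigma2 = sigma_alpha2 + sigma_eps2

# Equivalent (sigma^2, rho) form: sigma^2 [(1 - rho) I + rho e e'].
V_alt = sigma2 * ((1.0 - rho) * np.eye(n_c) + rho * np.ones((n_c, n_c)))
print(rho, np.allclose(V, V_alt))
```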


Cluster-Specific  Fixed  Effects 

In the cluster-specific fixed effects (CSFE) model the intercepts $\alpha_c$ in (24.24) are
random unobservables, as for the CSRE model, but may possibly be correlated with the
regressors. For identification $\mathbf{x}_{jc}$ no longer includes an intercept term.

This model is directly analogous to the fixed effects model for panel data. The model has
conditional mean $\mathrm{E}[y_{jc} \mid \mathbf{x}_{jc}, \alpha_c] = \mathbf{x}_{jc}' \boldsymbol{\beta} + \alpha_c$. The OLS
estimator from regression of $y_{jc}$ on $\mathbf{x}_{jc}$ alone is inconsistent for $\boldsymbol{\beta}$ if the
omitted variable $\alpha_c$ is correlated with $\mathbf{x}_{jc}$. Consistent estimation of
$\boldsymbol{\beta}$ requires consistent estimation of $\alpha_c$, which is possible if the clusters are
large. If clusters are instead small the individual $\alpha_c$ need to be eliminated by a
differencing transformation.


Comparison  to  Panel  Data  Analysis 

The setup and terminology closely parallel those for static panel data analysis presented in
Chapters 21 to 23. At the same time there are some departures from panel data analysis.

In the panel case the individual unit of analysis, such as the household, is observed more
than once whereas in the cluster case the individual unit of analysis is observed only once. In
the panel notation $it$, the first subscript is the clustering unit if the panel is a short
panel, whereas in the clustering notation $jc$, the second subscript is the clustering unit. In
the panel case we focused on balanced panels, but clustered data are usually unbalanced as
$N_c$ varies across clusters.

Microeconometric methods for panel data focus on short panels. This is analogous to having
few observations per cluster and many clusters. Then $N_c$ is small and $C \to \infty$, which we
call small clusters. In addition, it is not unusual to have large clusters, with
$N_c \to \infty$ and $C$ small. For the CSFE model with large clusters there will only be a few
parameters $\alpha_c$ to estimate and the incidental parameters problem will not arise.

Unlike in panel data, the appropriate clustering unit may not always be clear. For example,
for the CPS data clustering could be viewed as arising within state, within strata, within PSU,
or within USU. This issue is deferred to Section 24.6. The intracluster correlation is expected
to decrease for clustering at more aggregate levels. If clustering is at the state level then
the clusters are large, whereas if clustering is viewed as being at the level of USU then the
clusters are small. Moreover, it is possible that a data set does not include necessary
clustering information, such as the strata or USU for an observation.

The analogue of dynamic, rather than static, panel data models is a model where $y_{jc}$
depends not only on $\mathbf{x}_{jc}$ but also on $\mathbf{x}_{kc}$, for $k \neq j$. For clustered data it is
usually sufficient to specify a peer-effects model that more simply includes just the cluster
average $\bar{\mathbf{x}}_c$, since the ordering of observations within a cluster usually does not
matter.


Overview 

The three common estimators for clustering are the OLS, the GLS, and the within estimators
presented in Sections 24.5.2-24.5.4. The properties of these estimators, summarized in
Table 24.2, vary with the true model. Most importantly, if the true model has cluster-specific
fixed effects then OLS and RE estimators are inconsistent, whereas the within estimator yields
consistent estimates but only for coefficients of regressors that vary within a cluster.
Secondly, even if an estimator is consistent the usual standard errors will often need to be
adjusted to control for clustering and possibly heteroskedasticity as detailed in the following.


Table 24.2. Properties of Estimators for Different Clustering Models

Section   Estimator                  Cluster Model    Consistent
24.5.2    OLS                        Random effects   Yes
                                     Fixed effects    No
24.5.3    GLS for random effects     Random effects   Yes
                                     Fixed effects    No
24.5.4    Within for fixed effects   Random effects   Yes
                                     Fixed effects    Yes


24.5.2.  OLS  Estimator

We consider the OLS regression

$$y_{jc} = \mathbf{x}_{jc}' \boldsymbol{\beta} + u_{jc}. \tag{24.28}$$

Ordinary LS is inconsistent because of omitted variables bias if the true model is the CSFE
model (i.e., $u_{jc} = \alpha_c + \varepsilon_{jc}$) with fixed effect $\alpha_c$ correlated with
$\mathbf{x}_{jc}$. Then the OLS estimator should not be used and instead the CSFE estimators of
Section 24.5.4 should be used.

In contrast, OLS is consistent in the CSRE model, where $\alpha_c$ is a random effect
uncorrelated with $\mathbf{x}_{jc}$. More generally, OLS is consistent under richer models for
$u_{jc}$ than the CSRE model, provided $u_{jc}$ is uncorrelated with $\mathbf{x}_{jc}$. We consider the
OLS estimator in this case, with focus on obtaining correct standard errors given correlation
of the error term $u_{jc}$ within a cluster.


Notation 

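Stacking observations in (24.28) within a cluster yields

$$\mathbf{y}_c = \mathbf{X}_c \boldsymbol{\beta} + \mathbf{u}_c, \tag{24.29}$$

where $\mathbf{y}_c$ and $\mathbf{u}_c$ are $N_c \times 1$ vectors and $\mathbf{X}_c$ is an $N_c \times K$ matrix.
Further stacking over clusters yields

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u}, \tag{24.30}$$

where $\mathbf{y}$ and $\mathbf{u}$ are $N \times 1$ vectors and $\mathbf{X}$ is an $N \times K$ matrix,
$N = \sum_c N_c$.

The three representations of the CSRE model lead to three equivalent ways of expressing the
OLS estimator of model (24.28),

$$
\begin{aligned}
\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}
&= (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y} \\
&= \left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1} \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{y}_c \\
&= \left( \sum_{c=1}^{C} \sum_{j=1}^{N_c} \mathbf{x}_{jc} \mathbf{x}_{jc}' \right)^{-1} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \mathbf{x}_{jc} y_{jc}.
\end{aligned}
\tag{24.31}
$$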

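The second of these representations is especially useful given the assumption of independence
of errors across clusters. Then, as before in the panel case, the OLS estimator has limit
distribution

$$\sqrt{N} (\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}], \tag{24.32}$$

where

$$\mathbf{A} = \operatorname{plim} N^{-1} \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c, \qquad
\mathbf{B} = \operatorname{plim} N^{-1} \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{u}_c \mathbf{u}_c' \mathbf{X}_c, \tag{24.33}$$

using independence of $\mathbf{u}_c$ over $c$. Different assumptions on the properties of
$\mathbf{u}_c$ lead to different estimates of $\mathbf{B}$.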


OLS  Cluster-Robust  Standard  Errors 


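If clusters are small then there are many clusters and $\mathbf{B}$ in (24.33) can be
consistently estimated by replacing $\mathbf{u}_c$ by
$\hat{\mathbf{u}}_c = \mathbf{y}_c - \mathbf{X}_c \hat{\boldsymbol{\beta}}$. It follows that
$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ is asymptotically normally distributed with cluster-robust variance
matrix

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] =
\left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1}
\left( \sum_{c=1}^{C} \mathbf{X}_c' \hat{\mathbf{u}}_c \hat{\mathbf{u}}_c' \mathbf{X}_c \right)
\left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1}. \tag{24.34}$$

This formula places no restriction on heteroskedasticity and correlation within a cluster, as
$\mathrm{V}[\mathbf{u}_c]$ and hence $\mathrm{V}[u_{jc}]$ and $\mathrm{Cov}[u_{jc}, u_{kc}]$ are unrestricted. However,
it does assume that $N_c$ is small and $C \to \infty$. Statistical packages often give a
degrees-of-freedom correction. Typically one multiplies the estimate in (24.34) by

$$df_c = \frac{N - 1}{N - K} \times \frac{C}{C - 1},$$

which corrects for both estimation of $\boldsymbol{\beta}$ and the number of clusters in practice being
finite.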

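To see how (24.34) works, treat the regressors as fixed and note that

$$
\begin{aligned}
\mathbf{B} &= \lim N^{-1} \sum_{c=1}^{C} \mathbf{X}_c' \, \mathrm{E}[\mathbf{u}_c \mathbf{u}_c'] \, \mathbf{X}_c \\
&= \lim N^{-1} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \sum_{k=1}^{N_c} \mathrm{E}[u_{jc} u_{kc}] \, \mathbf{x}_{jc} \mathbf{x}_{kc}'.
\end{aligned}
$$

Then (24.34) is obtained using the estimate

$$
\begin{aligned}
\hat{\mathbf{B}} &= N^{-1} \sum_{c=1}^{C} \mathbf{X}_c' \hat{\mathbf{u}}_c \hat{\mathbf{u}}_c' \mathbf{X}_c \\
&= N^{-1} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \sum_{k=1}^{N_c} \hat{u}_{jc} \hat{u}_{kc} \, \mathbf{x}_{jc} \mathbf{x}_{kc}'.
\end{aligned}
$$

For example, consider estimation of $\mathrm{E}[y]$ by $\bar{y}$. This is the regression (24.28) with
$x_{jc} = 1$, $\hat{\beta}_{\mathrm{OLS}} = \bar{y}$, and $\hat{u}_{jc} = y_{jc} - \bar{y}$. Then (24.34) leads to
$\hat{\mathrm{V}}[\bar{y}] = N^{-2} \sum_c \bigl( \sum_j (y_{jc} - \bar{y}) \bigr)^2$, compared to the estimate
$N^{-2} \sum_c \sum_j (y_{jc} - \bar{y})^2$, which additionally assumes independence within clusters.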
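
The cluster-robust formula (24.34) and the sample-mean example can be sketched in a few lines. The following is a minimal illustration with made-up data; the function name `ols_cluster_robust` and its `df_correction` flag are assumptions for this example and do not refer to any particular statistical package.

```python
import numpy as np

def ols_cluster_robust(X, y, cluster, df_correction=False):
    """OLS coefficients and the cluster-robust variance matrix of (24.34)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    labels = np.unique(cluster)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in labels:
        rows = cluster == c
        s_c = X[rows].T @ resid[rows]          # X_c' u_c-hat for cluster c
        meat += np.outer(s_c, s_c)
    V = XtX_inv @ meat @ XtX_inv
    if df_correction:
        n, k = X.shape
        g = labels.size
        V = V * (n - 1) / (n - k) * g / (g - 1)
    return beta, V

# Intercept-only regression, so the OLS coefficient is the sample mean.
y = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 5.0])
cluster = np.array([0, 0, 0, 1, 1, 2])
X = np.ones((y.size, 1))

beta, V_cluster = ols_cluster_robust(X, y, cluster)
_, V_indep = ols_cluster_robust(X, y, np.arange(y.size))   # every observation its own cluster
_, V_df = ols_cluster_robust(X, y, cluster, df_correction=True)
print(beta, V_cluster, V_indep, V_df)
```

With each observation treated as its own cluster the formula reduces to the estimate that assumes independence within clusters, which here is smaller than the clustered estimate because the deviations from the mean have the same sign within the first two clusters.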


OLS  Standard  Errors  Assuming  the  CSRE  Model 

The cluster-robust estimates (24.34) require many clusters. Alternative estimates that also
apply to the case of few clusters can be used if assumptions are made about the variances and
covariances of the model error $u_{jc}$. These alternative estimates also permit analytical
results regarding the impact of clustering on estimator variances.

In particular, assume that the CSRE model given by (24.24) to (24.26) is appropriate. Then the
error $u_{jc} = \alpha_c + \varepsilon_{jc}$ is independent over $c$ and within a cluster

$$\mathrm{Cov}[u_{jc}, u_{kc}] =
\begin{cases}
\sigma^2, & j = k, \\
\rho \sigma^2, & j \neq k,
\end{cases}$$

where the intraclass correlation coefficient $\rho$ is defined in (24.27). It follows that

$$\boldsymbol{\Sigma}_c = \mathrm{V}[\mathbf{u}_c] = \sigma^2 [(1 - \rho) \mathbf{I}_c + \rho \mathbf{e}_c \mathbf{e}_c'], \tag{24.35}$$

where $\mathbf{I}_c$ is an $N_c \times N_c$ identity matrix and $\mathbf{e}_c$ is an $N_c \times 1$ vector of ones.
Given $\boldsymbol{\Sigma}_c$ in (24.35), the general result (24.32) to (24.33) yields

$$\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] =
\left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1}
\left( \sum_{c=1}^{C} \sigma^2 \mathbf{X}_c' [(1 - \rho) \mathbf{I}_c + \rho \mathbf{e}_c \mathbf{e}_c'] \mathbf{X}_c \right)
\left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1}. \tag{24.36}$$

Provided the intraclass correlation coefficient is constant, this variance matrix estimator
is consistent in both the small- and large-cluster cases. Obvious estimators for $\sigma^2$ and
$\rho$ are

$$\hat{\sigma}^2 = \frac{1}{N - K} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \hat{u}_{jc}^2$$

and

$$\hat{\rho} = \frac{1}{\sum_c N_c (N_c - 1) \hat{\sigma}^2} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \sum_{k \neq j} \hat{u}_{jc} \hat{u}_{kc}.$$

The estimate of $\rho$ involves many intracluster pairs and a consistent estimate can be
obtained using just a subset of these. As written $\sum_c N_c (N_c - 1)$ pairs are used, though in
fact each unique within-cluster pair is double counted as both $\hat{u}_{jc} \hat{u}_{kc}$ and
$\hat{u}_{kc} \hat{u}_{jc}$ appear in the summations.
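
The estimators of $\sigma^2$ and $\rho$ given after (24.36) can be computed without an explicit double loop over pairs, since within a cluster the sum of cross-products over $j \neq k$ equals the squared sum of residuals minus the sum of squared residuals. The sketch below rests on that identity; the function name `csre_sigma2_rho` and the toy residuals are assumptions for illustration, and the code presumes at least one cluster has more than one observation.

```python
import numpy as np

def csre_sigma2_rho(resid, cluster, k):
    """Estimate sigma^2 and the intraclass correlation rho from OLS residuals.

    resid   : OLS residuals u_jc-hat, stacked over clusters
    cluster : cluster label for each residual
    k       : number of regressors K used in the OLS fit
    """
    n = resid.size
    sigma2 = resid @ resid / (n - k)
    cross = 0.0      # sum over clusters of sum_{j != k} u_jc u_kc (each pair counted twice)
    n_pairs = 0      # sum over clusters of N_c (N_c - 1)
    for c in np.unique(cluster):
        u = resid[cluster == c]
        cross += u.sum() ** 2 - u @ u
        n_pairs += u.size * (u.size - 1)
    rho = cross / (n_pairs * sigma2)
    return sigma2, rho

# Toy residuals: identical within clusters in the first case, opposite in the second.
resid = np.array([1.0, 1.0, -1.0, -1.0])
s2_same, rho_same = csre_sigma2_rho(resid, np.array([0, 0, 1, 1]), k=0)
s2_opp, rho_opp = csre_sigma2_rho(resid, np.array([0, 1, 0, 1]), k=0)
print(s2_same, rho_same, s2_opp, rho_opp)
```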

If the clusters are large the intracluster correlation can be permitted to vary across
clusters. Then (24.35) and (24.36) can be amended to replace $\sigma^2$ and $\rho$ by
$\sigma_c^2$ and $\rho_c$, respectively. These can be consistently estimated by

$$\hat{\sigma}_c^2 = \frac{1}{N_c} \sum_{j=1}^{N_c} \hat{u}_{jc}^2$$

and

$$\hat{\rho}_c = \frac{1}{N_c (N_c - 1) \hat{\sigma}_c^2} \sum_{j=1}^{N_c} \sum_{k \neq j} \hat{u}_{jc} \hat{u}_{kc}.$$


Bias  of  Usual  OLS  Standard  Errors

If data are clustered, then intuitively the usual formula variance estimator for the OLS
estimator,

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = \hat{\sigma}^2 \left( \sum_{c=1}^{C} \mathbf{X}_c' \mathbf{X}_c \right)^{-1},$$

underestimates the true variance matrix of the OLS estimator, assuming positive within-cluster
correlation, since each additional observation within a cluster will provide less than one
additional piece of independent information. We demonstrate this bias when the error process is
that of the CSRE model.

Consider the CSRE model with the same regressors within each cluster, so
$\mathbf{x}_{jc} = \mathbf{x}_c$ and $\mathbf{X}_c = \mathbf{e}_c \mathbf{x}_c'$. Then by using
$\mathbf{e}_c' \mathbf{e}_c = N_c$, (24.36) becomes

$$\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] =
\left( \sum_{c=1}^{C} N_c \mathbf{x}_c \mathbf{x}_c' \right)^{-1}
\left( \sum_{c=1}^{C} \sigma^2 N_c [1 + \rho (N_c - 1)] \mathbf{x}_c \mathbf{x}_c' \right)
\left( \sum_{c=1}^{C} N_c \mathbf{x}_c \mathbf{x}_c' \right)^{-1},$$

a result presented by Kloek (1981) and Moulton (1986).

Now specialize to balanced clusters, and define $M$ to be the average cluster size, so
$M = N_c = N / C$ is constant. Then the variance simplifies to

$$\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = [1 + \rho (M - 1)] \times \sigma^2 \left( M \sum_{c=1}^{C} \mathbf{x}_c \mathbf{x}_c' \right)^{-1},$$

whereas the formula variance simplifies to
$\sigma^2 \bigl( M \sum_c \mathbf{x}_c \mathbf{x}_c' \bigr)^{-1}$. It follows that the true variances are a multiple

$$\tau = [1 + \rho (M - 1)]$$

times the usual OLS variance matrix estimate. Even if $\rho$ is small the correction factor can
be quite large. For example, if the average cluster size is $M = 101$ observations, then the
usual OLS standard errors should be multiplied by $\sqrt{1 + 100 \rho}$. The assumed
independence within a cluster will also lead to a biased estimate of $\sigma^2$, but this is of
second-order importance. In the balanced-cluster case Kloek shows that
$\mathrm{E}[\sum_c \sum_j \hat{u}_{jc}^2] = \sigma^2 [N - K (1 + \rho (M - 1))]$ so we should normalize by
$[N - K (1 + \rho (M - 1))]^{-1}$ rather than $[N - K]^{-1}$.
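
The balanced-cluster result can be verified numerically: with regressors that are constant within each cluster, the CSRE variance (24.36) equals $1 + \rho (M - 1)$ times the usual formula variance. The sketch below uses arbitrary assumed values for $C$, $M$, $\sigma^2$, and $\rho$ and a hypothetical helper `csre_ols_variance`; it is a check of the algebra, not an estimator.

```python
import numpy as np

def csre_ols_variance(X_list, sigma2, rho):
    """Variance of OLS under the CSRE error structure, as in (24.36)."""
    k = X_list[0].shape[1]
    bread = np.zeros((k, k))
    meat = np.zeros((k, k))
    for Xc in X_list:
        n_c = Xc.shape[0]
        Sigma_c = sigma2 * ((1.0 - rho) * np.eye(n_c) + rho * np.ones((n_c, n_c)))
        bread += Xc.T @ Xc
        meat += Xc.T @ Sigma_c @ Xc
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

rng = np.random.default_rng(2)
C, M, sigma2, rho = 8, 5, 2.0, 0.3       # assumed illustrative values

# Cluster-invariant regressors: X_c = e_c x_c', with an intercept and one regressor.
X_list = []
for _ in range(C):
    x_c = np.array([1.0, rng.normal()])
    X_list.append(np.tile(x_c, (M, 1)))

V_true = csre_ols_variance(X_list, sigma2, rho)
X = np.vstack(X_list)
V_formula = sigma2 * np.linalg.inv(X.T @ X)   # usual OLS formula variance
tau = 1.0 + rho * (M - 1)
print(tau, np.allclose(V_true, tau * V_formula))
```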

In practice some regressors may be constant within a cluster and others may vary. Then in
the case of regression with intercept and scalar regressor (i.e.,
$\mathbf{x}_{jc}' \boldsymbol{\beta} = \beta_1 + \beta_2 x_{jc}$) Scott and Holt (1982) show that the usual OLS
formula variance for the intercept should be multiplied by $1 + \rho (M - 1)$ as done in the
preceding, but for the slope coefficient it should be multiplied by the smaller factor
$1 + \rho_x \rho (M - 1)$, where $\rho_x$ can be viewed as an estimate of the intraclass correlation
coefficient of the $x_{jc}$. In cross-section applications $\rho_x$ is relatively small, so the main
problem lies with standard errors for cluster-invariant regressors.

Moulton (1986) demonstrated in an application that the bias in standard errors using the
incorrect OLS formula variance can be quite appreciable. He estimated a log-wage equation using
cross-section CPS data where clustering was on states. For his application $N = 18,946$ and
$C = 49$. For his data the estimated intraclass correlation coefficient was
$\hat{\rho} = 0.032$, a seemingly small value. However, the clusters are large, and if we ignore
the data being unbalanced and as a guide use the preceding formulas with $M = 387$, the average
cluster size, then $\tau = [1 + \rho (M - 1)] \approx 13.3$. For state-invariant regressors the true OLS
standard errors are predicted to be $\sqrt{13.3} \approx 3.7$ times the usual reported standard
errors, a very large bias. (One way to view this is that for OLS estimation of the coefficients
of state-invariant regressors, the 18,946 clustered observations have the same precision as
$18,946 / 13.3 \approx 1,425$ independent observations.) For individual-varying regressors the bias
will be much smaller, for example, $[1 + \rho_x \rho (M - 1)] \approx 2.23$ if $\rho_x = 0.10$. Moulton does
not report results for the individual-varying regressors included as regressors. For the
state-invariant regressors, variables such as growth rate of employment in the state, the
cluster-corrected standard errors for OLS are generally between three and four times the
incorrect formula standard errors.
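
The arithmetic behind the Moulton example is easy to reproduce. The snippet below is only a calculator for the inflation factors quoted above, using the figures $N = 18,946$, $\hat{\rho} = 0.032$, and $M = 387$ from the text; the function name `moulton_factor` is an assumption. Unrounded, the factor is about 13.35, so the results match the quoted 13.3, 3.7, and 1,425 only up to rounding of intermediate values.

```python
import math

def moulton_factor(rho, m, rho_x=1.0):
    """Variance inflation 1 + rho_x * rho * (m - 1); rho_x = 1 for cluster-invariant regressors."""
    return 1.0 + rho_x * rho * (m - 1)

n, rho_hat, m = 18_946, 0.032, 387

tau = moulton_factor(rho_hat, m)                       # cluster-invariant regressors
se_ratio = math.sqrt(tau)                              # ratio of true to formula standard errors
effective_n = n / tau                                  # equivalent number of independent observations
tau_varying = moulton_factor(rho_hat, m, rho_x=0.10)   # individual-varying regressor with rho_x = 0.10

print(tau, se_ratio, effective_n, tau_varying)
```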

The lesson is that there can be great downward bias in the default OLS standard errors for
the OLS coefficients of cluster-invariant regressors. For individual-varying regressors there is
also bias, but it is much less. Cluster-invariant regressors are often included in applications
with clustered data, as it is common to model individual behavior as depending in part on
attributes of the cluster. Valid statistical inference requires obtaining standard errors that
control for clustering.


24.5.3.  Cluster-Specific  Random  Effects  Estimator 

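If a random effects model is appropriate then the GLS estimator is in general more efficient
than the OLS estimator of the previous section. Given independence across clusters the GLS
estimator of model (24.29) is

$$\hat{\boldsymbol{\beta}}_{\mathrm{GLS,RE}} =
\left( \sum_{c=1}^{C} \mathbf{X}_c' \boldsymbol{\Sigma}_c^{-1} \mathbf{X}_c \right)^{-1}
\sum_{c=1}^{C} \mathbf{X}_c' \boldsymbol{\Sigma}_c^{-1} \mathbf{y}_c, \tag{24.37}$$

where $\boldsymbol{\Sigma}_c = \mathrm{V}[\mathbf{u}_c]$. The feasible GLS estimator replaces $\boldsymbol{\Sigma}_c$ by a
consistent estimate $\hat{\boldsymbol{\Sigma}}_c$, and assuming correct specification of the model (24.29)
and error variance matrix $\boldsymbol{\Sigma}_c$, we have

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{GLS,RE}}] = \left( \sum_{c=1}^{C} \mathbf{X}_c' \hat{\boldsymbol{\Sigma}}_c^{-1} \mathbf{X}_c \right)^{-1}.$$

For the CSRE model, $\boldsymbol{\Sigma}_c$ given in (24.35) can be consistently estimated by
$\hat{\boldsymbol{\Sigma}}_c$, which replaces $\sigma^2$ and $\rho$ by the consistent estimates given after
(24.36). As in the similar random effects model for panel data, the feasible GLS estimator is
asymptotically equivalent to the MLE under the additional assumptions that $\alpha_c$ and
$\varepsilon_{jc}$ are normally distributed.

An attraction of the CSRE model is that the GLS estimator (24.37) can be simply implemented by
OLS estimation of the transformed regression

$$y_{jc} - \theta_c \bar{y}_c = (\mathbf{x}_{jc} - \theta_c \bar{\mathbf{x}}_c)' \boldsymbol{\beta} + (\varepsilon_{jc} - \theta_c \bar{\varepsilon}_c), \tag{24.38}$$

where

$$\theta_c = 1 - \frac{\sqrt{1 - \rho}}{\sqrt{1 + \rho (N_c - 1)}} = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + N_c \sigma_\alpha^2}}. \tag{24.39}$$

This result is proven later in this section. To implement it we replace $\theta_c$ by consistent
estimate $\hat{\theta}_c$. As for the panel data model, it can be shown that usual OLS standard errors
from this regression can be used if the errors $\varepsilon_{jc}$ in model (24.24) are
homoskedastic.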
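
The equivalence between GLS in (24.37) and OLS on the transformed data in (24.38) and (24.39) can be checked numerically for unbalanced clusters. The sketch below treats $\sigma^2$ and $\rho$ as known and uses arbitrary simulated data; the function names `gls_direct` and `gls_transformed`, the cluster sizes, and the coefficient values are all assumptions for illustration.

```python
import numpy as np

def gls_direct(X_list, y_list, sigma2, rho):
    """GLS estimator (24.37) with Sigma_c = sigma2 * [(1 - rho) I + rho e e']."""
    k = X_list[0].shape[1]
    lhs = np.zeros((k, k))
    rhs = np.zeros(k)
    for Xc, yc in zip(X_list, y_list):
        n_c = Xc.shape[0]
        Sigma_c = sigma2 * ((1.0 - rho) * np.eye(n_c) + rho * np.ones((n_c, n_c)))
        Sigma_inv = np.linalg.inv(Sigma_c)
        lhs += Xc.T @ Sigma_inv @ Xc
        rhs += Xc.T @ Sigma_inv @ yc
    return np.linalg.solve(lhs, rhs)

def gls_transformed(X_list, y_list, rho):
    """OLS on data with theta_c times the cluster mean subtracted, as in (24.38)-(24.39)."""
    Xt, yt = [], []
    for Xc, yc in zip(X_list, y_list):
        n_c = Xc.shape[0]
        theta_c = 1.0 - np.sqrt(1.0 - rho) / np.sqrt(1.0 + rho * (n_c - 1))
        Xt.append(Xc - theta_c * Xc.mean(axis=0))
        yt.append(yc - theta_c * yc.mean())
    return np.linalg.lstsq(np.vstack(Xt), np.concatenate(yt), rcond=None)[0]

rng = np.random.default_rng(3)
sizes = [2, 3, 5, 4, 6, 3]                         # unbalanced clusters (assumed)
X_list = [np.column_stack([np.ones(n), rng.normal(size=n)]) for n in sizes]
y_list = [Xc @ np.array([1.0, 0.5]) + rng.normal() + rng.normal(size=Xc.shape[0])
          for Xc in X_list]

b_direct = gls_direct(X_list, y_list, sigma2=2.0, rho=0.5)
b_transformed = gls_transformed(X_list, y_list, rho=0.5)
print(b_direct, b_transformed)
```

Only $\rho$ is needed for the transformation because the common scalar $1 / (\sigma \sqrt{1 - \rho})$ does not affect the OLS coefficients. Note that an intercept column is transformed to $1 - \theta_c$, which varies with cluster size when clusters are unbalanced.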

The GLS estimator is at least as efficient as OLS assuming (24.24) to (24.26) hold. In the
special case that all regressors are cluster-invariant there is no efficiency gain as GLS then
coincides with OLS (Kloek, 1981). More generally, Scott and Holt (1982) give a quite
conservative upper bound to the efficiency loss of OLS compared to GLS as

$$\frac{\mathrm{V}[\mathbf{c}' \hat{\boldsymbol{\beta}}_{\mathrm{GLS}}]}{\mathrm{V}[\mathbf{c}' \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]}
\geq 1 - \left( 1 + \frac{4 (1 - \rho) [1 + \rho (N_0 - 1)]}{N_0^2 \rho^2} \right)^{-1}$$

for arbitrary vector $\mathbf{c}$ and where $N_0 = \max\{N_c\}$ is the sample size of the largest
cluster. This bound is increasing in $N_0$ and $\rho$, and even for $N_0 = 1,000$ and
$\rho = 0.10$, OLS is at most 22% less efficient than GLS.

Given these small efficiency gains to GLS it is more common to focus on OLS estimation with
correct standard errors, unless OLS is inconsistent because the CSFE model is appropriate. The
main impact of clustering is that OLS is much less efficient compared to the case of no
clustering, as is clear from the discussion of calculation of standard errors for the OLS
estimator in Section 24.5.2.

If clusters are large, then the CSRE model can be relaxed to permit the error variance and
intraclass correlation to vary across clusters. Then in (24.35) for $\boldsymbol{\Sigma}_c$ we replace
$\sigma^2$ and $\rho$ by $\sigma_c^2$ and $\rho_c$, respectively, using consistent estimates for
$\sigma_c^2$ and $\rho_c$ given after (24.36).

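If clusters are small then robust standard errors that do not constrain error correlation to
be constant within a cluster can be obtained, analogous to (24.34) for OLS. Then

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{GLS,RE}}] =
\left( \sum_{c=1}^{C} \mathbf{X}_c' \hat{\boldsymbol{\Sigma}}_c^{-1} \mathbf{X}_c \right)^{-1}
\left( \sum_{c=1}^{C} \mathbf{X}_c' \hat{\boldsymbol{\Sigma}}_c^{-1} \hat{\mathbf{u}}_c \hat{\mathbf{u}}_c' \hat{\boldsymbol{\Sigma}}_c^{-1} \mathbf{X}_c \right)
\left( \sum_{c=1}^{C} \mathbf{X}_c' \hat{\boldsymbol{\Sigma}}_c^{-1} \mathbf{X}_c \right)^{-1},$$

where $\hat{\mathbf{u}}_c = \mathbf{y}_c - \mathbf{X}_c \hat{\boldsymbol{\beta}}_{\mathrm{GLS,RE}}$. This estimate requires $N_c$
small and $C \to \infty$, and it assumes independence of errors in different clusters.


GLS  Implemented  as  OLS  in  a  Transformed  Model
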
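To derive (24.38), note that for $\boldsymbol{\Sigma}_c$ defined in (24.35)

$$\boldsymbol{\Sigma}_c^{-1} = \left[ \sigma^2 [(1 - \rho) \mathbf{I}_c + \rho \mathbf{e}_c \mathbf{e}_c'] \right]^{-1}
= \frac{1}{\sigma^2 (1 - \rho)} \left[ \mathbf{I}_c - (\rho / \tau_c) \mathbf{e}_c \mathbf{e}_c' \right],$$

where $\tau_c = 1 + \rho (N_c - 1)$ and hence

$$\boldsymbol{\Sigma}_c^{-1/2} = \frac{1}{\sigma \sqrt{1 - \rho}} \left[ \mathbf{I}_c - (\theta_c / N_c) \mathbf{e}_c \mathbf{e}_c' \right],$$

using the general results that if $\mathbf{e}$ is an $M \times 1$ vector of ones then

$$[\mathbf{I} + a \mathbf{e} \mathbf{e}']^{-1} = \mathbf{I} - [a / (1 + a M)] \mathbf{e} \mathbf{e}',$$

$$[\mathbf{I} + a \mathbf{e} \mathbf{e}']^{-1/2} = \mathbf{I} - M^{-1} \left( 1 - 1 / \sqrt{1 + a M} \right) \mathbf{e} \mathbf{e}'.$$

Now in (24.37)
$\mathbf{X}_c' \boldsymbol{\Sigma}_c^{-1} \mathbf{X}_c = (\boldsymbol{\Sigma}_c^{-1/2} \mathbf{X}_c)' \boldsymbol{\Sigma}_c^{-1/2} \mathbf{X}_c$, where

$$\boldsymbol{\Sigma}_c^{-1/2} \mathbf{X}_c = [\mathbf{I}_c - (\theta_c / N_c) \mathbf{e}_c \mathbf{e}_c'] \mathbf{X}_c
= \mathbf{X}_c - \theta_c \mathbf{e}_c \bar{\mathbf{x}}_c'$$

and where $\bar{\mathbf{x}}_c = N_c^{-1} \sum_j \mathbf{x}_{jc}$ and we ignore the scalar multiple
$1 / (\sigma \sqrt{1 - \rho})$ as it will cancel out when we similarly consider
$\mathbf{X}_c' \boldsymbol{\Sigma}_c^{-1} \mathbf{y}_c$. The transformed regression model (24.38) follows.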
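
The two matrix identities used in this derivation can be confirmed numerically. The check below uses arbitrary assumed values of $N_c$, $\sigma^2$, and $\rho$, and builds the closed-form inverse and inverse square root of the equal-correlation covariance matrix in (24.35) from $\tau_c = 1 + \rho (N_c - 1)$ and $\theta_c$ in (24.39).

```python
import numpy as np

n_c, sigma2, rho = 5, 2.0, 0.3                     # assumed illustrative values
I = np.eye(n_c)
ee = np.ones((n_c, n_c))                           # e_c e_c'

Sigma_c = sigma2 * ((1.0 - rho) * I + rho * ee)    # covariance matrix in (24.35)
tau_c = 1.0 + rho * (n_c - 1)
theta_c = 1.0 - np.sqrt(1.0 - rho) / np.sqrt(tau_c)

# Closed-form inverse and inverse square root from the derivation.
Sigma_inv = (I - (rho / tau_c) * ee) / (sigma2 * (1.0 - rho))
Sigma_inv_half = (I - (theta_c / n_c) * ee) / np.sqrt(sigma2 * (1.0 - rho))

print(np.allclose(Sigma_inv @ Sigma_c, I))
print(np.allclose(Sigma_inv_half @ Sigma_inv_half, Sigma_inv))
```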

24.5.4.  Cluster-Specific  Fixed  Effects  Estimator 

The basic idea of the CSFE model is straightforward: Let the cluster effect enter the
conditional mean function through the intercept term. The model is

$$y_{jc} = \alpha_c + \mathbf{x}_{jc}' \boldsymbol{\beta} + \varepsilon_{jc}, \quad j = 1, \ldots, N_c, \quad c = 1, \ldots, C, \tag{24.40}$$

where now both $\boldsymbol{\beta}$ and $\alpha_c$, $c = 1, \ldots, C$, are parameters to be estimated.

In the CSFE model all cluster-invariant regressors must be dropped, as they cannot be
separately identified from $\alpha_c$. For example, if clustering is on the state and a fixed
effects model is appropriate then the effect of state-invariant regressors such as state average
unemployment cannot be identified. If estimation of the coefficients of state-invariant
regressors is desired then OLS or the CSRE estimator needs to be used instead. However, one
should first use a Hausman test analogous to that presented in Chapter 21 for panel data to
confirm the validity of the strong assumption of the CSRE model that $\alpha_c$ is uncorrelated
with the regressors.

We consider statistical inference under the assumption

$$\varepsilon_{jc} \sim [0, \sigma_{jc}^2].$$

This permits heteroskedasticity of unknown form but assumes that inclusion of the
cluster-specific fixed effect $\alpha_c$ is sufficient to control for any error correlation within
a cluster. This is a departure from panel data analysis where concern about time-series
correlation in the errors even after inclusion of individual-specific effects leads to richer
models. If desired, however, one can additionally adjust estimator standard errors for
correlation within a cluster by methods similar to those in Section 24.5.2.

The main complication in estimation of the CSFE model is that in small clusters there are too
many intercepts $\alpha_c$ to estimate.


Cluster  Dummy  Variables  Model 

We first consider large clusters, where the number of clusters is small relative to the total sample size. Then the intercepts $\alpha_c$ can be estimated directly by introducing dummy variables for each cluster and estimating by OLS.

Let observation $i$ denote the $j$th household in the $c$th cluster. Then (24.40) can be written as the cluster dummy variables model

$$y_i = \sum_{c=1}^{C}\alpha_c d_{ci} + \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, N, \tag{24.41}$$

where the $d_{ci}$ are indicator variables that equal one if the $i$th observation belongs to cluster $c$ and equal zero otherwise. Thus $C$ cluster indicator variables, such as state dummy variables, are included, and to avoid the dummy variable trap, $\mathbf{x}$ should not contain an intercept term.

OLS estimation of this model yields consistent estimates of both $\alpha_1, \ldots, \alpha_C$ and $\boldsymbol{\beta}$, assuming a fixed number of clusters $C$ as $N \to \infty$. One can use the usual Eicker-White estimate to obtain standard errors that are robust given heteroskedastic errors.


Within-Clusters  Estimator 

When there are many small clusters we can no longer estimate the model (24.40) by OLS. First, OLS estimation may not be computationally feasible because the number of parameters $(C + K) \to \infty$ as the number of clusters $C \to \infty$. Second, and more importantly, because the number of parameters is going to infinity with the sample size, the OLS estimator is inconsistent unless $N_c \to \infty$.

Interest usually lies in the parameters $\boldsymbol{\beta}$ in (24.40), with $\alpha_1, \ldots, \alpha_C$ viewed as incidental parameters or as nuisance parameters. Then it is convenient to sweep out the fixed effects by an initial data transformation. Each observation $(y_{jc}, \mathbf{x}_{jc})$ is replaced by deviation from the cluster mean, that is, by $(y_{jc} - \bar{y}_c, \mathbf{x}_{jc} - \bar{\mathbf{x}}_c)$, $j = 1, \ldots, N_c$, $c = 1, \ldots, C$, where $\bar{y}_c = N_c^{-1}\sum_j y_{jc}$ and $\bar{\mathbf{x}}_c = N_c^{-1}\sum_j \mathbf{x}_{jc}$ are cluster-specific averages. Then the model (24.40) for $y_{jc}$ implies that

$$y_{jc} - \bar{y}_c = (\mathbf{x}_{jc} - \bar{\mathbf{x}}_c)'\boldsymbol{\beta} + \varepsilon_{jc} - \bar{\varepsilon}_c. \tag{24.42}$$

Applying OLS to the transformed regression (24.42) yields a consistent estimate of $\boldsymbol{\beta}$. If the CSFE coefficients are also of interest, they can be estimated by $\hat{\alpha}_c = \bar{y}_c - \bar{\mathbf{x}}_c'\hat{\boldsymbol{\beta}}$, though this estimate is not consistent for small $N_c$.

Comparison with Chapter 21 shows that this is analogous to the within estimator for panel data. As for panel data, the estimate of $\boldsymbol{\beta}$ from OLS estimation of (24.42) coincides with the estimate of $\boldsymbol{\beta}$ from OLS estimation of the cluster dummy variables model (24.41).
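
The equivalence of the two estimators can be checked numerically. The following Python sketch is an illustration added here, not part of the original analysis: the function names, the simulated data, and the use of NumPy are all assumptions. It computes the slope estimates once from the demeaned regression (24.42) and once from the cluster dummy variables regression (24.41).

```python
import numpy as np

def within_estimator(y, X, cluster):
    """OLS on deviations from cluster means, as in (24.42)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    _, idx = np.unique(cluster, return_inverse=True)
    counts = np.bincount(idx)
    # cluster means, expanded back to one row per observation
    y_bar = (np.bincount(idx, weights=y) / counts)[idx]
    X_bar = np.column_stack(
        [np.bincount(idx, weights=X[:, k]) / counts for k in range(X.shape[1])]
    )[idx]
    return np.linalg.lstsq(X - X_bar, y - y_bar, rcond=None)[0]

def dummy_variable_estimator(y, X, cluster):
    """OLS with a full set of cluster dummies and no intercept, as in (24.41)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    _, idx = np.unique(cluster, return_inverse=True)
    D = np.eye(idx.max() + 1)[idx]               # N x C matrix of cluster indicators
    coef = np.linalg.lstsq(np.column_stack([D, X]), y, rcond=None)[0]
    n_clusters = D.shape[1]
    return coef[n_clusters:], coef[:n_clusters]  # (beta_hat, alpha_hat)

# Simulated data (hypothetical): 30 clusters of 5 observations each,
# with regressors correlated with the cluster effect.
rng = np.random.default_rng(0)
C, Nc = 30, 5
cluster = np.repeat(np.arange(C), Nc)
alpha = rng.normal(size=C)
X = rng.normal(size=(C * Nc, 2)) + alpha[cluster][:, None]
y = alpha[cluster] + X @ np.array([1.0, -0.5]) + rng.normal(size=C * Nc)

beta_within = within_estimator(y, X, cluster)
beta_dummy, alpha_dummy = dummy_variable_estimator(y, X, cluster)
```

The two slope vectors agree up to floating-point error, while only the dummy variables regression returns the $C$ intercepts directly. With many clusters the demeaned regression avoids building the $N \times C$ indicator matrix.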

A between estimator can also be proposed analogous to that for linear panel models. In this case $\bar{y}_c$ is regressed on $\bar{\mathbf{x}}_c$, $c = 1, \ldots, C$. From (24.37), the GLS estimator in the CSRE model involves regression in quasi-differences, where cluster means are multiplied by $\theta_c$ (defined in (24.39)) before differencing. The GLS estimator can be shown to be a linear combination of the within and between estimators. It approaches the within estimator for large $N_c$ as then $\theta_c \to 1$. Note that the within estimator is consistent in the CSRE model.

Caution is necessary in interpreting the standard errors if the regression is applied to the mean-corrected observations. The number of degrees of freedom for this regression is $(N - K - C)$, not $(N - K)$. If software neglects this adjustment then the residual variance from the software should be adjusted by multiplying by the inflation factor $(N - K)/(N - K - C)$ and the standard errors should be inflated by the square root of the same.
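
As a small worked illustration of this adjustment, the following Python sketch rescales standard errors reported by software that used $(N - K)$ degrees of freedom. The function name and the numbers in the example are hypothetical.

```python
import math

def fe_df_correction(se_reported, N, K, C):
    """Rescale standard errors from OLS on demeaned data that used N - K
    degrees of freedom instead of N - K - C."""
    factor = (N - K) / (N - K - C)      # inflation factor for the residual variance
    return [se * math.sqrt(factor) for se in se_reported]

# Hypothetical case: 1,000 observations, 5 regressors, 200 clusters.
adjusted = fe_df_correction([0.10, 0.25], N=1000, K=5, C=200)
```

In this hypothetical case the variance inflation factor is $995/795$, so each standard error rises by a factor of $\sqrt{995/795}$, roughly 12 percent. The adjustment matters most when clusters are small, since then $C$ is large relative to $N$.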


24.5.5.  Diagnostic  Tests  for  Cluster  Effects 

In linear regression a test of cluster-specific fixed effects under normality of errors is just the standard $F$-test of the linear restrictions hypothesis $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_C = 0$ in (24.40). This simply involves a comparison of the $R^2$ statistic for the two regressions with and without the cluster-specific dummy variables.

In the CSRE model a test of cluster effects is a one-sided test of the null hypothesis $H_0: \sigma_\alpha^2 = 0$ versus $H_1: \sigma_\alpha^2 > 0$. An equivalent test can also be formulated as a test of $H_0: \rho = 0$ versus $H_1: \rho > 0$ using the definition in (24.27). The one-sided LM test statistic of this hypothesis, given by Moulton (1987), is

$$\mathrm{LM} = \frac{\sum_c (N_c\bar{\hat{u}}_c)^2 - \sum_c\sum_j \hat{u}_{jc}^2}{\hat{\sigma}^2\left[2\left(\sum_c N_c^2 - N\right)\right]^{1/2}}, \tag{24.43}$$

where $\hat{\sigma}^2 = \sum_c\sum_j \hat{u}_{jc}^2/N$, $\hat{u}_{jc}$ denotes the least-squares residual from the pooled regression of $y$ on $\mathbf{x}$, and $\bar{\hat{u}}_c$ is the average residual for cluster $c$.
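
The statistic (24.43) needs only the pooled OLS residuals and the cluster identifiers. The following Python sketch is an illustration under that reading of the formula; the function name is hypothetical, and the residuals are assumed to come from a pooled regression run beforehand.

```python
import numpy as np

def moulton_lm(resid, cluster):
    """One-sided LM statistic (24.43) computed from pooled OLS residuals."""
    resid = np.asarray(resid, dtype=float)
    _, idx = np.unique(cluster, return_inverse=True)
    N = resid.size
    Nc = np.bincount(idx)                              # cluster sizes N_c
    cluster_sums = np.bincount(idx, weights=resid)     # N_c times the mean residual
    ssr = np.sum(resid ** 2)
    sigma2 = ssr / N
    numerator = np.sum(cluster_sums ** 2) - ssr
    denominator = sigma2 * np.sqrt(2.0 * (np.sum(Nc ** 2) - N))
    return numerator / denominator

# Residuals that share a sign within each cluster give a positive statistic.
lm_positive = moulton_lm([1.0, 1.0, -1.0, -1.0], [0, 0, 1, 1])
# Residuals that offset each other within each cluster give a negative one.
lm_negative = moulton_lm([1.0, -1.0, 1.0, -1.0], [0, 0, 1, 1])
```

Large positive values are evidence against the null of no cluster effects. The statistic is undefined if every cluster has a single observation, since the denominator is then zero.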


24.5.6.  Clustering  in  Nonlinear  Models 

Nonlinear models with clustered data have not attracted much attention in the econometrics literature. There are numerous published articles in biostatistics, however, with a special focus on binary outcome models (Pendergast et al., 1996). Other models such as the Poisson regression and some models for survival data have also been considered. The hierarchical (multilevel) modeling framework has also been used extensively, especially for binary outcome models.

Here we continue to exploit the parallel between clustered and panel data. As in the linear case the data $(y_i, \mathbf{x}_i)$, $i = 1, \ldots, N$, are subscripted as $(y_{jc}, \mathbf{x}_{jc})$, $j = 1, \ldots, N_c$, $c = 1, \ldots, C$. We assume independence over $c$ but permit dependence of observations within cluster $c$.


m-Estimation  with  Clustering 

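Consider a nonlinear estimating equations estimator that solves

$$\sum_{c=1}^{C}\sum_{j=1}^{N_c}\mathbf{h}(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta}) = \mathbf{0}. \tag{24.44}$$

Often these equations are obtained from maximization or minimization of the objective function $\sum_c\sum_j q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta})$, in which case $\mathbf{h}(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta}) = \partial q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$. For example, for quasi-MLE based on the product of marginal densities $\mathbf{h}(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta}) = \partial\ln f(y_{jc}|\mathbf{x}_{jc}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$.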

We assume that data are clustered, so that $\mathrm{Cov}[\mathbf{h}_{jc}, \mathbf{h}_{kc}] \neq \mathbf{0}$. However, we maintain the assumption that $\mathrm{E}[\mathbf{h}(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\theta})] = \mathbf{0}$, a necessary condition for consistency, which rules out the cluster-specific fixed effects model also presented in the following.

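The cluster-robust variance of the OLS estimator (24.34) is easily adapted to the current situation by replacing $\mathbf{x}_{jc}\mathbf{x}_{jc}'$ by $\partial\mathbf{h}_{jc}/\partial\boldsymbol{\theta}'$ and $\mathbf{x}_{jc}\hat{u}_{jc}$ by $\mathbf{h}_{jc}(\hat{\boldsymbol{\theta}})$. Then $\hat{\boldsymbol{\theta}}$ is asymptotically normal with cluster-robust variance matrix

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = \left(\sum_{c=1}^{C}\sum_{j=1}^{N_c}\left.\frac{\partial\mathbf{h}_{jc}}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}}\right)^{-1}\left(\sum_{c=1}^{C}\sum_{j=1}^{N_c}\sum_{k=1}^{N_c}\mathbf{h}_{jc}(\hat{\boldsymbol{\theta}})\mathbf{h}_{kc}(\hat{\boldsymbol{\theta}})'\right)\left(\sum_{c=1}^{C}\sum_{j=1}^{N_c}\left.\frac{\partial\mathbf{h}_{jc}'}{\partial\boldsymbol{\theta}}\right|_{\hat{\boldsymbol{\theta}}}\right)^{-1}. \tag{24.45}$$

Some computer software provides this as a standard option for many parametric nonlinear models.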
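
For software that does not offer this option, (24.45) can be assembled from two ingredients: the observation-level estimating functions evaluated at the estimate, and the summed derivative matrix. The following Python sketch is a minimal illustration; the function name and argument layout are assumptions, and no small-sample correction is applied.

```python
import numpy as np

def cluster_robust_variance(scores, hessian, cluster):
    """
    Sandwich estimate in the form of (24.45).

    scores  : N x q array; each row is h_jc evaluated at theta_hat
    hessian : q x q array; the sum over all observations of dh_jc/dtheta'
    cluster : length-N array of cluster identifiers
    """
    scores = np.asarray(scores, dtype=float)
    _, idx = np.unique(cluster, return_inverse=True)
    # Sum the estimating functions within each cluster before forming
    # outer products; this is what allows within-cluster correlation.
    S = np.zeros((idx.max() + 1, scores.shape[1]))
    np.add.at(S, idx, scores)
    middle = S.T @ S                      # sum_c (sum_j h_jc)(sum_k h_kc)'
    A_inv = np.linalg.inv(np.asarray(hessian, dtype=float))
    return A_inv @ middle @ A_inv.T
```

For OLS the rows of `scores` are $\mathbf{x}_{jc}\hat{u}_{jc}$ and `hessian` is $\mathbf{X}'\mathbf{X}$ (the sign is irrelevant because the matrix enters twice), in which case the function reproduces the linear cluster-robust formula. Treating every observation as its own cluster gives the usual heteroskedasticity-robust sandwich.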

A leading example is quasi-ML estimation based on the product of marginal densities within a cluster rather than the joint density. Specifically, given dependence over $j$ within cluster $c$ we should maximize the log-likelihood

$$\ln L(\boldsymbol{\theta}) = \sum_{c=1}^{C}\ln f(y_{1c}, \ldots, y_{N_c c}|\mathbf{x}_{1c}, \ldots, \mathbf{x}_{N_c c}, \boldsymbol{\theta}).$$

However, the joint density may be difficult to work with or difficult to obtain because for many univariate densities there can be a limited range of multivariate densities. Instead, we may maximize

$$Q(\boldsymbol{\theta}) = \sum_{c=1}^{C}\ln\{f(y_{1c}|\mathbf{x}_{1c}, \boldsymbol{\theta}) \times \cdots \times f(y_{N_c c}|\mathbf{x}_{N_c c}, \boldsymbol{\theta})\} = \sum_{c=1}^{C}\sum_{j=1}^{N_c}\ln f(y_{jc}|\mathbf{x}_{jc}, \boldsymbol{\theta}),$$

which is no longer a true likelihood function, unless $y_{jc}$ are independent over $j$, so the information matrix equality no longer applies. The preceding formulas apply with $\mathbf{h}_{jc}(\boldsymbol{\theta}) = \partial\ln f(y_{jc}|\mathbf{x}_{jc}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$ and $\partial\mathbf{h}_{jc}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}' = \partial^2\ln f(y_{jc}|\mathbf{x}_{jc}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'$.

This means that within each cluster we do not use the likelihood score for each observation as in the case of independent observations; instead, we replace it by the sum of likelihood scores over all cluster elements.


Nonlinear  Cluster-Specific  Random  Effects 

A quite general setup for cluster-specific effects in nonlinear models is to consider the estimator that minimizes or maximizes

$$Q(\boldsymbol{\beta}, \alpha_1, \ldots, \alpha_C) = \sum_{c=1}^{C}\sum_{j=1}^{N_c}q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \alpha_c), \tag{24.46}$$

where cluster effects enter only via the scalar parameter $\alpha_c$, $c = 1, \ldots, C$.

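A simple random effects model assumes that the $\alpha_c$ are iid with parameters $\boldsymbol{\delta}$. Taking expectation with respect to $\alpha_c$ yields the objective function

$$Q(\boldsymbol{\beta}, \boldsymbol{\delta}) = \sum_{c=1}^{C}\int\left[\sum_{j=1}^{N_c}q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \alpha_c)\right]f(\alpha_c|\boldsymbol{\delta})\,d\alpha_c.$$

Estimation can be complicated, especially if there is no closed-form expression for the integral of the sum.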

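Often it is easy to obtain the expectation with respect to one observation, $\mathrm{E}_{\alpha_c}[q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \alpha_c)] = q^*(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \boldsymbol{\delta})$. Then the simpler estimator that ignores clustering and minimizes $Q^*(\boldsymbol{\beta}, \boldsymbol{\delta}) = \sum_c\sum_j q^*(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \boldsymbol{\delta})$ will be consistent, though the standard errors need to be adjusted for clustering using (24.45).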

For  example,  with  count  data  we  can  develop  a  clustered  analogue  of  the  panel 
data  Poisson-gamma  mixture  model.  However,  the  Poisson  quasi-MLE  that  ignores 
clustering  can  still  be  used  as  it  is  consistent,  though  standard  errors  need  to  be  adjusted 
for  clustering. 

Therefore,  although  random  effects  versions  of  nonlinear  models  can  be  developed, 
it  is  often  adequate  to  estimate  parameters  by  ignoring  clustering  and  then  correct  the 
standard  errors  of  estimators  for  the  clustering.  There  can  be  little  reason  for  estimation 
of  clustered  random  effects  models,  aside  from  the  potential  for  efficiency  gains. 


Nonlinear  Cluster-Specific  Fixed  Effects 

Nonlinear variants of the cluster-specific fixed effects model again maximize or minimize

$$Q(\boldsymbol{\beta}, \alpha_1, \ldots, \alpha_C) = \sum_{c=1}^{C}\sum_{j=1}^{N_c}q(y_{jc}, \mathbf{x}_{jc}, \boldsymbol{\beta}, \alpha_c),$$

as in (24.46), except now the parameters $\alpha_1, \ldots, \alpha_C$ are estimated rather than integrated out.

For large clusters, that is, $C$ small and $N_c \to \infty$, we simply optimize $Q(\boldsymbol{\beta}, \alpha_1, \ldots, \alpha_C)$ with respect to $\boldsymbol{\beta}$ and $\alpha_1, \ldots, \alpha_C$. Assuming that $\alpha_1, \ldots, \alpha_C$ completely control for any clustering, inference can be based on standard errors obtained under the usual iid assumptions. This is the nonlinear analogue of the cluster-specific dummy variable model (24.41).

For small clusters, that is, $N_c$ small and $C \to \infty$, we have the problem of too many incidental parameters $\alpha_1, \ldots, \alpha_C$. Unlike the linear model it is generally not possible to eliminate the parameters $\alpha_1, \ldots, \alpha_C$ (Hall and Severini, 1998). However, from Chapter 23 on panel data we see that it is possible in some cases.

For example, the binary logit model with cluster fixed effects specifies

$$\Pr[y_{jc} = 1] = \frac{1}{1 + \exp(-\alpha_c - \mathbf{x}_{jc}'\boldsymbol{\beta})}, \tag{24.47}$$

where for identification $\mathbf{x}_{jc}$ cannot include an intercept or cluster-invariant regressors. The fixed effects $\alpha_c$ can be eliminated using the conditional MLE that conditions on the sum of responses within a cluster, $\sum_j y_{jc} = N_c\bar{y}_c$. The joint conditional probability for the $c$th cluster is


$$\Pr[y_{1c}, \ldots, y_{N_c c}\,|\,N_c\bar{y}_c] = \frac{\exp\left(\boldsymbol{\beta}'\sum_{j=1}^{N_c}\mathbf{x}_{jc}y_{jc}\right)}{\sum_{\mathbf{d}\in B_c}\exp\left(\boldsymbol{\beta}'\sum_{j=1}^{N_c}\mathbf{x}_{jc}d_{jc}\right)} \times \frac{\Gamma\left[\sum_j y_{jc} + 1\right]\Gamma\left[N_c - \sum_j y_{jc} + 1\right]}{\Gamma(N_c + 1)}, \tag{24.48}$$

where $B_c = \{(d_{1c}, \ldots, d_{N_c c}) \mid d_{jc} = 0 \text{ or } 1, \text{ and } \sum_j d_{jc} = \sum_j y_{jc}\}$. The conditional likelihood is the product over all clusters of terms such as these, with clusters of size one excluded from the likelihood. The second term on the right-hand side does not depend on the unknown parameters and hence does not affect the maximization of the likelihood, so it can be ignored when considering maximization. The likelihood is awkward to maximize because the set $B_c$ ranges over the many ways of choosing $N_{1c}$ outcomes $y_{jc} = 1$ from $(N_{1c} + N_{0c})$ total outcomes in cluster $c$. Fortunately, however, a number of popular computer packages provide the conditional logit option for estimating this model. The covariance matrix of all unknown parameters is estimated by the inverse of the log-likelihood Hessian.
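
To make the role of the set $B_c$ concrete, the following Python sketch evaluates the log of the first term in (24.48) for a single cluster by brute-force enumeration. It is an illustration only: the function name is hypothetical, and enumeration is practical only for small clusters, which is why packaged conditional logit routines use more efficient recursions.

```python
from itertools import combinations

import numpy as np

def cond_logit_loglik_cluster(beta, X_c, y_c):
    """Log of the first term in (24.48) for one cluster.

    X_c : N_c x K regressor matrix (no intercept, no cluster-invariant columns)
    y_c : length-N_c vector of 0/1 outcomes
    """
    X_c = np.asarray(X_c, dtype=float)
    y_c = np.asarray(y_c, dtype=int)
    index = X_c @ np.asarray(beta, dtype=float)       # x_jc' beta for each j
    n_ones = int(y_c.sum())
    numerator = float(index @ y_c)
    # Enumerate B_c: every way of placing n_ones ones among the N_c positions.
    terms = [index[list(ones)].sum()
             for ones in combinations(range(len(y_c)), n_ones)]
    return numerator - float(np.log(np.sum(np.exp(terms))))

# Two observations, one success: the contribution reduces to a binary choice
# between the two observations.
ll_pair = cond_logit_loglik_cluster([0.5], [[1.0], [0.0]], [1, 0])
# A cluster of size one contributes zero whatever the value of beta.
ll_single = cond_logit_loglik_cluster([0.5], [[2.0]], [1])
```

Summing this function over clusters gives the conditional log-likelihood up to the parameter-free second term. Clusters in which all outcomes are zero or all are one also contribute zero, so only clusters with within-cluster variation in $y_{jc}$ are informative about $\boldsymbol{\beta}$.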

As another example, consider the Poisson fixed effects cluster model, which specifies

$$y_{jc} \sim \mathcal{P}[\mu_{jc} = \alpha_c\exp(\mathbf{x}_{jc}'\boldsymbol{\beta})], \quad c = 1, \ldots, C,$$

where $\mathcal{P}[\cdot]$ denotes the Poisson distribution, and $\mathbf{x}_{jc}$ excludes an intercept and any cluster-invariant regressors. This is the usual Poisson model, except that the usual conditional mean $\exp(\mathbf{x}_{jc}'\boldsymbol{\beta})$ is scaled multiplicatively by the cluster-specific fixed effect $\alpha_c$. For this particular model a variety of approaches, including conditional ML and concentrated ML, lead to elimination of the parameters $\alpha_c$. Consistent estimates of the parameters $\boldsymbol{\beta}$ can be obtained by solving the estimating equations

$$\sum_{c=1}^{C}\sum_{j=1}^{N_c}\mathbf{x}_{jc}\left(y_{jc} - \frac{\bar{y}_c}{\bar{\lambda}_c}\lambda_{jc}\right) = \mathbf{0},$$

where $\lambda_{jc} = \exp(\mathbf{x}_{jc}'\boldsymbol{\beta})$ and $\bar{y}_c = N_c^{-1}\sum_j y_{jc}$ and $\bar{\lambda}_c = N_c^{-1}\sum_j \lambda_{jc}$ are cluster means. For further details see the discussion of this model in the panel data case in Section 23.7.
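
The left-hand side of these estimating equations is simple to evaluate for a trial value of $\boldsymbol{\beta}$. The following Python sketch is an illustration with a hypothetical function name; it assumes every cluster has at least one positive count so that the cluster ratio is well defined.

```python
import numpy as np

def poisson_fe_moments(beta, y, X, cluster):
    """Left-hand side of the Poisson fixed effects estimating equations."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    _, idx = np.unique(cluster, return_inverse=True)
    lam = np.exp(X @ np.asarray(beta, dtype=float))     # lambda_jc
    # ybar_c / lambdabar_c: the ratio of cluster means equals the ratio of
    # cluster sums, so the cluster sizes cancel.
    ratio = (np.bincount(idx, weights=y) / np.bincount(idx, weights=lam))[idx]
    return X.T @ (y - ratio * lam)

# Check on noise-free data: if y_jc equals alpha_c * lambda_jc exactly, the
# equations hold at the true beta.
rng = np.random.default_rng(2)
cl = np.repeat(np.arange(5), 4)
X_demo = rng.normal(size=(20, 2))
beta_demo = np.array([0.3, -0.2])
alpha_demo = np.array([0.5, 1.0, 2.0, 1.5, 0.8])
y_demo = alpha_demo[cl] * np.exp(X_demo @ beta_demo)
moments_at_truth = poisson_fe_moments(beta_demo, y_demo, X_demo, cl)
```

An estimate of $\boldsymbol{\beta}$ is a root of this function, which a general-purpose nonlinear equation solver can locate. The ratio $\bar{y}_c/\bar{\lambda}_c$ plays the role of the estimated $\alpha_c$, so the fixed effects never need to be computed separately.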


24.5.7.  Further  Methods  for  Clustered  Data 

The  essential  feature  of  clustering  is  that  there  is  dependence  across  observations. 
A  related  topic  is  spatial  correlation  (see  for  example  Anselin  (2001),  Lee  (2004)), 
where  the  observational  unit  is  a  region,  such  as  a  state,  and  observations  in  regions 
close  to  each  other  are  likely  to  be  correlated. 

The  random  effects  approach  can  be  generalized  to  consider  slope  coefficients  as 
well  as  the  intercept.  This  is  presented  in  the  next  section  for  hierarchical  linear 
models.  For  nonlinear  models  the  issues  are  similar  to  those  for  panel  data  presented 
in  Chapter  23. 

The bootstrap can be used to obtain cluster-robust standard errors, in settings where clustering leads to correlation within a cluster but does not affect estimator consistency. Intuitively, one should resample with replacement over clusters $c$, in which case we require small clusters with $C \to \infty$. At the $b$th bootstrap replication we draw $C$ clusters with replacement and use all of the households $j$ in these $C$ resampled clusters to estimate the $\hat{\boldsymbol{\theta}}_b$ that solves (24.44). Then one can estimate $\mathrm{V}[\hat{\boldsymbol{\theta}}]$ by applying the usual sample variance formula to $\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_B$, where $B$ is the number of bootstrap replications. Note that the resampling is done over clusters rather than households, since it is clusters that are assumed to be iid whereas there is within-cluster dependence.
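
The resampling scheme can be sketched in a few lines of Python. The example below is an illustration only: it uses OLS as the estimator in place of a general solution to (24.44), and the function name, the default of 400 replications, and the seed are assumptions.

```python
import numpy as np

def cluster_bootstrap_variance(y, X, cluster, B=400, seed=0):
    """Cluster bootstrap estimate of the variance matrix of the OLS estimator."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    cluster = np.asarray(cluster)
    rng = np.random.default_rng(seed)
    labels = np.unique(cluster)
    rows = [np.flatnonzero(cluster == c) for c in labels]   # row indices per cluster
    draws = np.empty((B, X.shape[1]))
    for b in range(B):
        # Draw C clusters with replacement ...
        picked = rng.integers(0, len(labels), size=len(labels))
        # ... and keep every observation in each drawn cluster.
        take = np.concatenate([rows[c] for c in picked])
        draws[b] = np.linalg.lstsq(X[take], y[take], rcond=None)[0]
    # Usual sample variance formula applied to the B replicate estimates.
    return np.cov(draws, rowvar=False)
```

Bootstrap standard errors are the square roots of the diagonal of the returned matrix. A cluster drawn more than once enters the replicate sample more than once, which is what preserves the within-cluster dependence in each replication.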


24.6.  Hierarchical  Linear  Models 

Section  24.5  restricted  the  role  of  cluster  effects  in  the  random  effects  model  to  be 
confined  to  the  regression  intercept.  A  more  general  random  effects  model  allows 
clusterwise  variation  in  the  slope  parameters  also.  Intercluster  variation  in  a  subset 
of regression parameters could be linked to observable cluster characteristics. Because
such models involve several layers of specification, they are called hierarchical
models. 

A  standard  framework  for  clustered  data  in  many  applied  statistics  disciplines  is 
that of hierarchical linear models, also called multilevel linear models, random coefficients
models, variance components models, and mixed linear or mixed effects
models.  This  class  of  models  brings  into  the  specification  additional  information.  We 
begin  with  a  presentation  of  the  model  for  individuals  clustered  in  groups.  Then  the 
model  is  adapted  to  short  panels  where  repeated  measures  data  are  clustered  for  each 
individual. 


24.6.1.  Model  Structure 

A hierarchical or multilevel model is a model that can be applied to data with a nested structure. Examples are data on individuals within a region, such as a state or country, or within an organizational unit, such as a school or community, or within a family if siblings data are used. Panel data are also an example, with repeated measures on the same individual interpreted as observations that are nested within an individual.

We begin with a linear model

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta}_j + u_{ij}, \tag{24.49}$$

where the innovation is to let the $K$ regression parameters $\boldsymbol{\beta}$ vary by group (or cluster) $j$. A concrete example is to consider data on students within schools. Then $y_{ij}$ is an outcome measure such as test score for the $i$th student in the $j$th school, and the marginal effect of a change in a regressor such as race of the student varies across schools. Note that the standard hierarchical linear model (HLM) notation, which we use, reverses the subscripts compared to those in Section 24.5 where $y_{cj}$ would be the test score for the $j$th student in the $c$th school.

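The two-level hierarchical linear model specifies the coefficients in the level-one model (24.49) to be determined by a linear function of a random term and level-two variables, here school characteristics. Begin with the scalar parameter $\beta_{kj}$, the $k$th component of the $K \times 1$ vector parameter $\boldsymbol{\beta}_j$. Then $\beta_{kj}$ is modeled as depending on a vector of school characteristics $\mathbf{w}_k$ that take value $\mathbf{w}_{kj}$ for the $j$th school, with

$$\beta_{kj} = \mathbf{w}_{kj}'\boldsymbol{\gamma}_k + v_{kj}, \quad k = 1, \ldots, K, \tag{24.50}$$

where the first component of $\mathbf{w}_{kj}$ is usually a constant. Stacking over all $K$ components of $\boldsymbol{\beta}$ we have

$$\begin{bmatrix} \beta_{1j} \\ \vdots \\ \beta_{Kj} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{1j}' & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \ddots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{w}_{Kj}' \end{bmatrix}\begin{bmatrix} \boldsymbol{\gamma}_1 \\ \vdots \\ \boldsymbol{\gamma}_K \end{bmatrix} + \begin{bmatrix} v_{1j} \\ \vdots \\ v_{Kj} \end{bmatrix}$$

or in obvious matrix notation

$$\boldsymbol{\beta}_j = \mathbf{W}_j\boldsymbol{\gamma} + \mathbf{v}_j. \tag{24.51}$$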

The model (24.50) is flexible and nests many models as special cases. These special cases include models with random intercepts and random slopes, but the framework additionally permits regression coefficients to vary with level-two observables $\mathbf{w}_j$. The range of models is very broad as the following indicates.

The $k$th level-one coefficient is called a fixed coefficient if $\beta_{kj} = \gamma_k$, in which case the coefficient does not vary with level-two regressors or with unobservables. If all level-one coefficients are fixed the model (24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\gamma} + u_{ij}$, in which case estimation by OLS regression is appropriate. Note that the term fixed coefficient has a very different meaning to the term fixed effect used by econometricians in the panel context.

The $k$th level-one coefficient is said to be a nonrandomly varying coefficient if $\beta_{kj} = \mathbf{w}_{kj}'\boldsymbol{\gamma}_k$. Then the coefficient is a linear function of school characteristics. If all level-one coefficients are fixed, except that the intercept is nonrandomly varying, the model (24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{w}_{1j}'\boldsymbol{\gamma}_1 + u_{ij}$, which is a standard OLS regression of the outcome on individual characteristics and school characteristics.

The $k$th level-one coefficient is said to be a randomly varying coefficient if $\beta_{kj} = \gamma_k + v_{kj}$. Then the coefficient is purely random and does not vary with school characteristics. If all level-one coefficients are randomly varying, so that $\boldsymbol{\beta}_j = \boldsymbol{\gamma} + \mathbf{v}_j$, the model is a variance components model or random coefficients model. If all level-one coefficients are fixed, except that the intercept is randomly varying, then the model (24.49) reduces further to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + v_{1j} + u_{ij}$, which is a random intercept model.

In practice some of the level-one coefficients may be both nonrandomly and randomly varying, as in the general case (24.50). If just the level-one intercept follows the general model (24.50) whereas all other level-one coefficients are fixed, the model (24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{w}_{1j}'\boldsymbol{\gamma}_1 + v_{1j} + u_{ij}$. This is the usual pooled regression model, with error that has two components and is therefore correlated across individuals at the same school.

The HLM framework can be extended to additional levels. For example, individual students (subscript $i$) may be nested in schools (subscript $j$), which are nested in a region (subscript $k$). Then the three-level HLM specifies at the first level the student outcome $y_{ijk} = \mathbf{z}_{ijk}'\boldsymbol{\pi}_{jk} + e_{ijk}$, where the parameters $\boldsymbol{\pi}_{jk} = \mathbf{X}_{jk}\boldsymbol{\beta}_k + \mathbf{u}_{jk}$, and in turn $\boldsymbol{\beta}_k = \mathbf{W}_k\boldsymbol{\gamma} + \mathbf{v}_k$.

The HLM can be reexpressed as a mixed linear model, since substituting (24.50) into (24.49) yields

$$y_{ij} = (\mathbf{x}_{ij}'\mathbf{W}_j)\boldsymbol{\gamma} + \mathbf{x}_{ij}'\mathbf{v}_j + u_{ij}. \tag{24.52}$$

The goal is to estimate the regression parameter $\boldsymbol{\gamma}$ and the variances and covariances of the errors $u_{ij}$ and $\mathbf{v}_j$. Since the errors are assumed to be independent of regressors, pooled OLS estimation of (24.52) yields consistent parameter estimates of $\boldsymbol{\gamma}$. The HLM approach uses more efficient estimators that exploit assumptions on the variances and covariances of the errors $u_{ij}$ and $\mathbf{v}_j$.
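
A small simulation illustrates both the construction of the regressors $\mathbf{x}_{ij}'\mathbf{W}_j$ in (24.52) and the consistency of pooled OLS. The following Python sketch is hypothetical throughout: the design (400 schools of 20 students, one student-level regressor, one school characteristic), the parameter values, and the normal errors are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 400, 20                                 # hypothetical: 400 schools, 20 students each
gamma = np.array([1.0, 0.5, 2.0, -1.0])        # stacked (gamma_1', gamma_2')'

Z_blocks, y_blocks = [], []
for j in range(J):
    w = np.array([1.0, rng.normal()])          # level-two regressors: constant, school characteristic
    W = np.kron(np.eye(2), w[None, :])         # W_j = blockdiag(w', w'), a 2 x 4 matrix
    v = rng.normal(scale=0.5, size=2)          # random parts of the intercept and the slope
    beta_j = W @ gamma + v                     # level-two model (24.51)
    x = np.column_stack([np.ones(n), rng.normal(size=n)])   # level-one regressors
    u = rng.normal(size=n)
    y_blocks.append(x @ beta_j + u)            # level-one model (24.49)
    Z_blocks.append(x @ W)                     # regressors x_ij' W_j of (24.52)

Z = np.vstack(Z_blocks)
y = np.concatenate(y_blocks)
gamma_ols = np.linalg.lstsq(Z, y, rcond=None)[0]   # pooled OLS estimate of gamma
```

The columns of `Z` are the constant, the school characteristic, the student regressor, and their interaction, so the mixed model is a pooled regression with cross-level interactions and a composite error $\mathbf{x}_{ij}'\mathbf{v}_j + u_{ij}$. Pooled OLS recovers $\boldsymbol{\gamma}$ here, but because the composite error is correlated within schools and heteroskedastic, default OLS standard errors would not be appropriate.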

In the simplest case $u_{ij}$ are assumed to be iid $\mathcal{N}[0, \sigma^2]$ and $\mathbf{v}_j$ is assumed to be iid $\mathcal{N}[\mathbf{0}, \mathbf{T}]$. Then the model can be represented as

$$y_{ij} \sim \mathcal{N}[\mathbf{x}_{ij}'\boldsymbol{\beta}_j, \sigma^2],$$

$$\boldsymbol{\beta}_j \sim \mathcal{N}[\mathbf{W}_j\boldsymbol{\gamma}, \mathbf{T}].$$

An early treatment of this was provided in a Bayesian setting by Lindley and Smith (1972), in which $\boldsymbol{\gamma}$ are called hyperparameters, which in more general models can themselves depend in turn on higher level hyperparameters. The parameters $\boldsymbol{\gamma}$, $\sigma^2$, and $\mathbf{T}$ can be estimated by maximum likelihood methods or by Bayes methods. Alternatively, ML methods can be used that are essentially the same as those for the mixed linear panel data model presented in Section 21.7. A complete treatment is given in Bryk and Raudenbush (1992, 2002).


24.6.2.  HLM  for  Panel  Data 

The HLM literature interprets a short panel as repeated measures for an individual. Then the individual becomes level two in the two-level HLM, whereas the individual was level one in the preceding section. The model (24.49) becomes

$$y_{ti} = \mathbf{x}_{ti}'\boldsymbol{\beta}_i + u_{ti}, \tag{24.53}$$

where, for example, $y_{ti}$ denotes an outcome measure at time $t$ for student $i$, and the marginal effect of changes in regressors such as specific subjects studied varies across students. The scalar parameter $\beta_{ki}$, the $k$th element of the $K \times 1$ vector parameter $\boldsymbol{\beta}_i$, is modeled as depending on a vector of individual characteristics $\mathbf{w}_k$ that takes value $\mathbf{w}_{ki}$ for the $i$th individual, with

$$\beta_{ki} = \mathbf{w}_{ki}'\boldsymbol{\gamma}_k + v_{ki}, \quad i = 1, \ldots, N. \tag{24.54}$$

The individual-specific effects model is the special case that all level-one coefficients are fixed, so $\beta_{ki} = \gamma_k$, except that the intercept term $\beta_{1i}$ can vary across individuals (the level-two grouping).

The individual-specific fixed effects model arises if there is no model for the intercept $\beta_{1i}$, but instead $\beta_{1i}$ is directly estimated. This is an extreme case of a nonrandomly varying coefficient, with $\beta_{1i} = \mathbf{w}_{1i}'\boldsymbol{\gamma}_1$, where $\mathbf{w}_{1i}$ is an $N \times 1$ vector of indicator variables with $l$th component equal to one if $i = l$ and equal to zero otherwise so that $\beta_{1i} = \gamma_{1i}$. The HLM framework is not designed to accommodate what econometricians call the fixed effects model.

The individual-specific random effects model arises if the intercept $\beta_{1i}$ is a randomly varying coefficient, so that $\beta_{1i} = \gamma_1 + v_{1i}$. Clearly, much more general random effects models can be specified with $\beta_{ki}$ also depending on regressors $\mathbf{w}_{ki}$.

As already noted, the HLM is a mixed linear model. For the panel data case the analogue of (24.52) is

$$y_{ti} = (\mathbf{x}_{ti}'\mathbf{W}_i)\boldsymbol{\gamma} + \mathbf{x}_{ti}'\mathbf{v}_i + u_{ti}.$$

The random effects model of Chapter 21 is the specialization to $y_{ti} = \mathbf{x}_{ti}'\boldsymbol{\gamma} + v_i + u_{ti}$.

A standard panel application of the HLM framework is to growth models, where the outcome $y_{ti}$ is individual intelligence or height, which is a function of age, and the marginal effect of age is permitted to vary across individuals. Here the slope coefficient in addition to the intercept is permitted to vary across individuals.


24.7.  Clustering  Example:  Vietnam  Health  Care  Use 

In  this  section  we  focus  on  estimation  in  the  presence  of  clustering,  since  this  is  the 
most  common  complication  of  survey  data  that  appears  in  microeconometrics  research. 
The  methods  in  Section  24.5  are  implemented. 

Both linear and nonlinear regression models are estimated based on individual- and household-level data from the World Bank's Vietnam Living Standards Survey (VLSS) of 1997-1998. The survey collected detailed information on a variety of topics from over 27,700 individuals in approximately 6,000 households distributed over approximately 194 communes. In what follows "commune" is treated as a cluster or a group and it is hypothesized that the observed outcomes are correlated within a commune. Average cluster size in the household sample is about 26, maximum cluster size is 39, and minimum cluster size is 1. To illustrate linear and nonlinear cluster models three outcomes will be modeled.

First, we consider a (log)linear regression model of total annual household health care expenditure (LNEXP12M), for households with positive expenditure, as a function of the (log of) total household expenditure (HHEXP), controlling for several standard sociodemographic variables, a type of "Engel curve" for health care expenditure. Of interest is the coefficient of total household expenditure, which is an estimate of the household income elasticity of demand for health care.

Second,  we  use  information  on  individual  responses  to  estimate  clustered  count 
models  for  a  type  of  health  care  that  accounts  for  a  high  proportion  of  aggregate  private 
health  care  expenditure.  In  modeling  these  outcomes  we  control  for  recent  health  status 
of  an  individual,  household  income,  health  insurance  status,  and  various  demographic 
variables  such  as  age,  sex,  marital  status,  and  educational  attainment  of  the  head  of 
the  household.  Information  about  health  status  was  restricted  to  ILLNESS  or  INJURY 
sustained  in  the  survey  period,  the  duration  of  illness,  and  number  of  days  of  restricted 
activity.  The  key  coefficients  of  interest  are  again  the  coefficients  on  the  income  and 
insurance  status  variables. 

Table  24.3  provides  the  definitions  and  summary  statistics  for  variables  used  in 
these  examples. 

In  both  cases  the  key  issues  are  the  following:  What  is  the  impact  of  clustering  on 
the  estimate  of  this  elasticity?  How  does  the  elasticity  and  its  impact  vary  as  different 
statistical  assumptions,  models,  and  estimators  are  used? 


24.7.1.  Results  and  Discussion 

Table 24.4 gives the results for the OLS regression, HC $t$-ratios, fixed effects, and random effects formulations. There is a relatively minor change in standard errors resulting from the use of a heteroskedasticity-consistent variance estimator that does not take account of the clusters. However, when the cluster-robust variance estimator (24.34) is used there is a substantial change in the standard errors. The $t$-ratio for the expenditure elasticity drops from 16.01 to 12.68. All $t$-ratios become smaller and those for the two variables SEX and HHSIZE fall below 1.96. These results suggest, as expected, that ignoring intracluster correlation causes inflation in the OLS $t$-ratios.

The $F$-test of the null hypothesis that all fixed effects are equal rejects the null. The fixed effects results have essentially the same pattern but note that the $t$-ratios are even smaller. The point estimate of the income elasticity is now 0.60 compared with 0.67 in the OLS results. However, overall there is no significant shift in the inference about the role of different variables.

A  /2(1)  score  test  of  the  null  hypothesis  that  the  random  variation  in  the  intercept 
is  zero,  based  on  (24.43),  easily  rejects  the  null,  indicating  that  the  RE  model  is  an 
improvement  over  the  restricted  regression.  However,  the  estimated  RE  model  also 
does  not  result  in  a  significant  change  in  the  assessment  of  the  role  of  different  vari¬ 
ables.  As  expected  the  results  presented  under  the  FGLS  columns  and  the  RE  (GLS) 
columns  are  very  similar.  The  minor  differences  are  essentially  due  to  the  different 
values  used  in  the  GLS  transformation.  The  FGLS  estimates  are  based  on  jo  =  0.12, 
Table 24.3. Vietnam Health Care Use: Data Description

Variable         Definition                                                        Mean      Standard Deviation

Household data
LNEXP12M         Total household health care expenditure for 12 months              6.31        1.59
AGE              Age of head of household                                          48.01       13.77
SEX              Equals 1 if the head of the household is female, 0 otherwise       0.27        0.44
HHSIZE           Total household size                                               4.73        1.96
URBAN            Equals 1 if urban household, 0 otherwise                           0.29        0.45
EDUC             Schooling year of household head                                   7.09        4.41
HHEXP            Total nominal household expenditure (1998 VN dong)                15273       13020

Individual data
PHARVIS          Number of direct pharmacy visits                                   0.51        1.31
LNMEDEXP (> 0)   log (total medical expenditure) for those with positive
                 expenditure (1998 VN dong)                                         2.14        1.08
AGE              Age in years                                                       29.7        9.67
SEX              Equals 1 if respondent is male                                     0.51        0.49
MARRIED          Equals 1 for married person                                        0.40        0.49
EDUC             Completed diploma level                                            3.38        1.94
ILLNESS          Number of illnesses experienced in past 12 months                  0.62        0.90
INJURY           Equals 1 if injured during survey period                           0.62        0.90
ILLDAYS          Number of illness days                                             2.80        5.45
ACTDAYS          Number of days of limited activity                                 0.06        1.11
INSURANCE        Equals 1 if respondent has health insurance coverage               0.16        0.37
MEDEXP (> 0)     Medical expenditure conditional on positive expenditure           21.04         208
MEDEXP           Medical expenditure (1998 VN dong)                                 6.13      112.75

an estimate obtained by averaging 100 estimates of $\rho$ obtained using 100 resampled
pairs of least-squares residuals.

The absolute differences between FE and RE results are relatively small. Informal
comparison does not suggest that the FE and RE formulations yield substantially different
results; however, the Hausman test suggests that there is a statistically significant
difference between the two sets of estimates.

In  summary,  these  results  suggest  that  it  is  highly  desirable  to  make  some  adjust¬ 
ment  for  intracluster  correlation,  and  how  exactly  we  do  so  appears  to  have  a  relatively 
small  impact  on  the  results. 

Next  we  consider  the  results  for  the  counted  variable,  number  of  pharmacy  vis¬ 
its  (PHARVIS)  by  individuals,  using  the  Poisson  model.  This  is  an  interesting  vari¬ 
able  because  a  high  proportion  of  medical  expenditure  in  Vietnam  takes  the  form  of 
self-prescribed  medication  through  the  purchase  and  use  of  over-the-counter  drugs 
Table 24.4. Vietnam Health Care Use: FE and RE Models for Positive Expenditure
, within regression R2; R2B, between regression R2; R2, overall R2.
Table 24.5. Vietnam Health Care Use: Frequencies for Pharmacy Visits

Visits                   0      1      2      3      4      5      6      7      8      9    10+
PHARVIS              20639   3827   1716    776    359    174     64     43     16      4    115
PHARVIS (fraction)    .744   .137   .062   .028   .013   .006   .002   .001   .000   .000   .004

purchased  directly  at  pharmacies.  This  form  of  health  care  is  assumed  to  be  of  lower 
quality  than  that  obtained  under  professional  supervision.  In  Vietnam  eligible  indi¬ 
viduals,  usually  high-income  government  and  private  sector  employees,  are  able  to 
purchase  health  insurance  that  entitles  them  to  obtain  care  at  government  hospitals  and 
to  obtain  prescribed  medications  there  also.  From  Table  24.3  observe  that  16%  of  the 
sampled  individuals  have  such  health  insurance. 

Table  24.5  shows  the  observed  frequency  distribution  of  PHARVIS.  About  26%  of 
the  individuals  have  one  or  more  visits  in  the  survey  period  and  around  95%  have  a 
total  of  three  or  fewer  visits. 

Table 24.6 presents the results for several variants of the Poisson regression, analogous
to those in Table 24.4 for linear regressions. The first column gives the Poisson
MLE estimates, and the ordinary unadjusted t-ratios are in the second column. The
next column shows robust t-ratios based on heteroskedasticity-consistent variance estimates.
These are considerably smaller, in some cases by a factor exceeding 2, than
the unadjusted ones. The fourth column gives cluster-adjusted t-ratios based on variances
calculated using (24.45). The fact that these are substantially smaller than those
in the two preceding columns confirms that there is indeed significant intracluster


Table 24.6. Vietnam Health Care Use: RE and FE Models for Pharmacy Visits

                 Poisson             Het Robust   Cluster Robust   Fixed Effects Poisson   Random Effects Poisson
Variables        Coef.      |t|      |t|          |t|              Coef.      |t|          Coef.      |t|
CONS             -1.637     35.78    18.81        12.25            --         --           1.318      19.41
LNHHEXP           .078       5.68     3.08         1.90            -.114       6.01        -.095       4.95
INSURANCE        -.245       9.57     5.68         4.29            -.163       6.17        -.178       6.44
SEX               .084       4.96     2.76         2.73             .098       5.75         .099       5.71
AGE               .024       2.38     1.27         1.06             .003       0.32         .005       0.55
MARRIED           .124       5.92     2.96         2.78             .164       7.59         .158       7.38
ILLDAYS           .042      40.00    14.91        12.91             .046      40.14         .046      40.18
ACTDAYS           .008       1.71     0.43         0.45             .025       4.53         .024       4.35
INJURY            .171       2.30     0.84         0.85             .144       1.80         .143       1.80
ILLNESS           .562      87.15    24.60        21.81             .584      73.45         .585      74.16
EDUC             -.052      11.10     6.47         3.92            -.024       4.18        -.026       4.61
-ln L            25281                                             22446                   23419
N                27765                                             27671                   27765
correlation. The average cluster size exceeds 140 observations; hence even a low degree
of intracluster correlation is likely to inflate t-ratios substantially and the results
confirm that.
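
A back-of-the-envelope calculation illustrates the magnitude. It rests on assumptions made only for this illustration: equicorrelated errors within clusters with intracluster correlation $\rho$, a regressor that does not vary within clusters, and equal cluster sizes $m$. In that special case the default OLS variance is understated by the factor $1 + \rho(m - 1)$, the same quantity that appears in Exercise 24-1(c). The value $\rho = 0.05$ is hypothetical, and $m = 140$ is simply the round cluster size mentioned above.

```latex
\[
1 + \rho(m - 1) = 1 + 0.05 \times (140 - 1) = 7.95,
\qquad
\sqrt{7.95} \approx 2.82 .
\]
```

Under these assumptions, unadjusted t-ratios would be too large by a factor of roughly 2.8, even though the assumed intracluster correlation is small.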

We  next  consider  modeling  the  intracluster  correlation  using  FE  and  RE  models. 
The  FE  model  is  estimated  using  the  conditional  MLE.  Some  clusters  that  do  not  have 
sufficient  intracluster  variation  are  dropped.  The  estimated  coefficients  lead  to  dramat¬ 
ically  different  conclusions  from  those  of  the  Poisson  MLE  estimates.  First,  note  that 
the coefficient of ln(HHEXP) switches from being significantly positive to being significantly
negative. This means that the original regression suggested that a pharmacy
visit  is  a  normal  good,  but  the  FE  estimates  suggest  that  it  is  an  inferior  good;  that  is, 
individuals  avoid  this  form  of  self-medication  as  income  rises.  This  can  be  rationalized 
as  the  fixed  effects  picking  up  the  influence  of  omitted  variables  that  are  correlated  with 
the  observed  outcomes.  These  omitted  variables  could  be  the  quantity  and  quality  of 
alternative  medical  services  available  to  commune  residents.  These  could  vary  a  great 
deal  depending  upon  the  geographical  location  and  economic  status  of  the  communes. 

The last two columns in Table 24.6 give results based on a random effects formulation.
Here it is assumed that the intercept in the Poisson distribution varies randomly
over clusters, and each cluster "draws" its intercept from a common univariate distribution,
specifically a gamma distribution with unit mean. This formulation is attractive
because it does not require conditioning. The RE Poisson panel model with gamma-distributed
intercept, developed by Hausman et al. (1984), has an analytical likelihood
function that can be adapted for clustered data. The estimates obtained for the RE
model are qualitatively similar to those from the FE model. However, the estimated
coefficient for the key income variable has shifted a long way from that obtained under
the simple Poisson assumption.

This  example  shows  that  intracluster  correlation  may  have  an  impact  not  just  on 
efficiency  alone  but  also  on  the  estimates  themselves. 


24.8.  Complex  Surveys 

The  discussion  in  preceding  sections  focused  on  stratification,  weighting,  and  clus¬ 
tering  in  isolation.  Here  we  focus  on  complex  surveys  that  use  a  stratified  multistage 
cluster  sampling  design.  The  intent  of  such  surveys  is  to  present  a  population  summary 
when  population  parameters  may  vary  across  strata.  Then  a  weighted  estimator  is  used 
and  is  viewed  as  an  estimate  of  the  census  parameter.  The  goal  is  to  consistently  esti¬ 
mate  the  variance  of  the  weighted  estimator,  controlling  for  clustering  that  can  be  more 
complicated  than  that  in  Section  24.5. 


24.8.1.  Variance  Estimation  in  Complex  Surveys 

We consider the following setup. The $i$th observation in the sample is household $j$ in
cluster $c$ in stratum $s$. For example, the dependent variable is denoted $y_{scj}$, though more
formally the observation $(s, c, j)$ may be represented as observation $(s, c_s, j_{c_s})$. The
data are $(y_{scj}, \mathbf{x}_{scj}, w_{scj})$, where $w_{scj}$ are sample weights inversely proportional to the
probability of selection of the observation in the sample. The subscripts are ordered in
terms of level of disaggregation, a reversal from the notation of Section 24.5.

Two-stage  or  multistage  sampling  is  used  within  strata,  with  households  selected  as 
the  result  of  at  least  two  sequential  draws.  First,  a  subset  of  all  PSUs  within  the  strata 
is  randomly  drawn.  Second,  a  subset  of  all  households  in  the  selected  PSUs  is  drawn, 
where  clustered  sampling  may  be  permitted.  Further  draws  within  an  SSU  and  so  on 
are  also  possible. 


Variance  of  a  Linear  Statistic 

The starting point is to consider estimation of the variance of a linear statistic that sums
over strata, PSUs, and households:

$$u = \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} u_{scj} = \sum_{s=1}^{S} \sum_{c=1}^{C_s} u_{sc},$$

where $u_{sc}$ are the totals within a PSU, so

$$u_{sc} = \sum_{j=1}^{N_{cs}} u_{scj}.$$

Examples of $u_{scj}$ such as the weighted mean and weighted regression are given in the
following. The variance of $u$ is

$$\mathrm{V}[u] = \sum_{s=1}^{S} \sum_{c=1}^{C_s} \mathrm{V}[u_{sc}] = \sum_{s=1}^{S} C_s \sigma_s^2,$$

if we assume that $u_{sc}$ are independent over strata and are iid over PSUs with common
variance $\sigma_s^2$. The usual unbiased variance estimate of $\sigma_s^2$ can be used, given $u_{sc}$ iid over
$c$, so $\hat{\sigma}_s^2 = (C_s - 1)^{-1} \sum_c (u_{sc} - \bar{u}_s)^2$. It follows that

$$\widehat{\mathrm{V}}[u] = \sum_{s=1}^{S} \frac{C_s}{C_s - 1} \sum_{c=1}^{C_s} (u_{sc} - \bar{u}_s)^2, \qquad (24.55)$$

where $\bar{u}_s = C_s^{-1} \sum_c u_{sc}$ is the stratum average of the PSU totals.

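This estimator allows for clustering within a PSU, since

$$\sum_{c=1}^{C_s} (u_{sc} - \bar{u}_s)^2 = \sum_{c=1}^{C_s} \Bigl( \sum_{j=1}^{N_{cs}} u_{scj} - \bar{u}_s \Bigr)^2 = \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} (u_{scj} - \bar{u}_s)^2 + \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} \sum_{k \neq j} (u_{scj} - \bar{u}_s)(u_{sck} - \bar{u}_s).$$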

The first sum is the contribution to the variance under SRS. The second sum will
be positive under clustered sampling and leads to a larger variance. No assumption
has been made about the nature of the sampling within strata nor about the type of
clustering that arises. For example, (24.55) gives correct standard errors even if there
is three-stage sampling with further subsampling within SSUs.
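
The estimator (24.55) can be sketched in a few lines of code. The following is a minimal illustration, not a reference implementation: the function name `linear_stat_variance` and the argument names are hypothetical, PSU identifiers are assumed to be unique only within a stratum, and PSUs are treated as if sampled with replacement, as in the derivation above.

```python
import numpy as np

def linear_stat_variance(u, stratum, psu):
    """Sketch of (24.55): sum_s C_s/(C_s-1) * sum_c (u_sc - ubar_s)^2.

    u       : per-observation contributions u_scj
    stratum : stratum label of each observation
    psu     : PSU label of each observation (unique within stratum)
    """
    u = np.asarray(u, dtype=float)
    stratum = np.asarray(stratum)
    psu = np.asarray(psu)
    total = 0.0
    for s in np.unique(stratum):
        in_s = stratum == s
        # PSU totals u_sc for stratum s
        u_sc = np.array([u[in_s & (psu == c)].sum()
                         for c in np.unique(psu[in_s])])
        C_s = u_sc.size
        if C_s < 2:
            raise ValueError(f"stratum {s!r} has fewer than two PSUs")
        total += C_s / (C_s - 1) * np.sum((u_sc - u_sc.mean()) ** 2)
    return total

# Toy data (made up): two strata, with two and three PSUs respectively.
u = [1.0, 2.0, 5.0, 2.0, 4.0, 6.0]
stratum = [0, 0, 0, 1, 1, 1]
psu = [0, 0, 1, 0, 1, 2]
v_hat = linear_stat_variance(u, stratum, psu)
```

In the toy data the PSU totals are (3, 5) in the first stratum and (2, 4, 6) in the second, so the two strata contribute 2 × 2 = 4 and 1.5 × 8 = 12 to the estimate. Only PSU totals enter the calculation, which is why no assumption about sampling within a PSU is needed.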


The estimator (24.55) is feasible only if $C_s \geq 2$, that is, if at least two PSUs are
drawn from each stratum. If only one PSU is drawn then one possibility is to collapse
the stratum that includes the single PSU into another stratum that is viewed a priori as
being reasonably similar. This collapsing will lead to overestimation of $\mathrm{V}[u]$, as an
upward bias is introduced because of the different means in different strata.1

In practice PSUs are sampled without replacement so there is some dependence in
$u_{sc}$. Then (24.55) overestimates $\mathrm{V}[u]$, similar to the situation in Section 24.2.3. More
complicated formulas have been proposed.

Variance  of  the  Weighted  Mean 

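The population mean is estimated by the ratio of the sample-weighted total of $y_{scj}$, say
$\hat{y}$, to the sum of the sample weights, say $\hat{w}$. Then

$$\bar{y}_W = \hat{y}/\hat{w} = \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} y_{scj} \Bigm/ \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj}.$$

If the sample weights are treated as known, then more simply

$$\bar{y}_W = \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w^*_{scj} y_{scj},$$

where $w^*_{scj} = w_{scj}/\hat{w}$, and $\mathrm{V}[\bar{y}_W]$ can be estimated using (24.55) with $u_{scj} = w^*_{scj} y_{scj}$.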

If the sample weights are treated as unknown then the delta method or linearization
method can be used to obtain $\mathrm{V}[\hat{y}/\hat{w}]$ as a function of $\mathrm{V}[\hat{y}]$, $\mathrm{V}[\hat{w}]$, and $\mathrm{Cov}[\hat{y}, \hat{w}]$.
The first two quantities can be estimated using (24.55) with $u_{scj} = w_{scj} y_{scj}$ and
$u_{scj} = w_{scj}$. The third quantity can be estimated with $(u_{sc} - \bar{u}_s)^2$ in (24.55) replaced
by $(u_{sc} - \bar{u}_s)(v_{sc} - \bar{v}_s)$, where $u_{scj} = w_{scj} y_{scj}$ and $v_{scj} = w_{scj}$. This is an example of
a ratio estimator.

For nonlinear statistics such as these ratio estimates, the literature proposes other
estimates based on the jackknife and balanced repeated replication. Because of the
nonlinearity the variance estimates are no longer unbiased but can be shown to be
consistent if the number of strata $S \to \infty$ (see Krewski and Rao, 1981). Some results
with $S$ fixed and $N_{cs} \to \infty$ are summarized in Wolter (1985). One can also
bootstrap, though care is needed. See Rao and Wu (1988) and Shao and Tu (1995).

Variance  of  Weighted  Least-Squares  Estimator 

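From Section 24.3, the weighted regression estimate $\hat{\boldsymbol{\beta}}_W$ of the census regression
parameters solves

$$\sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} \mathbf{x}_{scj} (y_{scj} - \mathbf{x}_{scj}' \hat{\boldsymbol{\beta}}_W) = \mathbf{0}.$$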

1  For  the  CPS  the  method  here  cannot  be  directly  applied  as  many  strata  have  only  one  PSU  and  for  other  strata 
only  one  PSU  is  collected.  Instead,  various  pseudo-strata  are  formed  and  replication  methods  are  used  that 
resample  PSUs  from  the  pseudo- strata.  See  U.S.  Census  Bureau  (2002). 
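
By the usual algebra, we have

$$\hat{\boldsymbol{\beta}}_W - \boldsymbol{\beta}_W = \Bigl( \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} \mathbf{x}_{scj} \mathbf{x}_{scj}' \Bigr)^{-1} \times \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} \mathbf{x}_{scj} (y_{scj} - \mathbf{x}_{scj}' \boldsymbol{\beta}_W).$$

This leads to the sandwich form $\mathrm{V}[\hat{\boldsymbol{\beta}}_W] = \mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}$, where $\mathbf{B}$ is the variance of the
second triple sum, which can be estimated using (24.55) with $\mathbf{u}_{scj} = w_{scj} \mathbf{x}_{scj} (y_{scj} -
\mathbf{x}_{scj}' \hat{\boldsymbol{\beta}}_W)$.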


Variance  of  Weighted  m-Estimator 

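A quite general framework considers the weighted m-estimator $\hat{\boldsymbol{\theta}}_W$ that solves

$$\sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} \mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \hat{\boldsymbol{\theta}}_W) = \mathbf{0}.$$

Examples include linear regression, $\mathbf{h}_{scj} = \mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}' \boldsymbol{\beta})$, and quasi-maximum
likelihood, $\mathbf{h}_{scj} = \partial \ln f(y_{scj} \mid \mathbf{x}_{scj}, \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$.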

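Assuming consistent estimation of $\boldsymbol{\theta}$, which requires that $\mathrm{E}[\mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \boldsymbol{\theta})] = \mathbf{0}$,
we can use the usual first-order Taylor series expansion on the estimating equation
to get

$$\sqrt{N}(\hat{\boldsymbol{\theta}}_W - \boldsymbol{\theta}) \xrightarrow{d} \mathcal{N}\bigl[\mathbf{0}, \mathbf{A}^{-1} \mathbf{B} \mathbf{A}'^{-1}\bigr],$$

where

$$\mathbf{A} = \operatorname{plim} N^{-1} \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} w_{scj} \frac{\partial \mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}$$

and

$$\mathbf{B} = \operatorname{plim} N^{-1} \sum_{s=1}^{S} \sum_{c=1}^{C_s} \sum_{j=1}^{N_{cs}} \sum_{k=1}^{N_{cs}} w_{scj} w_{sck} \mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \boldsymbol{\theta}) \mathbf{h}(y_{sck}, \mathbf{x}_{sck}, \boldsymbol{\theta})',$$

where the expression for $\mathbf{B}$ assumed independence of $\mathbf{h}_{scj}$ over strata and clusters but
permits dependence within a cluster. Estimation of $\mathbf{A}$ is straightforward. For $\mathbf{B}$ use
(24.55) with $\mathbf{u}_{scj} = w_{scj} \mathbf{h}_{scj}$, so

$$\hat{\mathbf{B}} = \sum_{s=1}^{S} \frac{C_s}{C_s - 1} \sum_{c=1}^{C_s} (\mathbf{z}_{sc} - \bar{\mathbf{z}}_s)(\mathbf{z}_{sc} - \bar{\mathbf{z}}_s)',$$

where $\mathbf{z}_{sc} = \sum_{j=1}^{N_{cs}} w_{scj} \mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \hat{\boldsymbol{\theta}})$ and $\bar{\mathbf{z}}_s = C_s^{-1} \sum_{c=1}^{C_s} \mathbf{z}_{sc}$.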


Endogenous  Stratification 

Sakata (1998) extends these results to endogenous sampling. He takes a census parameter
approach and provides asymptotic theory assuming the number of strata $S \to \infty$.
The results are the same as those given in the previous section.

24.9. Practical Considerations

It  is  most  common  in  microeconometrics  research  to  take  a  structural  approach.  Un¬ 
weighted  estimators  are  used,  provided  there  is  no  endogenous  stratification.  The  main 
concern  is  to  obtain  correct  standard  errors  if  clustering  is  present.  If  cluster  effects  are 
random  there  is  usually  little  efficiency  loss  in  ignoring  clustering  in  estimation.  Some 
packages  may  have  a  cluster  robust  standard  errors  option,  not  to  be  confused  with  a 
heteroskedasticity  robust  option,  which  is  appropriate  if  cluster  effects  are  random  and 
there  are  many  clusters.  The  CSRE  and  CSFE  models  can  be  implemented  using  OLS, 
provided  in  the  case  of  CSFE  there  are  not  too  many  clusters.  Alternatively,  a  panel 
data  module  can  be  used  if  it  supports  unbalanced  panels.  As  with  panel  data  most 
researchers  outside  econometrics  are  content  to  take  a  random  effects  approach,  but  a 
fixed  effects  approach  may  be  necessary  for  consistent  estimation. 
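
For readers whose package lacks a cluster-robust option, the calculation is short enough to code directly. The sketch below is illustrative only: the function name `ols_cluster_robust` is hypothetical, and it uses the basic sandwich form $(\mathbf{X}'\mathbf{X})^{-1} \bigl[\sum_c \mathbf{X}_c' \hat{\mathbf{u}}_c \hat{\mathbf{u}}_c' \mathbf{X}_c\bigr] (\mathbf{X}'\mathbf{X})^{-1}$ with no finite-sample correction, so packaged routines that apply such corrections will give slightly different numbers.

```python
import numpy as np

def ols_cluster_robust(X, y, cluster):
    """OLS coefficients with cluster-robust standard errors (no small-sample scaling)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    K = X.shape[1]
    meat = np.zeros((K, K))
    for c in np.unique(cluster):
        in_c = cluster == c
        score = X[in_c].T @ resid[in_c]   # X_c' u_c, summed within the cluster
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

# Simulated example (made-up data): 20 clusters of 10 with a common cluster effect.
rng = np.random.default_rng(0)
cluster = np.repeat(np.arange(20), 10)
x = rng.normal(size=200) + rng.normal(size=20)[cluster]
errors = rng.normal(size=200) + rng.normal(size=20)[cluster]
X = np.column_stack([np.ones(200), x])
y = X @ np.array([1.0, 2.0]) + errors
beta, se_cluster = ols_cluster_robust(X, y, cluster)
```

If every observation is placed in its own cluster, the same function reproduces the heteroskedasticity-robust (White) standard errors, which makes clear why the two options should not be confused: they differ only in which cross products of residuals are retained.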

If  a  descriptive  approach  is  taken  and  parameters  vary  over  strata  then  weighting 
is  necessary.  A  weighting  option  within  least  squares  can  be  used,  but  it  needs  to  be 
combined  with  a  cluster-robust  standard  errors  option.  Some  packages  have  a  survey 
estimation  module  that  obtains  cluster-robust  standard  errors  using  the  methods  of 
Section  24.6.  The  package  SUDAAN  implements  many  of  the  methods  in  this  chapter 
for  linear  and  leading  nonlinear  regression  models. 

24.10.  Bibliographic  Notes 

24.2-24.3  The  literature  on  survey  sampling  is  vast.  Classic  references  on  sample  surveys  in¬ 
clude  Kish  (1965)  and  Cochran  (1977,  first  edition  1953).  Skinner  (1989)  provides 
a  useful  overview  and  Groves  (1989)  provides  a  relatively  nontechnical  treatment 
that  presents  the  approaches  of  many  of  the  social  sciences  to  surveying,  while  rais¬ 
ing  many  useful  practical  issues.  For  completeness  we  have  incorporated  some  of 
this  survey  sampling  literature,  though  econometrics  studies  rarely  implement  the 
methods  in  Section  24.8.  There  are  few  econometrics  references,  with  the  notable 
exception  of  chapters  in  Pudney  (1989)  and  Deaton  (1997)  and  a  book  chapter  by 
Ullah  and  Breunig  (1998). 

24.4  The  main  focus  of  the  theoretical  econometrics  literature  has  been  controlling  for  en¬ 
dogenous  stratification.  This  literature  is  challenging  and  we  have  merely  provided 
an  overview.  For  detail  see  Amemiya  (1985),  who  provides  many  references  includ¬ 
ing  Manski  and  Lerman  (1977)  for  discrete-choice  models  and  Hausman  and  Wise 
(1979)  for  sample  selection  models.  The  simple  weighted  estimator  is  generally  ap¬ 
propriate  albeit  inefficient.  Imbens  and  Lancaster  (1996)  present  a  practical  way  to 
implement  a  fully  efficient  estimator  given  specification  of  the  conditional  density. 

24.5  For  microeconometrics  applications  controlling  for  clustering  is  of  greatest  impor¬ 
tance.  The  works  by  Kloek  (1981)  and  Moulton  (1986,  1990)  were  key  in  alerting 
econometricians  to  this  problem.  Davis  (2002)  gives  a  general  treatment  of  multi¬ 
way  error  component  models.  Graubard  and  Korn  (1994)  provide  a  useful  discussion 
of  linear  regression  analysis  of  clustered  data.  They  pay  attention  to  both  fixed  and 
random  effects  models,  with  emphasis  on  the  assumptions  that  must  be  satisfied  for 
the  random  effects  model  to  be  valid.  Pendergast  et  al.  (1996)  provide  an  extensive 
survey  of  the  methods  for  analyzing  clustered  binary  data.  Because  the  middle  term 
on  the  right-hand  side  of  (24.34)  involves  averaging  over  the  number  of  clusters,  the 
precision of this estimate depends on the number of clusters. The consequences of
using  the  cluster-robust  variance  matrix  when  the  number  of  clusters  is  small  con¬ 
tinues  to  be  a  topic  of  research  (Donald  and  Lang,  2001;  Angrist  and  Lavy,  2002). 
Wooldridge  (2003)  provides  an  overview. 

24.6  Hierarchical  linear  models  have  been  extensively  used  in  social  sciences.  Bryk  and 
Raudenbush  (2002)  provide  a  comprehensive  coverage  of  binary,  ordered,  counted, 
and  multinomial  outcomes  from  both  likelihood  and  Bayesian  perspectives. 

24.7  Deaton  (1997)  examines  a  number  of  issues  of  modeling  using  data  from  clustered 
samples  from  various  Living  Standards  Surveys  conducted  in  developing  economies 
by  the  World  Bank. 

24.8  Many  standard  statistical  software  packages  (e.g.,  STATA  and  SUDAAN)  accommo¬ 
date  both  fixed  and  random  effects  formulations  of  clustering  in  linear  and  nonlinear 
models  for  cross-section  and  panel  data. 


- Exercises - 

24-1 (a) Verify the expression for $\lambda_c$ given at (24.25).

(b) Prove the consistency property of the estimators $\hat{\sigma}^2$ and $\hat{\rho}$ in the CSRE
model.

(c) Consider the bias of the standard errors in the balanced cluster CSRE
model. Show that in this case $\mathrm{E}\bigl[\sum_c \sum_j \hat{u}_{cj}^2\bigr] = \sigma^2 [N - K(1 + \rho(m - 1))]$.

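24-2 (Adapted from Greenwald, 1983) Consider the linear regression model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathrm{E}[\mathbf{u}] = \mathbf{0}$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'] = \sigma^2 \boldsymbol{\Omega}^* = \boldsymbol{\Omega}$. By standard results for
the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (see Section 4.4) we can obtain the correct
expression for $\mathrm{V}[\hat{\boldsymbol{\beta}}]$ as $\mathbf{V}_2 = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}$, whereas $\mathbf{V}_1 = \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}$
with $\hat{\sigma}^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)$ is invalid if $\boldsymbol{\Omega} \neq \mathbf{I}$.

(a) Show that the bias of $\mathbf{V}_1$ is given by $\mathbf{B} = \mathbf{B}_1 + \mathbf{B}_2$, where
$\mathbf{B}_2 = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\boldsymbol{\Omega} - \sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ and $\mathbf{B}_1 = (N - K)^{-1}\,\mathrm{tr}\{\mathbf{B}_2(\mathbf{X}'\mathbf{X})\}(\mathbf{X}'\mathbf{X})^{-1}$.
(Greenwald refers to $\mathbf{B}_2$ as "direct bias.")

(b) Evaluate the two terms for the special case of $\mathbf{X}'\mathbf{X} = \mathbf{I}_K$. Show that $\mathbf{B} \to \mathbf{B}_2$
as $N \to \infty$.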

24-3  Consider  the  OLS  cluster-robust  variance  estimator  formula  (24.33).  Suppose 
there  are  two  levels  of  clustering.  Specifically,  in  the  context  of  the  empirical 
example  of  this  chapter,  clustering  could  be  at  the  level  of  family  and  commune 
if  multiple  members  of  the  family  from  the  same  commune  are  included  in  the 
survey.  How  will  the  formula  be  modified  if  the  data  have  two  levels  of  clustering? 

24-4  For  this  exercise  use  a  50%  sample  of  the  VLSMS  data.  Define  y  =  1  if  the 
subject  has  at  least  one  pharmacy  visit  (PHARVIS)  and  y  =  0  otherwise.  This 
example  presumes  access  to  a  program  that  handles  clustering. 

(a)  Using  the  same  explanatory  variables  as  those  for  the  Poisson  model  in 
Section  24.7,  estimate  a  binary  logit  model  by  maximum  likelihood,  us¬ 
ing  both  the  standard  estimator  and  the  robust  sandwich  estimator  for  the 
variance. 

(b)  Reestimate  the  specification  of  part  (a)  using  the  cluster-robust  standard 
error  option.  Explain  the  differences  between  the  robust  standard  errors  of 
parts  (a)  and  (b). 

(c) Use the "commune" variable as a cluster identifier. Reestimate the logit
model  using  the  cluster  fixed  effects  and  cluster  random  effects  specifi¬ 
cation.  Compare  the  estimates  and  standard  errors  of  the  coefficients  of 
LNHHEXP  and  INSURANCE.  Are  the  conclusions  about  the  significance  of 
these  variables  affected  by  clustering  in  the  data? 

Treatment Evaluation


25.1.  Introduction 

The  topic  of  treatment  evaluation  concerns  measuring  the  impact  of  interventions  on 
outcomes  of  interest,  with  the  type  of  intervention  and  outcome  being  defined  broadly 
so  as  to  apply  to  many  different  contexts.  The  treatment  evaluation  approach  and  some 
of its terminology come from the medical sciences, where intervention frequently means
adopting  a  treatment  regime.  Subsequently,  one  may  be  interested  in  measuring  the 
response  to  the  treatment  relative  to  some  benchmark,  such  as  no  treatment  or  a  differ¬ 
ent  treatment.  In  economic  applications  treatment  and  interventions  usually  mean  the 
same  thing. 

Examples  of  treatments  in  the  economic  context  are  enrollment  into  a  labor  train¬ 
ing  program,  being  a  member  of  a  trade  union,  receipt  of  a  transfer  payment  from 
a  social  program,  changes  in  regulations  for  receiving  a  transfer  from  a  social  pro¬ 
gram,  changes  in  rules  and  regulations  pertaining  to  financial  transactions,  changes 
in  economic  incentives,  and  so  forth;  see  Moffitt  (1992),  Friedlander,  Greenberg,  and 
Robbins  (1997),  and  Heckman,  Lalonde,  and  Smith  (1999).  If  the  treatment  that  is 
applied  can  vary  in  intensity  or  type,  we  use  the  term  multiple  treatments  when  re¬ 
ferring  to  them  collectively.  Relative  to  a  single  type  of  treatment  this  does  not  create 
complications,  but  now  the  choice  of  a  benchmark  for  comparisons  is  more  flexible. 

The  term  outcome  refers  to  changes  in  economic  status  or  environment  on  eco¬ 
nomic  outcomes  of  individuals.  A  leading  case  is  one  in  which  the  outcome  of  interest 
is  a  continuous  variable,  say  y,  whereas  the  treatment  variable  is  discrete  and  of  on/off 
variety,  say  D,  where  D  takes  the  value  1  if  the  treatment  is  applied  and  is  0  otherwise. 
An  example  of  an  intervention  is  labor  market  training,  which  could  affect  posttraining 
wages  of  the  worker.  In  general,  however,  either  the  outcome  or  treatment  can  be  con¬ 
tinuous  or  discrete  or  exhibit  limited  variation.  Whereas  the  details  of  the  analysis  will 
vary,  certain  key  ideas  will  be  relevant  in  all  situations.  For  simplicity,  we  will  take  the 
case  of  a  continuous  outcome  and  a  binary-valued  treatment  as  our  leading  case.  Later 
we  will  extend  the  analysis  to  other  practically  relevant  situations. 

Policy relevance of treatment evaluation is direct because "successful" treatments
can  be  linked  to  desirable  social  programs,  or  improvements  in  existing  programs  to 
attain  objectives  of  social  policy.  Heckman  and  Smith  (1998)  have  discussed  the  rela¬ 
tionship  between  several  commonly  used  measures  of  treatment  impact  and  traditional 
cost-benefit  analysis. 

The  standard  problem  in  treatment  evaluation  involves  the  inference  of  a  causal 
connection between the treatment and the outcome. In a canonical single-treatment example
we observe $(y_i, \mathbf{x}_i, D_i)$, $i = 1, \ldots, N$, and the impact of a hypothetical change
in $D$ on $y$, holding $\mathbf{x}$ constant, is of interest. Such inference is the main feature of
the  potential  outcome  model,  already  introduced  in  Chapter  2,  in  which  the  outcome 
variable  of  interest  is  compared  in  the  treated  and  nontreated  states.  However,  no  in¬ 
dividual  is  simultaneously  observed  in  both  states.  Hence,  the  situation  is  akin  to  one 
of  missing  data,  and  it  can  be  tackled  by  methods  of  causal  inference  carried  out  in 
terms  of  counterfactuals.  We  ask  how  the  outcome  of  an  average  untreated  individual 
would  change  if  such  a  person  were  to  receive  the  treatment.  That  is,  a  magnitude  like 
$\Delta y / \Delta D$ is of interest. Fundamentally one's interest lies in the outcomes that result
from, or are caused by, such interventions. Here causation is in the sense of ceteris
paribus, meaning that we hold all other variables constant.

What  is  the  difference  between  this  chapter  and  earlier  ones  in  which  we  also  con¬ 
sidered  the  identification  and  estimation  of  a  variety  of  models?  There  are  many  sim¬ 
ilarities  and  the  differences  arise  from  a  shift  of  emphasis.  The  main  difference  stems 
from  the  focus  on  a  family  of  measures  of  treatment  effectiveness.  These  measures  are 
functions  of  parameters  and  data,  and  they  enable  comparisons  with  policy-relevant 
counterfactuals.  An  important  and  interesting  result  is  that  not  all  measures  can  be 
constructed,  given  the  data  and  the  estimator.  The  choice  of  an  estimator  and  the  type 
of  data  used  in  model  estimation  place  restrictions  on  the  counterfactuals  that  can  be 
identified,  and  hence  on  the  impact  measures  that  can  be  consistently  estimated. 

Another  emphasis  in  the  literature  on  treatment  evaluation  is  on  the  advantages  of 
identification secured using minimal functional form and exclusion restrictions (e.g.,
semiparametric identification). This emphasis is motivated by the desire to produce
results that have policy significance but whose validity does not depend on strong
assumptions. The feasibility of semiparametric identification is easier to
establish  for  treatment  effect  estimation  in  linear  models,  with  continuous  support 
for  the  dependent  variable,  than  it  is  in  nonlinear  models  with  limited  dependent 
variables. 

Section  25.2  discusses  identification  assumptions.  Section  25.3  presents  mea¬ 
sures  of  treatment  effect  that  are  usually  targeted  in  identification  and  estimation. 
Section  25.4  analyzes  matching  and  propensity  score  estimators.  Differences-in- 
differences  estimators  of  treatment  effects  that  are  common  in  event  studies  with  a 
quasi-experimental  data  setup  are  covered  in  Section  25.5.  Continuing  with  a  quasi- 
experimental  setup,  we  discuss  the  regression  discontinuity  design  in  Section  25.6,  fol¬ 
lowed  by  the  instrumental  variable  estimator  in  Section  25.7.  Much  of  the  discussion 
up  to  this  point  is  related  to  linear  models.  Section  25.8  provides  a  detailed  empirical 
illustration  of  the  methods  developed  in  the  chapter. 

25.2. Setup and Assumptions

The  methods  for  estimation  of  treatment  effects  rely  on  assumptions  to  permit  iden¬ 
tification  of  causal  effects  just  as,  for  example,  the  linear  SEM  relies  on  assumptions 
to  permit  causal  effects  (see  Chapter  2).  In  this  section  we  detail  the  assumptions  that 
permit  use  of  the  key  matching  and  propensity  score  estimators  that  are  presented  later 
in  Section  25.4. 

First  we  consider  a  framework  for  estimating  causal  parameters  in  treatment 
evaluation. 


25.2.1.  Treatment  Effects  Framework 

Let  us  begin  with  the  setup  of  randomized  treatment  assignment  in  a  social  experiment 
as  described  in  Section  3.3.  Let  there  be  a  target  population  for  the  treatment  of  interest 
and  let  N  denote  the  number  of  randomly  selected  individuals  who  are  eligible  for 
treatment. Let $N_T$ denote the number of randomly selected individuals who are treated
and let $N_C = N - N_T$ denote the number of nontreated individuals who serve as a
potential  control  group. 

Random  assignment  implies  that  the  treatment  assignment  ignores  the  possible 
impact  of  the  treatment  on  the  outcomes.  For  example,  no  one  is  included  in  the 
treatment  group  on  the  grounds  that  the  benefit  of  the  treatment  to  that  individual 
would  be  large,  and  no  one  is  excluded  because  the  expected  benefit  is  small.  Let 
$(y_i, x_i, D_i;\ i = 1, \ldots, N)$ be the vector of observations on the scalar-valued outcome
variable  y,  a  vector  of  observable  variables  x,  and  a  binary  indicator  of  a  treatment 
variable  D.  For  simplicity,  we  assume  that  anyone  who  is  assigned  treatment  gets  it, 
and anyone who is not does not get it. The outcome variable of the treated individual is
denoted $y_1$ and that for the nontreated individual is denoted $y_0$. After the experiment is
run  and  data  are  collected,  we  would  like  to  obtain  a  measure  of  the  treatment  impact. 
The  most  natural  way  of  measuring  the  effect  of  the  treatment  would  be  to  construct  a 
measure  that  compares  the  average  outcomes  of  the  treated  and  nontreated  groups. 
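
As a rough illustration of the comparison of average outcomes under random assignment, the following sketch simulates an experiment and computes the difference in means. The function name `simulate_experiment`, the constant treatment effect of 2.0, and the normal distribution of outcomes are assumptions made for this example only; they do not come from the text.

```python
import random
import statistics

def simulate_experiment(n=10_000, effect=2.0, seed=0):
    """Simulate a randomized experiment; return the treated-control mean difference."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        y0 = rng.gauss(10.0, 1.0)   # outcome in the untreated state
        y1 = y0 + effect            # outcome in the treated state (constant effect, assumed)
        d = rng.random() < 0.5      # coin-flip assignment, independent of (y0, y1)
        # Only one potential outcome is observed for each individual.
        (treated if d else control).append(y1 if d else y0)
    return statistics.mean(treated) - statistics.mean(control)

diff = simulate_experiment()
print(round(diff, 2))  # close to the assumed effect of 2.0
```

Because assignment here ignores the potential outcomes, the simple difference in means recovers the assumed effect up to sampling error.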

With  one  important  difference  the  same  data  setup  could  be  applied  to  observational 
data.  The  difference  is  that  there  is  no  random  assignment  mechanism  for  treatment, 
perhaps  because  individuals  choose  to  be  treated,  or  because  of  some  other  reason. 

It needs to be stated at the outset that most treatment evaluation studies have a partial
equilibrium character. Specifically, they assume an absence of general equilibrium
effects. By that we mean that the treatment effects are small and do not affect the status
of some of the variables that are treated as exogenous. This assumption is untenable
for a treatment program that affects an entire sector that is
a significant part of the national economy. For example, instituting universal health
insurance may have an impact on the entire health services sector, which would make it
difficult to apply the methods discussed in this chapter.

There  are  potential  pitfalls  in  constructing  estimates  of  treatment  effects.  There  are 
also  subtle  differences  of  interpretations  that  arise  from  variations  in  the  assumptions 
used  to  construct  such  measures.  Therefore,  we  begin  by  examining  these  assumptions. 

25.2.2. Conditional Independence Assumption

Meaningful comparisons between the outcomes of the two groups require some assumptions.
We shall initially list and explain these assumptions and later use them in
the discussion of identifiability of certain treatment effects.

An important assumption is the conditional independence assumption, which states
that conditional on $x$, the outcomes are independent of treatment, written as

$$ y_0, y_1 \perp D \mid x. \tag{25.1} $$

The behavioral implication of this assumption is that participation in the treatment program
does not depend on outcomes, after controlling for the variation in outcomes induced
by differences in $x$. Random assignment, properly applied, will validate this assumption.
Indeed, under completely random assignment one may even make a stronger
assumption

$$ y_0, y_1 \perp D, \tag{25.2} $$

because randomization would be over $(y, x)$ space. The more commonly used assumption
(25.1), if valid, can be useful for identification of some impact parameters because
it states that once we control for the effects of regressors $x$, some of which may be related
to $D$, treatment and outcomes are independent.

The  conditional  independence  assumption  is  broad  and  implies  the  following: 

$$ F(y_j \mid x, D = 1) = F(y_j \mid x, D = 0) = F(y_j \mid x), \quad j = 0, 1, \tag{25.3} $$

$$ F(u_j \mid x, D = 1) = F(u_j \mid x, D = 0) = F(u_j \mid x), \quad j = 0, 1, $$

where  u  is  the  regression  model  error,  which  means  that  the  participation  decision  does 
not  affect  the  distribution  of  potential  outcomes. 

To see the impact of this assumption let $E[y \mid x, D]$ be linear; that is, the
outcome-participation equation is

$$ y = x'\beta + \alpha D + u, \tag{25.4} $$

where $E[u \mid D] = E[y - x'\beta - \alpha D \mid D] = 0$. Therefore, $D$ may be treated as an exogenous
variable, and there will be no simultaneity bias or selection bias. Under the standard
conditions on $x$, consistent estimation of regression parameters is possible.

An  assumption  that  is  weaker  than  (25.1)  is 

$$ y_0 \perp D \mid x, \tag{25.5} $$

which implies conditional independence of participation and $y_0$. This assumption is
used  in  establishing  identifiability  of  a  population-average  treatment  effect  on  the 
treated  (ATET),  as  will  be  seen  later. 

Assumption  (25.5)  has  other  names  in  the  literature.  Imbens  (2005)  refers  to  it  as  the 
unconfoundedness  assumption  and  Rubin  refers  to  it  as  the  ignorability  assumption 
(Rubin,  1978;  Wooldridge,  2001).  If  valid,  the  assumption  implies  that  there  is  no 
omitted  variable  bias  once  x  is  included  in  the  regression,  and  hence  there  will  be 
no  confounding.  The  assumption  is  tantamount  to  treatment  assignment  that  ignores 
outcomes;  hence  it  is  appropriate  to  refer  to  it  as  the  ignorability  assumption. 
This assumption is necessary if the treatment variable is to be treated as exogenous,
which  is  essential  for  simplicity  in  estimation.  If  valid,  sample  selection  models  or  IV 
methods  to  handle  endogenous  treatment  variables  are  not  needed,  and  the  methods  of 
Section  25.4  can  be  applied. 


25.2.3.  Matching  Assumption 

A second assumption, referred to as the overlap or matching assumption, is necessary
for identifying some population measures of impact. It states that

$$ 0 < \Pr[D = 1 \mid x] < 1. \tag{25.6} $$

This  assumption  ensures  that  for  each  value  of  x  there  are  both  treated  and  nontreated 
cases.  In  that  sense  there  is  overlap  between  the  treated  and  untreated  subsamples.  For 
each  treated  individual  there  is  another  matched  untreated  individual  with  a  similar 
x.  If  the  assumption  were  to  fail,  then  we  could  potentially  have  individuals  with  x 
vectors  who  are  all  treated  and  those  with  a  different  x  who  are  all  untreated.  This 
condition  is  not  required  for  identifying  the  treatment  parameter  for  the  treated  group. 
For  identifying  the  treatment  effect  on  a  randomly  selected  individual  one  needs  for 
each participant an analogous nonparticipant. Then the condition $\Pr[D = 1 \mid x] < 1$ is
sufficient. 
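
With a discrete covariate, the overlap condition (25.6) can be inspected cell by cell. The sketch below is one way to do this; the helper names `overlap_by_cell` and `cells_violating_overlap` and the toy data are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

def overlap_by_cell(x, d):
    """Return {cell: empirical share treated} for each value of a discrete covariate."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_untreated, n_treated]
    for xi, di in zip(x, d):
        counts[xi][di] += 1
    return {cell: n1 / (n0 + n1) for cell, (n0, n1) in counts.items()}

def cells_violating_overlap(x, d):
    """Cells in which everyone is treated or no one is treated."""
    shares = overlap_by_cell(x, d)
    return sorted(cell for cell, share in shares.items() if share in (0.0, 1.0))

# Toy data (assumed): x is an age group, d is the treatment indicator.
x = ["young", "young", "old", "old", "old", "middle", "middle"]
d = [1, 0, 1, 1, 1, 0, 1]
shares = overlap_by_cell(x, d)
bad = cells_violating_overlap(x, d)
print(shares)  # {'young': 0.5, 'old': 1.0, 'middle': 0.5}
print(bad)     # ['old']
```

In this toy sample every "old" individual is treated, so no untreated match exists in that cell and comparisons there would rest on extrapolation.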


25.2.4.  Conditional  Mean  Assumption 
A  third  assumption  is  the  conditional  mean  independence  assumption 

$$ E[y_0 \mid D = 1, x] = E[y_0 \mid D = 0, x] = E[y_0 \mid x], \tag{25.7} $$

which implies that $y_0$ does not determine participation.


25.2.5.  Propensity  Scores 

When  treatment  participation  is  not  by  random  assignment  but  depends  stochastically 
on  a  vector  of  observable  variables  x,  as  in  observational  data  or  when  the  treatment  is 
targeted  to  some  population  defined  by  some  observable  characteristics  (such  as  age, 
sex,  or  socioeconomic  status),  then  the  concept  of  propensity  scores  is  useful.  This 
is  a  conditional  probability  measure  of  treatment  participation  given  x  and  is  denoted 
p(x),  where 

$$ p(x) = \Pr[D = 1 \mid X = x]. \tag{25.8} $$

The propensity score measure can be computed given the data $(D_i, x_i)$ using any of
the  parametric  or  semiparametric  methods  covered  in  Chapter  14  (e.g.,  by  doing  a  logit 
regression). 
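
A minimal sketch of the logit approach follows, assuming a single scalar covariate and simulated data. The function `fit_logit`, the Newton-Raphson implementation, and the true parameter values (-0.5 and 1.0) are assumptions made for this illustration; in practice a statistical package would be used.

```python
import math
import random

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def fit_logit(x, d, iterations=25):
    """Newton-Raphson ML estimates of (a, b) in Pr[D=1|x] = logistic(a + b*x)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = logistic(a + b * xi)
            w = p * (1.0 - p)
            g0 += di - p              # score with respect to a
            g1 += (di - p) * xi       # score with respect to b
            h00 += w                  # elements of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: (negative Hessian)^{-1} * score
        b += (h00 * g1 - h01 * g0) / det
    return a, b

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
d = [1 if rng.random() < logistic(-0.5 + 1.0 * xi) else 0 for xi in x]
a_hat, b_hat = fit_logit(x, d)
scores = [logistic(a_hat + b_hat * xi) for xi in x]  # estimated propensity scores
print(round(a_hat, 2), round(b_hat, 2))
```

The fitted values `scores` play the role of the estimated $p(x_i)$; each lies strictly between 0 and 1 by construction of the logistic function.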

An  assumption  that  plays  an  important  role  in  treatment  evaluation  is  the  balancing 
condition,  which  states  that 

$$ D \perp x \mid p(x). \tag{25.9} $$



Table 25.1. Treatment Effects Framework

Symbol     Definition
$y_1$      Outcome for the treated group
$y_0$      Outcome for the nontreated group
$p(x)$     Propensity score
$N_T$      Number of treated cases in the sample

This  can  be  expressed  alternatively  by  saying  that  for  individuals  with  the  same 
propensity  score  the  assignment  to  treatment  is  random  and  should  look  identical  in 
terms  of  their  x  vector.  The  balancing  condition  is  a  testable  hypothesis. 
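
One informal way to examine balance is to group observations into strata of the propensity score and compare the mean of $x$ between treated and untreated cases within each stratum. The sketch below does this on simulated data; it uses the true score rather than an estimate, ten equal-sized strata, and the helper names `mean_diff` and `stratum_balance`, all of which are assumptions for illustration rather than a procedure taken from the text.

```python
import math
import random
import statistics

def mean_diff(x, d, idx):
    """Mean of x among treated minus mean of x among untreated, over indices idx."""
    x1 = [x[i] for i in idx if d[i] == 1]
    x0 = [x[i] for i in idx if d[i] == 0]
    return statistics.mean(x1) - statistics.mean(x0)

def stratum_balance(x, d, p, n_strata=10):
    """Treated-untreated gap in x within equal-sized strata of the score p."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    size = len(order) // n_strata
    return [mean_diff(x, d, order[s * size:(s + 1) * size]) for s in range(n_strata)]

rng = random.Random(4)
x = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
p = [1.0 / (1.0 + math.exp(-xi)) for xi in x]       # true propensity score (assumed known)
d = [1 if rng.random() < pi else 0 for pi in p]

raw_gap = mean_diff(x, d, range(len(x)))            # imbalance before stratifying
within_gaps = stratum_balance(x, d, p)              # imbalance within score strata
print(round(raw_gap, 2), [round(g, 2) for g in within_gaps])
```

Because the strata are coarse, the within-stratum gaps are not exactly zero, but they should be much smaller than the raw gap; large remaining gaps would suggest using finer strata or respecifying the score model.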

A  useful  result  about  conditional  independence  given  p(x)  due  to  Rosenbaum  and 
Rubin  (1983)  states  that 

$$ y_0, y_1 \perp D \mid x \;\Rightarrow\; y_0, y_1 \perp D \mid p(x). \tag{25.10} $$

This implies that the conditional independence assumption given $x$ implies conditional
independence given $p(x)$, that is, independence of $y_0$, $y_1$, and $D$ given $p(x)$.

To  obtain  this  result,  note  that 

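\begin{align}
\Pr[D = 1 \mid y_0, y_1, p(x)] &= E[D \mid y_0, y_1, p(x)] \notag \\
&= E[E[D \mid y_0, y_1, p(x), x] \mid y_0, y_1, p(x)] \notag \\
&= E[E[D \mid y_0, y_1, x] \mid y_0, y_1, p(x)] \notag \\
&= E[E[D \mid x] \mid y_0, y_1, p(x)] \notag \\
&= E[p(x) \mid y_0, y_1, p(x)] \notag \\
&= p(x). \notag
\end{align}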

Here the second line follows from the law of iterated expectations, and the third holds
because $p(x)$ is a function of $x$, so conditioning on it in addition to $x$ is redundant.
The fourth line uses conditional independence. The intuition behind this result is that $p(x)$ is a
particular function of $x$ and, in a sense, contains less information than $x$. Hence
conditional independence given $x$ implies conditional independence given $p(x)$. Just as
conditioning on $x$ removes the correlation between $x$ and $D$, conditioning
on the propensity score $p(x)$ also expunges the correlation between $x$ and $D$. Thus
a regression similar to (25.4) is

\begin{align}
y &= x'\beta + \alpha p(x) + u \tag{25.11} \\
&= x'\beta + \alpha \hat{p}(x) + (u + \alpha(p(x) - \hat{p}(x))), \tag{25.12}
\end{align}

where in the second line the unknown $p(x)$ is replaced by a sample estimate $\hat{p}(x)$, resulting
in the addition of the sampling error to the regression error. The pros and cons of this
strategy will be considered later. Table 25.1 summarizes the notation.


25.3.  Treatment  Effects  and  Selection  Bias 

We begin by presenting two widely used measures of treatment effect: one that averages
over all individuals and one that averages over only the treated. We then discuss
in some detail the role of selection into treatment. The methods presented in Sections
25.4-25.6 presume that selection effects directly depend on only measurable observed
characteristics of the individual, such as age. If additionally selection effects depend
on unobservables then the methods of Chapter 16 must instead be used. The current
section includes considerable discussion of selection issues.


25.3.1.  Two  Key  Parameters:  ATE  and  ATET 
Define $\Delta$ as the difference between the outcome in the treated and untreated states

$$ \Delta = y_1 - y_0, \tag{25.13} $$

where we may condition on $x$ if desired. It is emphasized that $\Delta$ is not directly observable
because no individual can be observed in both states. Population values of the
average treatment effect and average treatment effect on the treated are defined as


$$ \text{ATE} = E[\Delta], \tag{25.14} $$

$$ \text{ATET} = E[\Delta \mid D = 1], \tag{25.15} $$

with sample analogues

$$ \widehat{\text{ATE}} = \frac{1}{N} \sum_{i=1}^{N} [\Delta_i], \tag{25.16} $$

$$ \widehat{\text{ATET}} = \frac{1}{N_T} \sum_{i=1}^{N_T} [\Delta_i \mid D_i = 1], \tag{25.17} $$

where $N_T = \sum_{i=1}^{N} D_i$. In each of these two cases, computation is straightforward if
$\Delta_i$ can be obtained. The procedure is not direct because the formulas have an unobserved
component that must be estimated and that step calls for some assumptions.

The  ATE  measure  is  relevant  when  the  treatment  has  universal  applicability  so  that 
it  is  reasonable  to  consider  the  hypothetical  gain  from  treatment  to  a  randomly  selected 
member  of  the  population.  The  ATET  measure  is  relevant  when  we  want  to  consider 
the  average  gain  from  treatment  for  the  treated.  See  Heckman  and  Vytlacil  (2002). 
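
The distinction between the two measures can be seen in a simulation in which individual gains are known by construction, something never available in real data. The sketch below assumes heterogeneous gains with mean 1 and a participation rule that favors individuals with larger gains; the function name `simulate_ate_atet` and all numerical values are hypothetical.

```python
import random
import statistics

def simulate_ate_atet(n=50_000, seed=1):
    """Return (ATE, ATET) computed from simulated individual gains Delta_i."""
    rng = random.Random(seed)
    gains, treated_gains = [], []
    for _ in range(n):
        gain = rng.gauss(1.0, 1.0)               # Delta_i = y1_i - y0_i (assumed known here)
        d = gain + rng.gauss(0.0, 1.0) > 1.0     # larger gains make participation more likely
        gains.append(gain)
        if d:
            treated_gains.append(gain)
    ate = statistics.mean(gains)                 # sample analogue of E[Delta]
    atet = statistics.mean(treated_gains)        # sample analogue of E[Delta | D = 1]
    return ate, atet

ate, atet = simulate_ate_atet()
print(round(ate, 2), round(atet, 2))
```

Under this assumed participation rule the ATET exceeds the ATE, because those who take up treatment are disproportionately those who gain most from it.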

To  understand  the  treatment  evaluation  problem  consider  the  average  gain  from 
participation  given  characteristics  x.  This  is 

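\begin{align}
\text{ATE} &= E[\Delta \mid X = x] \tag{25.18} \\
&= E[y_1 - y_0 \mid X = x] \notag \\
&= E[y_1 \mid X = x] - E[y_0 \mid X = x] \notag \\
&= E[y_1 \mid x, D = 1] - E[y_0 \mid x, D = 0], \notag
\end{align}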

where  the  last  equality  uses  the  conditional  independence  assumption  (25.1). 

Given a sample of participants, $E[y_1 \mid D = 1, x]$ can be estimated. However,
$E[y_0 \mid x, D = 0]$ is not observable because it is a measure of the average outcomes
for  the  participants  had  they  in  fact  not  participated,  and  one  cannot  simultaneously 
observe  the  same  individuals  as  both  participants  and  nonparticipants.  To  make  ATE 
operational  we  must  find  an  estimator  for  the  second  term. 
By definition (25.18)

\begin{align}
\text{ATE} &= E[y_1 \mid x, D = 1] - E[y_0 \mid x, D = 0] \tag{25.19} \\
&= \mu_1(x) - \mu_0(x) + E[u_1 \mid x, D = 1] - E[u_0 \mid x, D = 0] \notag \\
&= \mu_1(x) - \mu_0(x) + E[u_1 \mid x] - E[u_0 \mid x] \notag \\
&= \mu_1(x) - \mu_0(x), \tag{25.20}
\end{align}

where the first term in the first line on the right-hand side can be estimated using
the data from treatment participants, but the second term is not directly observable.
The next three lines follow by applying the conditional independence and conditional
mean assumption and adopting the specifications $y_1 = \mu_1(x) + u_1$ for the treated and
$y_0 = \mu_0(x) + u_0$ for the untreated. The second from the last line only requires mean
independence rather than full conditional independence.


25.3.2.  Sampling  and  Selection  Bias 

The crux of the evaluation problem is that $E[y_0 \mid x, D = 1]$ is unobservable. The solution
to the problem depends in part on the type of data available. Social experiments
use the eligible participants that are excluded from the treatment group as a proxy
for the counterfactual. Observational studies generate a comparison group from the
same source as the treated group, or from other databases, and essentially end up using
some function of $E[y_0 \mid x, D = 0]$ that can be estimated using data from nonparticipants.
The simplicity of the computation when the data come from a well-designed
and executed social experiment should be viewed against the background of actual
social experiments, which are subject to other problems such as randomization bias
and substitution bias (discussed in Chapter 3).

Suppose  that  for  the  treated  participants  the  outcome  equation  is 


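\begin{align}
y_1 &= E[y_1 \mid x] + u_1 \tag{25.21} \\
&= \mu_1(x) + u_1 \tag{25.22}
\end{align}

and for the nonparticipants the equation is

\begin{align}
y_0 &= E[y_0 \mid x] + u_0 \tag{25.23} \\
&= \mu_0(x) + u_0. \tag{25.24}
\end{align}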


Note that this specification is of the switching regression type (analogous to the Roy
model discussed in Section 16.7) in the sense that the treated and nontreated have
different conditional mean functions, $\mu_1(x)$ and $\mu_0(x)$, that are written in a more
general notation than necessary for the purely linear model. We assume that $E[u_1 \mid x] =
E[u_0 \mid x] = 0$, though $E[u_1 \mid x, D]$ and $E[u_0 \mid x, D]$ do not necessarily equal zero.

A  more  common,  but  restrictive,  specification  has 

$$ \mu_1(x) = \mu_0(x) + \alpha D, \tag{25.25} $$

in which the treated group has an additional intercept component $\alpha$, but the slope
coefficients of the regressors are unaffected by the treatment.

Table 25.2. Treatment Effects Measures: ATE and ATET

Measure                                    Treatment Effect                                                             Special Case (25.25)
ATE given $x$                              $E[\Delta \mid x] = \mu_1(x) - \mu_0(x)$                                     $E[\Delta \mid x] = \alpha$
ATET with $x$ and selection effect         $E[\Delta \mid x, D = 1] = \mu_1(x) - \mu_0(x) + E[u_1 - u_0 \mid x, D = 1]$ $E[\Delta \mid x, D = 1] = \alpha + E[u_1 - u_0 \mid x, D = 1]$
Additional benefit to individual with $x$  $E[u_1 - u_0 \mid x, D = 1]$                                                 $E[u_1 - u_0 \mid x, D = 1]$
Average selection bias                     $E[u_0 \mid x, D = 1] - E[u_0 \mid x, D = 0]$                                $E[u_0 \mid x, D = 1] - E[u_0 \mid x, D = 0]$

The  observed  outcome  y  is  written  as 


$$ y = D y_1 + (1 - D) y_0. \tag{25.26} $$

Combining these equations we get

\begin{align}
y &= D(\mu_1(x) + u_1) + (1 - D)(\mu_0(x) + u_0) \notag \\
&= \mu_0(x) + D(\mu_1(x) - \mu_0(x) + u_1 - u_0) + u_0. \tag{25.27}
\end{align}

Because $D = 1$ or $0$, the second term in the regression “switches” on and off. The
second term in (25.27) measures the benefit of participation; the first component
$\mu_1(x) - \mu_0(x)$ measures the average gain to a participant with characteristics $x$ and
the second component $(u_1 - u_0)$ is the individual-specific benefit. The second component
may be observable by the participant, but not by the investigator.

The  expressions  for  ATE  and  ATET  are  given  in  Table  25.2,  for  the  general  case 
and  the  specialization  (25.25). 

Average selection bias is the difference between program participants and nonparticipants
in the base state. This effect cannot be attributed to the program. A special
case is $E[u_1 - u_0 \mid x, D = 1] = 0$, which can arise if there are no unobservable components
of the benefit or if the best individual estimate of $u_1 - u_0$ is zero.

Selection  bias  arises  when  the  treatment  variable  is  correlated  with  the  error  in  the 
outcome  equation.  This  correlation  could  be  induced  by  incorrectly  omitted  observable 
variables  that  partly  determine  D  and  y.  Then  the  omitted  variable  component  of  the 
regression  error  will  be  correlated  with  D  -  the  case  of  selection  on  observables. 
Another  source  comprises  unobserved  factors  that  partly  determine  both  D  and  y.  This 
is  the  case  of  selection  on  unobservables.  The  conditional  independence  assumption 
essentially  rules  out  confounding  caused  by  omitted  variables. 



25.3.3.  Selection  on  Observables 

In observational data the problem of selection on observables is solved using regression
and matching methods. Subsequent sections of this chapter present these methods
in detail. Before doing so, we note that the two-part model of Section 16.4 is an example,
and in this section we discuss a second straightforward method.

The control function estimator is motivated by the possibility that a set of observable
variables $z$ that determine $D$ may be correlated with the outcomes. For concreteness
let us consider the special case where the outcome equation is

$$ y_i = x_i'\beta + \alpha D_i + u_i \tag{25.28} $$

and the error is such that

$$ E[u_i \mid x_i, D_i] = E[u_i \mid x_i, D_i, z_i]. $$

In the case of selection on observables we may have $E[u_i \mid z_i] \neq 0$. Let us write

$$ E[y_i \mid x_i, D_i, z_i] = x_i'\beta + \alpha D_i + E[u_i \mid x_i, z_i], \tag{25.29} $$

which motivates the use of a control function estimator based on the OLS/GLS estimation
of the equation. The essential idea is to introduce into the outcome equation
all observable variables that could possibly be correlated with $u_i$ and then estimate the
resulting equation by least squares. Specifically,

$$ y_i = C_i'\delta + \alpha D_i + \{u_i - E[u_i \mid D_i, C_i]\}, \tag{25.30} $$

where $C_i$ includes all variables that are included in either $x$ or $z$. The presence of $z$ in
the regression expunges the possible correlation between $u$ and $z$. Note that if there is
selection on unobservables, caused by common unobservable factors that affect both
$D$ and $u$, then we still have a potential identification problem.

This estimator was used by Heckman and Hotz (1989), who also suggested a number
of variations on the basic control function estimators.
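
The control function idea can be sketched with a small simulation in which an observable $z$ drives both participation and the outcome error. The data-generating process, the true coefficient $\alpha = 2$, and the pure-Python least-squares helper `ols` are assumptions made for this illustration; they are not the specification used by Heckman and Hotz.

```python
import random

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y via Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(3)
short, full, y = [], [], []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)
    d = 1.0 if z + rng.gauss(0.0, 1.0) > 0 else 0.0   # z determines participation
    u = 0.8 * z + rng.gauss(0.0, 1.0)                 # so E[u | z] != 0
    y.append(1.0 + 0.5 * x + 2.0 * d + u)             # true alpha = 2 (assumed)
    short.append([1.0, x, d])                         # regressors without z
    full.append([1.0, x, d, z])                       # regressors C = (x, z) plus D

alpha_short = ols(short, y)[2]
alpha_full = ols(full, y)[2]
print(round(alpha_short, 2), round(alpha_full, 2))
```

Omitting $z$ overstates the treatment coefficient in this design, whereas including it brings the estimate close to the assumed value, because here the dependence of $u$ on $z$ is linear and fully observed.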


25.3.4.  Selection  on  Unobservables 

We now consider a special linear case in which the treatment participation decision
is endogenous. This is an example of a well-known class of models with an “endogenous
dummy variable.” The model is empirically very important when working
with observational data because in such cases there are several reasons for abandoning
the restrictive assumption $y_0, y_1 \perp D \mid x$ or $E[u \mid x, D] = 0$. The breakdown
of the conditional independence assumption implies that the simple least-squares
regression cannot identify the ATE, and an alternative identification strategy should be
pursued.

The essential elements of the identification strategy we are about to discuss are
common to other selection models. The approach involves fairly strong identifying
assumptions and is fully parametric. In the special case considered, the specification is
analogous to the Roy model. The conditional means in the outcome equations are taken
to be linear. The model is completed by adding a participation (binary) decision equation
for $D_i$. Then

\begin{align}
y_{1i} &= x_i'\beta_1 + u_{1i}, \tag{25.31} \\
y_{0i} &= x_i'\beta_0 + u_{0i}, \notag \\
D_i^* &= z_i'\gamma + \varepsilon_i, \notag
\end{align}


where $D_i^*$ is a latent variable such that

$$ D_i = \begin{cases} 1 & \text{iff } D_i^* > 0, \\ 0 & \text{iff } D_i^* \le 0, \end{cases} \tag{25.32} $$

and it is assumed that $E[u_1 \mid x, z] = E[u_0 \mid x, z] = 0$.

The variables $z$ may overlap with $x$, but it is assumed that at least one component of
$z$, denoted $z_1$, is unique and is a nontrivial determinant of $D$. That is, there is at least
one independent source of variation in $D$. Hence we may refer to $z_1$ as an instrumental
variable that is correlated with the endogenous variable $D$, but uncorrelated with the
outcomes $y_1$ and $y_0$, except through $D$.

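Next it is assumed that the triple $(u_{1i}, u_{0i}, \varepsilon_i)$ is jointly multivariate normally
distributed with zero mean and covariance matrix $\Sigma$ given by

$$ \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{10} & \sigma_{1\varepsilon} \\ \sigma_{10} & \sigma_{00} & \sigma_{0\varepsilon} \\ \sigma_{1\varepsilon} & \sigma_{0\varepsilon} & 1 \end{bmatrix}. \tag{25.33} $$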

The nonzero covariance parameters $\sigma_{1\varepsilon}$ and $\sigma_{0\varepsilon}$ reflect the endogeneity of the treatment
variable. The covariance parameter $\sigma_{10}$ reflects the covariance between the outcomes.
Because we never observe any individual in both states, this parameter cannot
be identified and is usually set to zero. The variance of $\varepsilon$ is restricted to 1 for identification.

Given such a fully parametric specification, the model can be estimated by maximum
likelihood or by a two-step semiparametric procedure. Most of these issues have
been discussed in Chapter 16. Leaving aside the estimation issue, we consider measures
of treatment impact.

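The benefit of participation, or the ATET, is given by

$$ y_{1i} - E[y_{0i} \mid D_i = 1] = y_{1i} - x_i'\beta_0 + \sigma_{0\varepsilon} \frac{\phi(z_i'\gamma)}{1 - \Phi(z_i'\gamma)}, \tag{25.34} $$

which may also be written as

$$ E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 1] = x_i'(\beta_1 - \beta_0) + (\sigma_{0\varepsilon} - \sigma_{1\varepsilon}) \frac{\phi(z_i'\gamma)}{\Phi(z_i'\gamma)}, \tag{25.35} $$

where the term $(\sigma_{0\varepsilon} - \sigma_{1\varepsilon})\,\phi(z_i'\gamma)/\Phi(z_i'\gamma)$ denotes the selection effect; see Section
16.7.1.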

In the special case in which $x'\beta_0 = x'\beta_1$, and the treatment dummy enters the $y_i$
equation linearly with coefficient $\alpha$, the mean impact of the program is given by

$$ E[y_i \mid D_i = 1] - E[y_i \mid D_i = 0] = \alpha + \text{selection term}. $$

In some sample situations this identification strategy may be somewhat fragile.
For example, the treated and untreated groups may be too different, the multivariate
normality assumption may be inappropriate, or the identifying instrumental variable
$z_1$ may be weak or possibly correlated with the error in the outcome equations.

These  considerations  motivate  the  use  of  alternative  estimation  methods  presented 
in  this  chapter.  These  estimators  generally  presume  selection  on  observables  only, 
though  Section  25.7  presents  IV  methods  applicable  when  selection  is  additionally  on 
unobservables. 


25.4.  Matching  and  Propensity  Score  Estimators 

In  observational  studies,  by  definition  there  are  no  experimental  controls.  Therefore, 
there  is  no  direct  counterpart  of  the  ATE  calculated  as  a  mean  difference  between  the 
outcomes  of  the  treated  and  nontreated  groups.  In  other  words,  the  counterfactual  is 
not  identified.  As  a  substitute  we  may  obtain  data  from  a  set  of  potential  comparison 
units  that  are  not  necessarily  drawn  from  the  same  population  as  the  treated  units,  but 
for  whom  the  observable  characteristics,  x,  match  those  of  the  treated  units  up  to  some 
selected  degree  of  closeness. 

The average outcome for the untreated matched group identifies the mean counterfactual
outcome for the treated group in the absence of the treatment. This approach
solves  the  evaluation  problem  by  assuming  that  selection  is  unrelated  to  the  untreated 
outcome,  conditional  on  x.  To  make  the  approach  operational  it  is  necessary  to  define 
the  matching  criteria. 


25.4.1.  Treatment  Effect  Assumptions 

Matching  estimators  of  treatment  effects  are  useful  when  selection  into  treatment  is  on 
observables  only.  In  addition  it  is  assumed  the  overlap  (or  support)  condition  (25.6) 
applies,  which  means  that  for  every  x  there  is  a  positive  probability  of  nonparticipation. 
This  ensures  that  we  have  untreated  matches  for  the  treated  observations  for  every 
x.  Roughly  speaking,  the  control  and  treated  populations  have  comparable  observed 
characteristics.  Generating  good  matches  means  ensuring  that  the  support  condition 
does  not  fail.  Further,  the  key  assumption  is  that  unobservable  variables  play  no  role 
in  the  treatment  assignment  and  outcome  determination. 

The regression estimator imputes the missing potential outcome using the estimated
regression function. If $D_i = 1$, $y_{0i}$ is imputed using the estimated conditional regression
function $\hat{\mu}_0(x_i)$. Matching estimators impute the missing value using the outcomes
of the “nearest neighbors”; the latter are defined by a suitable metric based on
some observable characteristics. This is the basis of the analogy between a matching
estimator and nonparametric methods based on the number of nearest neighbors, typically
just one. The matching estimator typically approximates the difference between
the means, and the variance of the estimator is estimated using many of the available
results on variance of differences between the means.

Matching is a persuasive and attractive methodology if (1) we can control for a
rich  set  of  x  variables,  (2)  there  are  many  potential  controls,  and  (3)  ATET  is  the 
parameter  of  interest.  It  also  requires  the  “no  general  equilibrium  effects”  assumption, 
or  stable  unit  treatment  value  assumption  (SUTVA),  which  implies  that  treatment 
does  not  indirectly  affect  untreated  observations.  The  matching  estimator  avoids  the 
assumption  that  the  treatment  effect  enters  the  conditional  mean  function  linearly.  The 
initial  step  of  establishing  the  nearest  matches  for  each  observation  will  also  clarify 
whether  comparable  control  observations  are  available.  Unlike  the  regression  approach 
there  is  less  danger  of  extrapolation  into  regions  outside  the  range  of  the  data. 

Suppose  the  treated  cases  are  matched  in  terms  of  all  observable  covariates.  In  a 
restricted  sense  all  differences  between  the  treated  and  untreated  groups  are  controlled. 
Given the outcomes $y_{1i}$ and $y_{0i}$ for the treatment and control, respectively, the average
treatment effect is

\begin{align}
& E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0] \tag{25.37} \\
&\quad = E[y_{1i} - y_{0i} \mid D_i = 1] + \{E[y_{0i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0]\}. \notag
\end{align}

The  first  term  in  the  second  line  is  the  ATET,  and  the  second  term  in  braces  is  a  “bias” 
term,  which  will  be  zero  if  the  assignment  to  the  treatment  and  control  is  random.  In 
this  case  all  that  is  necessary  to  estimate  the  ATET  is  a  simple  average  of  the  differen¬ 
tial  due  to  treatment. 
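
The decomposition in (25.37) holds exactly in any sample where both potential outcomes are known, which is possible only in a simulation. The sketch below assumes a constant treatment effect of 1 and a participation rule that depends on $y_0$, so the bias term is nonzero; the function name `decompose` and all numbers are hypothetical.

```python
import random
import statistics

def decompose(n=20_000, seed=2):
    """Return (naive difference, ATET, bias) as in (25.37) for simulated data."""
    rng = random.Random(seed)
    y1_treated, y0_treated, y0_control = [], [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        y1 = y0 + 1.0                             # constant effect of 1 (assumed)
        d = y0 + rng.gauss(0.0, 1.0) > 0.0        # high-y0 individuals select into treatment
        if d:
            y1_treated.append(y1)
            y0_treated.append(y0)                 # counterfactual, unobserved in real data
        else:
            y0_control.append(y0)
    naive = statistics.mean(y1_treated) - statistics.mean(y0_control)
    atet = statistics.mean([a - b for a, b in zip(y1_treated, y0_treated)])
    bias = statistics.mean(y0_treated) - statistics.mean(y0_control)
    return naive, atet, bias

naive, atet, bias = decompose()
print(round(naive, 2), round(atet, 2), round(bias, 2))
```

Here the naive comparison of means overstates the effect on the treated by the amount of the bias term; under random assignment the same code with `d` drawn independently of `y0` would give a bias near zero.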

More realistically the data will involve some observed covariates $x_i$. It is assumed
that the covariates include the determinants of selection into the
treatment group. If treated and nontreated groups are matched on each combination
of covariates, then the treatment differential can be easily calculated for each treated
case and each $x_i$. The average of the differential over all treated individuals and all $x_i$
measures the average treatment effect. Formally, in this case (see Angrist and Krueger,
2000, p. 1316) the effect of the treatment on the treated is given by

\begin{align}
E[y_{1i} - y_{0i} \mid D_i = 1] &= E[\{E[y_{1i} \mid x_i, D_i = 1] - E[y_{0i} \mid x_i, D_i = 0]\} \mid D_i = 1] \tag{25.38} \\
&= E[\Delta_x \mid D_i = 1], \notag
\end{align}

where $\Delta_x = E[y_{1i} \mid x_i, D_i = 1] - E[y_{0i} \mid x_i, D_i = 0]$.

If the $x$ variables are discrete, then the matching estimator is defined as a weighted
sum

$$ E[y_{1i} - y_{0i} \mid D_i = 1] = \sum_{x} \Delta_x \Pr[x_i = x \mid D_i = 1], \tag{25.39} $$

where $\Pr[x_i = x \mid D_i = 1]$ is the probability mass for $x_i$, given $D_i = 1$. Angrist and
Krueger (2000) discuss several aspects of this estimator.
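
A sample version of the weighted sum in (25.39) can be sketched as follows, with cell means replacing the conditional expectations and the share of treated cases in each cell replacing the probability mass. The function name `exact_matching_atet` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def exact_matching_atet(x, d, y):
    """Sample analogue of (25.39): cell differences weighted by the treated share in each cell."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for xi, di, yi in zip(x, d, y):
        cells[xi][di].append(yi)
    n_treated = sum(len(c[1]) for c in cells.values())
    atet = 0.0
    for cell, c in cells.items():
        if not c[1]:
            continue                      # no treated units: the cell has zero weight
        if not c[0]:
            raise ValueError(f"no untreated match for x = {cell!r}")  # overlap fails
        delta_x = mean(c[1]) - mean(c[0])           # estimate of Delta_x
        atet += delta_x * len(c[1]) / n_treated     # weight: Pr[x_i = x | D_i = 1]
    return atet

# Toy data (assumed): two cells, "a" and "b".
x = ["a", "a", "a", "a", "b", "b", "b"]
d = [1, 1, 0, 0, 1, 0, 0]
y = [5, 7, 3, 5, 10, 6, 8]
estimate = exact_matching_atet(x, d, y)
print(estimate)  # (2/3)*2 + (1/3)*3 = 7/3
```

In the toy data the cell differences are 2 and 3, and two of the three treated cases fall in cell "a", which gives the weighted sum 7/3.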


25.4.2.  Exact  Matching 

The  procedure  is  to  match  treated  and  untreated  individuals  on  their  observable  char¬ 
acteristics  x. 

Exact  matching  is  practicable  when  the  vector  of  covariates  is  discrete  and  the 
sample contains many observations at each distinct value of $x_i$.

If the covariate vector has a high dimension, or if continuous variations among some
covariates are present, then exact matching between treated and nontreated groups
becomes impractical. This problem motivates inexact matching methods. Inexact
matching works by mapping $x$ into a lower dimensional measure, continuous or discrete,
usually a scalar $f(x)$ that forms the basis of matching.


25.4.3.  Propensity  Scores 

The  method  of  propensity  scores  (Rosenbaum  and  Rubin,  1983)  is  a  popular  inexact 
matching  method.  Rather  than  match  on  the  regressors  it  matches  on  the  propensity 
score.  Even  here  an  exact  match  is  not  possible,  so  the  comparison  units  are  those 
whose  propensity  scores  are  sufficiently  close  to  the  treated  unit. 

The  propensity  score,  the  conditional  probability  of  receiving  treatment  given  x, 
denoted  p(x),  was  suggested  by  Rosenbaum  and  Rubin  (1983)  as  a  matching  measure. 
As  noted  in  Section  25.2.5,  if  the  data  justify  matching  on  x,  then  matching  based  on 
propensity  score  is  also  justified. 

The  propensity  score  is  usually  estimated  using  a  parametric  model  such  as  a  logit 
or  probit  but  can  in  principle  be  estimated  using  nonparametric  methods. 


Matching  Using  Propensity  Scores 

In the method of propensity scores one controls for the covariates by controlling for
a particular function of the covariates, specifically the conditional probability of
treatment, $\Pr[D_i = 1 \mid x_i]$. That is, matching is on the propensity score. This can be
easily calculated by (for example) a logit regression. Moreover, one can also control for
lagged variables by including them in the vector of covariates. If selection bias is
eliminated by controlling for $x_i$, it is also eliminated by controlling for the propensity score.
Conditioning on the propensity score is often simpler than conditioning on a large
dimensional vector x. Dehejia and Wahba (1998) provide an empirical illustration based
on the data previously used by Lalonde (1986).
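
As a rough illustration of how propensity scores might be generated by a logit regression, the sketch below fits Pr[D = 1 | x] for a single scalar covariate by Newton-Raphson, using only the standard library. The function names `fit_logit` and `propensity`, the restriction to one covariate plus an intercept, the fixed number of iterations, and the toy data are assumptions made for this example; in applied work a statistical package and a richer specification would normally be used.

```python
import math


def propensity(b0, b1, x_i):
    """Logit propensity score Pr[D = 1 | x] = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x_i)))


def fit_logit(x, d, iters=25):
    """Maximum likelihood logit with intercept and one covariate (Newton-Raphson)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x_i, d_i in zip(x, d):
            p = propensity(b0, b1, x_i)
            w = p * (1.0 - p)
            # Score (gradient of the log-likelihood).
            g0 += d_i - p
            g1 += (d_i - p) * x_i
            # Negative Hessian.
            h00 += w
            h01 += w * x_i
            h11 += w * x_i * x_i
        det = h00 * h11 - h01 * h01
        # Newton step: b <- b + H^{-1} g, with the 2 x 2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: treatment becomes more likely as x rises, without perfect separation.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
d = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logit(x, d)
scores = [propensity(b0, b1, x_i) for x_i in x]
print(b0, b1, scores)
```

The fitted scores, not the coefficients, are the objects of interest for matching. Note that the iteration fails if the data are perfectly separated, which corresponds to a violation of the overlap condition.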


Implementation  Issues 

Propensity  score  methods  call  for  a  good  model  to  generate  the  scores.  Our  interest 
is  in  estimating  consistently  the  participation  probability  rather  than  the  estimates  of 
parameters  in  the  propensity  score  function.  A  better  statistical  fit  for  the  propensity 
score  is  more  likely  to  result  from  a  flexible  parametric  or  nonparametric  model. 

In implementing matching based on $p(x_i)$ three issues are relevant: (1) whether to
match with or without replacement, (2) the number of units to use in the comparison
set, and (3) the choice of the matching method.

Matching without replacement means that any observation in the comparison group
is matched to no more than one treated observation, that which is the closest match,
whereas with replacement means that there can be multiple matches. When matching is
without replacement and the comparison set is small, the matches may not be very
close in terms of p(x), which will increase the bias of the estimator.

The issue of choosing the number of cases in the comparison set involves a trade-off
between bias and variance. By using a single closest match to a treated case,
one reduces the bias, but by including more matched controls, the variance is
reduced whereas bias increases if the additional observations are inferior matches for the
treated observations. A partial solution is to use a predefined neighborhood in terms
of a radius around the p(x) of the treated observation and to exclude matches that lie
outside this neighborhood. In other words, one only uses the better matches. This is
called “caliper matching.”
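
The choices just described (replacement, a single closest match, and a caliper) can be made concrete with a small sketch. It assumes that propensity scores have already been estimated; the function name `caliper_nn_atet`, the caliper value, and the data are hypothetical and serve only to illustrate the mechanics.

```python
def caliper_nn_atet(p_treated, y_treated, p_control, y_control, caliper):
    """Single nearest-neighbor matching on the propensity score, with replacement.

    Treated units whose closest control lies outside the caliper are dropped.
    Returns the ATET estimate and the number of treated units actually used.
    """
    diffs = []
    for p_i, y_i in zip(p_treated, y_treated):
        # Index of the control whose score is closest to p_i; controls may be reused.
        j = min(range(len(p_control)), key=lambda k: abs(p_i - p_control[k]))
        if abs(p_i - p_control[j]) <= caliper:
            diffs.append(y_i - y_control[j])
    if not diffs:
        raise ValueError("no treated unit has a control within the caliper")
    return sum(diffs) / len(diffs), len(diffs)


estimate, n_used = caliper_nn_atet(
    p_treated=[0.30, 0.60, 0.95], y_treated=[10.0, 12.0, 20.0],
    p_control=[0.25, 0.55, 0.70], y_control=[8.0, 9.0, 11.0],
    caliper=0.10,
)
print(estimate, n_used)
```

Here the third treated unit (score 0.95) has no control within 0.10 and is excluded, so the estimate averages over the two remaining treated units. A wider caliper would retain it at the cost of a poorer match, which is the bias-variance trade-off described above.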

Heckman et al. (1997, 1998) study the performance of matching estimators using
experimental data from the Job Training Partnership Act (JTPA) combined with
samples of comparison groups from three sources. Data quality plays a key role in robust
estimation of treatment effects by matching methods. The results are best when the
data sources and definitions are comparable for treated and nontreated groups, when
the treated and nontreated come from the same labor market, and when the propensity
score can be modeled using a rich set of regressors.

The  issue  of  the  sensitivity  of  the  results  to  the  chosen  method  is  not  amenable  to 
a  simple  direct  answer.  The  outcome  may  vary  across  different  samples,  depending 
on  the  extent  of  overlap  between  the  treated  and  untreated  observations.  If  the  two 
groups  are  similar  in  the  sense  that  there  is  a  substantial  overlap  in  their  propensity 
scores,  and  if  the  comparison  group  is  large,  then  the  matches  will  be  easier  to  find 
and  matching  with  replacement  will  be  feasible.  If  the  comparison  group  is  small  and 
disparate  from  the  treated  group,  then  one  may  run  out  of  satisfactory  matches  and  be 
unable  to  use  the  full  treated  sample,  this  being  especially  likely  if  matching  is  without 
replacement. 

The application of Dehejia and Wahba (2002) to the National Supported
Work Program data provides an instructive illustration. We examine and
illustrate the issues of implementation in Section 25.8 using the Dehejia and Wahba
data set.


25.4.4.  Measuring  Treatment  Effects 

Denote the comparison group for the treated case i with characteristics $x_i$ as the set
$A_i(x) = \{j \mid x_j \in c(x_i)\}$, where $c(x_i)$ is the characteristics neighborhood of $x_i$. Let $N_C$
denote the number of cases in the comparison group and let $w(i, j)$ denote the weight
given to the jth case in making a comparison with the ith treated case, $\sum_j w(i, j) = 1$.
Then a general formula for the matching ATET estimator is

\[
\Delta^M = \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1,i} - \sum_{j} w(i, j)\, y_{0,j} \Big],
\tag{25.40}
\]

where $0 < w(i, j) \le 1$, and $\{D = 1\}$ is the set of treated individuals, and j is an
element of the set of matched comparison units. Different matching estimators are
generated by varying the choice of $w(i, j)$.


Matching  Methods


Simple matching compares cells with exactly the same discrete x,

\[
\Delta^M = \sum_{k} w_k \left[ \bar{y}_{1,k} - \bar{y}_{0,k} \right],
\tag{25.41}
\]

where $\bar{y}_{1,k}$ is the mean outcome of the treated and $\bar{y}_{0,k}$ is the mean outcome of the
untreated and $w_k$ is the weight of the kth cell (i.e., the fraction of observations in cell k).
A specific example (Dehejia and Wahba, 2002) is

\[
\Delta^M = \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1,i} - \frac{1}{N_{C,i}} \sum_{j \in \{D = 0\}} y_{0,j} \Big],
\tag{25.42}
\]

where $N_T$ is the number in the treated group (D = 1) and $N_{C,i}$ is the number in the
comparison group corresponding to the ith observation.

The nearest-neighbor matching method chooses, for every treated individual i, the
set $A_i(x) = \{ j \mid \min_j \| x_i - x_j \| \}$, where $\| \cdot \|$ denotes the Euclidean distance between
vectors. If $w(i, j) = 1$ in (25.40) when $j \in A_i(x)$, and zero otherwise, then this
specification uses only one case to construct the comparison group for the treated cases.

Another estimator is generated by kernel matching in which

\[
w(i, j) = \frac{K(x_j - x_i)}{\sum_{j=1}^{N_{C,i}} K(x_j - x_i)},
\]

where K is a kernel discussed in Section 9.3.
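
A minimal sketch of kernel matching follows, plugging kernel weights into the general formula (25.40) for a scalar matching variable. The Gaussian kernel, the explicit bandwidth argument, the function name `kernel_matching_atet`, and the data are assumptions introduced here; the text leaves the kernel and any bandwidth unspecified.

```python
import math


def kernel_matching_atet(x_t, y_t, x_c, y_c, bandwidth):
    """Matching ATET as in (25.40), with weights w(i, j) proportional to K((x_j - x_i) / h)."""
    total = 0.0
    for x_i, y_i in zip(x_t, y_t):
        # Unnormalized Gaussian kernel weights for every control case.
        k = [math.exp(-0.5 * ((x_j - x_i) / bandwidth) ** 2) for x_j in x_c]
        s = sum(k)
        if s == 0.0:
            raise ValueError("all kernel weights underflowed; increase the bandwidth")
        # Weighted control outcome; the normalized weights sum to one.
        counterfactual = sum(k_j * y_j for k_j, y_j in zip(k, y_c)) / s
        total += y_i - counterfactual
    return total / len(x_t)


# Hypothetical data: one treated case at x = 0 and two equidistant controls.
print(kernel_matching_atet([0.0], [5.0], [-1.0, 1.0], [2.0, 4.0], bandwidth=1.0))
```

Because the two controls are equally distant from the treated case they receive equal weight, so the counterfactual is their simple average. As the bandwidth shrinks, the weights concentrate on the closest controls and the estimator approaches nearest-neighbor matching.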

These  methods  share  the  advantage  that  they  avoid  functional  form  assumptions  for 
the  outcome  equations  in  estimating  ATET  and  can  estimate  it  at  specific  values  of  x. 
They  have  the  disadvantage  that  if  x  is  high  dimensional  then  the  number  of  matches 
can  become  very  small.  In  such  cases  matching  based  on  a  scalar-valued  metric  has 
attractions.  Propensity  score  matching,  discussed  previously,  is  such  a  method. 

Nearest-neighbor and kernel matching can be defined in terms of propensity scores
also. For example, for nearest-neighbor matching we can define the matching set as
$A_i(p(x)) = \{ p_j \mid \min_j \| p_i - p_j \| \}$.

Stratification  or  interval  matching  is  based  on  the  idea  of  dividing  the  range  of 
variation  of  the  propensity  score  in  intervals  such  that  within  each  interval  the  treated 
and  control  units  have,  on  the  average,  the  same  propensity  score.  One  can  use  the 
same  blocks  identified  by  the  algorithm  used  for  computing  the  propensity  scores. 
Then  we  compute  the  difference  between  the  average  outcomes  of  the  treated  and 
the  control  groups.  ATET  is  the  weighted  average  of  these  differences,  with  weights 
being  determined  by  the  distribution  of  the  treated  units  across  the  blocks.  One  of  the 
disadvantages  of  this  method  is  that  it  discards  observations  in  blocks  in  which  either 
treated  or  control  units  are  absent. 

Denote by b the blocks defined over intervals of propensity score. Then the
treatment effect within the bth block is defined as

\[
\mathrm{ATET}_b^S = (N_b^T)^{-1} \sum_{i \in I(b)} y_{1i} - (N_b^C)^{-1} \sum_{j \in I(b)} y_{0j},
\]

where $I(b)$ is the set of units in block b, $N_b^T$ is the number of treated units in the bth
block, and $N_b^C$ is the number of control units in the bth block. Then the treatment effect
based on stratification is defined as

\[
\mathrm{ATET}^S = \sum_{b=1}^{B} \mathrm{ATET}_b^S \left[ \frac{\sum_{i \in I(b)} D_i}{\sum_{\forall i} D_i} \right],
\tag{25.43}
\]

where the term in brackets is the weight for each block given by the corresponding
fraction of treated units and where B is the total number of blocks.
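
The stratification estimator in (25.43) can be sketched as follows. The block boundaries (`cutpoints`), the function name `stratified_atet`, and the data are hypothetical. One further assumption is made explicit in the code: blocks lacking either treated or control units are discarded, and the weights are computed over the treated units in the remaining blocks so that they sum to one.

```python
def stratified_atet(p, d, y, cutpoints):
    """Stratification (interval) matching on the propensity score, as in (25.43)."""
    def block(p_i):
        # Block index = number of interior cutpoints at or below the score.
        return sum(p_i >= c for c in cutpoints)

    blocks = {}
    for p_i, d_i, y_i in zip(p, d, y):
        cell = blocks.setdefault(block(p_i), {0: [], 1: []})
        cell[d_i].append(y_i)

    # Assumption: blocks with no treated or no control units are dropped.
    usable = [c for c in blocks.values() if c[0] and c[1]]
    n_treated = sum(len(c[1]) for c in usable)

    atet = 0.0
    for c in usable:
        within = sum(c[1]) / len(c[1]) - sum(c[0]) / len(c[0])  # ATET within the block
        atet += within * len(c[1]) / n_treated                  # weight: share of treated
    return atet


# Hypothetical data with two blocks split at a score of 0.5.
p = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
d = [0, 1, 0, 1, 0, 1]
y = [1.0, 3.0, 1.0, 6.0, 2.0, 4.0]
print(stratified_atet(p, d, y, cutpoints=[0.5]))
```

In this example the within-block effects are 2 and 3, with one and two treated units respectively, so the second block receives twice the weight of the first.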

In radius matching the set $A_i(p(x)) = \{ p_j \mid \| p_i - p_j \| < r \}$ is based on
propensity scores. This means that all control cases with estimated propensity scores falling
within radius r are matched to the ith treated case.

We can express ATE and ATET in terms of p(x), assuming the overlap condition
$0 < p(x) < 1$. The two key results are

\[
\mathrm{ATE} = E\left[ \frac{(D - p(x))\, y}{p(x)(1 - p(x))} \right],
\tag{25.44}
\]

\[
\mathrm{ATET} = E\left[ \frac{(D - p(x))\, y}{\Pr[D = 1]\,(1 - p(x))} \right];
\tag{25.45}
\]

the last result is due to Dehejia (1997).

The derivations of these results are as follows:

\[
\begin{aligned}
y &= (1 - D) y_0 + D y_1 \\
  &= y_0 + D(y_1 - y_0), \\
(D - p(x))\, y &= (D - p(x))(y_0 + D(y_1 - y_0)) \\
  &= D y_1 - p(x) y_0 - D p(x) y_1 + D p(x) y_0 \\
  &= D y_1 - p(x)(1 - D) y_0 - D p(x) y_1.
\end{aligned}
\tag{25.46}
\]

Next, taking expectations and noting that $E[D \mid x] = p(x)$ we get

\[
\begin{aligned}
E[(D - p(x))\, y \mid x] &= p(x) E[y_1] - p(x)(1 - p(x)) E[y_0] - p^2(x) E[y_1] \\
  &= p(x) E[y_1 - p(x) y_1] - p(x)(1 - p(x)) E[y_0] \\
  &= p(x)(1 - p(x)) E[y_1 - y_0],
\end{aligned}
\tag{25.47}
\]

whence it follows that

\[
\mathrm{ATE} = E[y_1 - y_0] = E\left[ \frac{(D - p(x))\, y}{p(x)(1 - p(x))} \right].
\]

To derive the Dehejia result, we have

\[
\begin{aligned}
E\left[ \frac{(D - p(x))\, y}{1 - p(x)} \right] &= E[p(x) E[\mu_1(x) - \mu_0(x)]] \\
  &= E[D(y_1 - y_0)] \\
  &= E[D(y_1 - y_0) \mid D = 1] \Pr[D = 1],
\end{aligned}
\tag{25.48}
\]

where the first line follows from (25.47), the second line is implied by the conditional
independence assumption, and the last line expresses joint expectation as a product of
marginal and conditional expectations, which implies

\[
\mathrm{ATET} = \frac{E[D(y_1 - y_0)]}{\Pr[D = 1]}.
\]

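Using (25.44) and (25.45), consistent estimators, based on a sample of size N, are

\[
\widehat{\mathrm{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{(D_i - p(x_i))\, y_i}{p(x_i)(1 - p(x_i))} \right],
\tag{25.49}
\]

\[
\widehat{\mathrm{ATET}} = \left( \frac{1}{N} \sum_{i=1}^{N} D_i \right)^{-1} \frac{1}{N} \sum_{i=1}^{N} \frac{(D_i - p(x_i))\, y_i}{1 - p(x_i)},
\tag{25.50}
\]

where $(N^{-1} \sum_i D_i)$ is a consistent estimator of $\Pr[D = 1]$.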
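
The sample analogues (25.49) and (25.50) translate directly into code. The sketch below takes the propensity scores as given (known or previously estimated); the function names `ipw_ate` and `ipw_atet` and the data are assumptions for illustration.

```python
def ipw_ate(d, y, p):
    """Sample analogue (25.49): average of (D - p) y / (p (1 - p))."""
    n = len(y)
    return sum((d_i - p_i) * y_i / (p_i * (1.0 - p_i)) for d_i, y_i, p_i in zip(d, y, p)) / n


def ipw_atet(d, y, p):
    """Sample analogue (25.50): average of (D - p) y / (1 - p), scaled by the treated share."""
    n = len(y)
    share_treated = sum(d) / n  # consistent estimator of Pr[D = 1]
    weighted = sum((d_i - p_i) * y_i / (1.0 - p_i) for d_i, y_i, p_i in zip(d, y, p)) / n
    return weighted / share_treated


# Hypothetical data with a constant propensity score of 0.5, as in a fair randomization.
d = [1, 1, 0, 0]
y = [3.0, 5.0, 1.0, 2.0]
p = [0.5, 0.5, 0.5, 0.5]
print(ipw_ate(d, y, p), ipw_atet(d, y, p))
```

With a constant score and equal group sizes both expressions collapse to the simple difference between treated and control means, which provides a quick check of the formulas. Both estimators require scores strictly between zero and one, in line with the overlap condition.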


25.4.5.  Variance  of  ATET  Based  on  x  and  p(x) 

Under identifiability assumptions given in Section 25.2, $\Delta_x$ and $\Delta_{p(x)}$ are defined as

\[
\begin{aligned}
\Delta_x &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \big[ y_{1i} - E[y_0 \mid D = 0, x = x_i] \big] \\
  &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1i} - \sum_{j \in A_i(x)} w_{ij}\, y_{0j} \Big]
\end{aligned}
\]

and

\[
\begin{aligned}
\Delta_{p(x)} &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \big[ y_{1i} - E[y_0 \mid D = 0, p(x) = p(x_i)] \big] \\
  &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1i} - \sum_{j \in A_i(p(x))} w_{ij}\, y_{0j} \Big],
\end{aligned}
\]

where i is the subscript for the treated group, $w_{ij} = 1/N_{C,i}$, and $N_{C,i}$ is the number of
cases in the comparison group for the ith treated case. Both are consistent estimators of
ATET, $E[y_1 - y_0 \mid D = 1, x]$, the first based on x, and the second on p(x). A practical
issue is whether adjusting for differences by propensity score is better in terms of
efficiency than adjusting for differences using x. Hahn (1998), Heckman et al. (1998),
and others have shown that there is no unambiguous ranking of the two estimators in
terms of their asymptotic variance, even if we assume that $p(x_i)$ is known, which in
practice will not be the case in observational studies.

Write the asymptotic variances for the two cases as follows:

\[
V[\Delta_x] = E[V[y_1 \mid D = 1, x] \mid D = 1] + V[E[y_1 - y_0 \mid D = 1, x] \mid D = 1],
\]

\[
V[\Delta_{p(x)}] = E[V[y_1 \mid D = 1, p(x)] \mid D = 1] + V[E[y_1 - y_0 \mid D = 1, p(x)] \mid D = 1],
\]

where we use the variance decomposition result given in Section A.8. In general x is a
better predictor than p(x), which implies that

\[
E[V[y_1 \mid D = 1, x] \mid D = 1] \le E[V[y_1 \mid D = 1, p(x)] \mid D = 1],
\]

\[
V[E[y_1 - y_0 \mid D = 1, x] \mid D = 1] \ge V[E[y_1 - y_0 \mid D = 1, p(x)] \mid D = 1],
\]

because conditioning on x loses less information than conditioning on p(x), which is
a particular function of x. Thus the second comparison favors the propensity score
method whereas the first term comparison favors the use of x over p(x).

A  helpful  practical  guide  and  computer  programs  for  implementing  the  calculations 
of  ATET  are  provided  by  Becker  and  Ichino  (2002). 


25.5.  Differences-in-Differences  Estimators 

Chapters  2  and  3  discussed  the  setting  of  a  natural  experiment  or  a  quasi-experiment 
in  which  a  treatment  variable  undergoes  a  change  that  can  be  viewed  as  an  exogenous 
variation  in  a  treatment  variable.  The  treated  group  can  be  compared  to  an  untreated 
comparison  group. 

In some cases one has data on the treated and the comparison (control) groups
both before and after the experiment. Then for the ith treated case the change in
the outcome is measured by $[y_{ia} - y_{ib} \mid D_{ia} = 1]$ and that for the untreated group
is measured by $[y_{ia} - y_{ib} \mid D_{ia} = 0]$. Then the differences-in-differences measure
$[y_{ia} - y_{ib} \mid D_{ia} = 1] - [y_{ia} - y_{ib} \mid D_{ia} = 0]$, where subscripts a and b denote “after”
and “before” the experiment occurs, forms the basis of an estimate of the treatment
effect. This method has been introduced in Sections 3.4.2 and 22.6.

Consider a model with a fixed effect $\phi_i$ and a drift term $\delta_t$, where the pre-treatment
and post-treatment outcomes are given by, respectively,

\[
y_{it,0} = \phi_i + \delta_t + \varepsilon_{it},
\tag{25.51}
\]

\[
y_{it,1} = y_{it,0} + \alpha,
\tag{25.52}
\]

so that

\[
\begin{aligned}
y_{it} &= (1 - D_{it})\, y_{it,0} + D_{it}\, y_{it,1} \\
  &= \phi_i + \delta_t + \alpha D_{it} + \varepsilon_{it}.
\end{aligned}
\tag{25.53}
\]

The preceding equations are for t = a, b; (25.51) is for the group that did not get
treated and (25.52) is for the group that did get treated. Using the “before” and “after”
formulation, we obtain the treatment effect

\[
\begin{aligned}
\alpha &= E[y_{ia} - y_{ib} \mid D_{ia} = 1] - E[y_{ia} - y_{ib} \mid D_{ia} = 0] \\
  &= \{E[y_{ia} \mid D_{ia} = 1] - E[y_{ia} \mid D_{ia} = 0]\} - \{E[y_{ib} \mid D_{ia} = 1] - E[y_{ib} \mid D_{ia} = 0]\},
\end{aligned}
\tag{25.54}
\]

where the differencing step eliminates the fixed effect $\phi_i$ and the drift $\delta_t$.
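
The sample counterpart of (25.54) replaces the expectations by group means. The sketch below generates noise-free data from (25.53) so that the estimate can be checked by hand; the function name `diff_in_diff` and all numerical values are hypothetical.

```python
def diff_in_diff(y_before, y_after, treated):
    """Differences-in-differences: mean change for treated minus mean change for controls."""
    def mean(values):
        return sum(values) / len(values)

    change_treated = mean([a - b for a, b, t in zip(y_after, y_before, treated) if t == 1])
    change_control = mean([a - b for a, b, t in zip(y_after, y_before, treated) if t == 0])
    return change_treated - change_control


# Data generated from (25.53) with fixed effects phi_i, drift 0.5, alpha = 2, and no noise.
phi = [1.0, 2.0, 3.0, 4.0]
treated = [1, 1, 0, 0]
y_before = [f + 0.0 for f in phi]
y_after = [f + 0.5 + 2.0 * t for f, t in zip(phi, treated)]
print(diff_in_diff(y_before, y_after, treated))
```

Taking the change within each unit removes the fixed effects, and subtracting the control-group change removes the common drift, leaving the treatment effect of 2.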

There  are  alternatives  to  taking  differences.  One  alternative  is  to  control  directly  for 
pretreatment  outcome  difference  between  treatment  and  control  groups  by  regression. 
For example, replace $\phi_i$ in (25.51) by $x_i'\beta + \gamma y_{ib}$, to obtain

\[
\begin{aligned}
y_{ia,0} &= x_i'\beta + \gamma y_{ib} + \delta_a + \varepsilon_{ia}, \\
y_{ia,1} &= x_i'\beta + \gamma y_{ib} + \delta_a + \alpha D_{ia} + \varepsilon_{ia}.
\end{aligned}
\tag{25.55}
\]

Estimates of $\alpha$ are constructed by regressing posttreatment outcomes on a constant,
pretreatment outcomes, $x_i$, and $D_{ia}$. The interpretation of $\alpha$ as a causal parameter relies
on the assumption that after controlling for $x_i$ and $y_{ib}$, the treatment effect completely
accounts for the posttreatment difference between the treated and control groups. The
fixed effect is given a linear functional form, whereas a matching strategy can be based
on weaker assumptions.

Our previous results could actually be based on quasi-experimental data. For
example, compare people in one state with one law with those in a different state with
a different law, and use control functions for the state effects. The new element is the
addition of data before the experiment. By the assumption that the two states have the
same drift term, we can use the differences-in-differences method to eliminate the state
effects for which otherwise we would need control functions.


25.6.  Regression  Discontinuity  Design 

Identification of the treatment effect can sometimes be facilitated by either a
natural experiment or using data generated in a quasi-experimental setting.
Regression-discontinuity (RD) design is an example of a quasi-experimental design in which the
probability of receiving a treatment is a discontinuous function of one or more
underlying variables. Such a design can arise in circumstances where a treatment is triggered
by an administrative or organizational rule. For example, Angrist and Lavy (1999)
study the effect of class size on student test scores, taking advantage of the data
generated under the operation of “Maimonides Rule,” which stipulates that the class be split
when it reaches a specific threshold size. Van der Klaauw (2003) estimates the effect of
financial aid offers on the student’s decision to attend a college, exploiting the
identifying information provided by a discontinuity in the administrative rule that relates the
aid to the student’s SAT score and the grade point average. These econometric
applications are predated by Thistlethwaite and Campbell (1960), who analyzed the impact
of student scholarships on career aspirations, exploiting the fact that the awards are
made only when the student’s test score exceeds a threshold; see also Trochim (1984).
The treatment here follows Van der Klaauw (2003).


25.6.1.  Discontinuous  Treatment  Assignment  Mechanism 

In the case of an RD design, there is additional information about the selection rule:
It is known that the treatment assignment mechanism depends (at least in part) on the
value of an observed continuous variable relative to a given threshold, or cutoff score,
in such a way that the corresponding probability of getting treated (propensity score)
is a discontinuous function of this variable at the cutoff score. Figure 25.1 illustrates a
sample generated by the RD design.

Figure 25.1: Regression-discontinuity design: example.

In the simplest RD design, called the sharp RD design, individuals are assigned to
treatment and control groups solely on the basis of an observed continuous measure S,
called the selection or assignment variable. Those falling below the distinct cutoff $\bar{S}$
do not receive treatment and constitute the control group whereas those that are above
the cutoff receive treatment (D = 1). That is, the treatment assignment occurs through
a known and measured deterministic decision rule: $D_i = 1[S_i > \bar{S}]$. In Figure 25.2 the
sharp RD design is shown as a solid line (see Van der Klaauw, 2003).

In the sharp RD design

\[
E[u \mid D, S] = E[u \mid S],
\tag{25.56}
\]

where u denotes the error in the outcome equation. Because S is the only systematic
determinant of D, S will capture any correlation between D and u.

With $D_i = D(S_i) = 1[S_i > \bar{S}]$, a dependence between $D_i$ and $u_i$ would make OLS
an inconsistent estimator of $\alpha$. As previously mentioned, one approach to estimating
the treatment effect in such a case is to specify and to include the conditional mean
function $E[u \mid D, S]$ as a “control function” in the outcome equation. Thus

\[
y_i = \beta + \alpha D_i + k(S_i) + \varepsilon_i,
\tag{25.57}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$. If k(S) is correctly specified, the regression will
consistently estimate $\alpha$.

If k(S) is linear then $\alpha$ will be estimated by the distance between the two linear
parallel regression lines at the cutoff point, which in this case equals the difference
between the two intercepts. It is an unbiased estimate of the common treatment effect
if the control function is linear.

In the more general case of varying treatment effects, the coefficient
of D represents $E[\alpha_i \mid \bar{S}]$, or local ATE (LATE) discussed in Section 25.7.1, and k(S)
is a specification of $E[u \mid S] + (E[\alpha_i \mid S] - E[\alpha_i \mid \bar{S}])\,1[S > \bar{S}]$, where $1[S > \bar{S}] = 1$ if
the condition in brackets is satisfied. Incorrect specification of k(S) leads to
inconsistency, and hence a semiparametric specification may be tried, for example,
$k(S) = \sum_{j=1}^{J} \eta_j S^j$, where J may be determined by a suitable method.

The variable S may be related to the outcome y, which would automatically cause
(y, D) to be related even when there is no causal link between the two variables. This
contrasts with random assignment that avoids such dependence.

Whereas random assignment makes treatment and control groups similar in
respects other than the receipt of treatment, the sharp RD design makes them
different, at least in terms of their S value. This violates the “strong ignorability”
assumption of Rosenbaum and Rubin (1983), which also requires the overlap condition,
$0 < \Pr[D = 1 \mid S] < 1$, whereas in the sharp RD design model $\Pr[D = 1 \mid S] \in \{0, 1\}$.


25.6.2.  Identification  and  Estimation  under  RD  Design 

The  main  intuition  is  that  the  sample  of  individuals  in  the  small  neighborhood  of  the 
cutoff  will  be  similar  to  a  randomized  experiment  at  the  cutoff  point  because  they 
have  essentially  the  same  S  value.  Those  just  below  the  cutoff  are  expected  to  be  very 
similar  to  those  just  above  it.  A  comparison  of  the  average  y  value  of  those  just  above 
and  those  just  below  the  cutoff  will  produce  an  estimate  of  the  average  treatment  effect. 

Increasing the interval around the cutoff will bias the estimate of the treatment
effect, especially if the assignment variable was itself related to the outcome variable
conditional on treatment status. If an assumption about the functional form of this
relationship can be made then one can “use more observations and extrapolate from
above and below the cutoff point to what a tie-breaking randomized experiment would
have shown. This double extrapolation, combined with exploitation of the
‘randomized experiment’ around the cutoff point, has been the main idea behind
regression-discontinuity analysis” (Van der Klaauw, 2003, p. 1258).
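
The intuition above, and the bias that arises as the interval widens, can be illustrated with a small sketch of the sharp RD design that compares mean outcomes just above and just below the cutoff. The function name `sharp_rd_local_means`, the window half-width `h`, and the noise-free data (outcome equal to S plus a treatment effect of 3) are assumptions made here.

```python
def sharp_rd_local_means(s, y, cutoff, h):
    """Difference in mean outcomes within h of the cutoff, with treatment D = 1[S > cutoff]."""
    above = [y_i for s_i, y_i in zip(s, y) if cutoff < s_i <= cutoff + h]
    below = [y_i for s_i, y_i in zip(s, y) if cutoff - h <= s_i <= cutoff]
    if not above or not below:
        raise ValueError("window contains no observations on one side of the cutoff")
    return sum(above) / len(above) - sum(below) / len(below)


# Hypothetical data: y = S + 3 * 1[S > 0], so the true treatment effect is 3.
s = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [s_i + 3.0 * (s_i > 0) for s_i in s]

narrow = sharp_rd_local_means(s, y, cutoff=0.0, h=0.5)
wide = sharp_rd_local_means(s, y, cutoff=0.0, h=2.0)
print(narrow, wide)
```

Because the outcome depends on S directly, even the narrow window overstates the effect (4 rather than 3), and the wide window overstates it further (16/3). This is the bias that a correctly specified control function k(S) is intended to remove.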

Observe that in this RD design,

\[
\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S] = \alpha + \lim_{S \downarrow \bar{S}} E[u \mid S] - \lim_{S \uparrow \bar{S}} E[u \mid S].
\tag{25.58}
\]

A more formal way of assuming that, in the absence of treatment, individuals
in a small interval around $\bar{S}$ would have similar average outcomes is to specify the
following:

Assumption A1. The conditional mean function $E[u \mid S]$ is continuous at $\bar{S}$.
Assumption A2. The mean treatment effect function $E[\alpha_i \mid S]$ is right continuous at $\bar{S}$:

\[
y_i = \beta + \alpha D_i + k(S_i) + \varepsilon_i,
\tag{25.59}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$.

Then the result in (25.58) follows.


25.6.3.  Fuzzy  RD  Design

Here the treatment assignment depends on the selection variable in a stochastic
manner. The propensity score $\Pr[D = 1 \mid S]$, viewed as a function of S, is known to have a
discontinuity at $\bar{S}$. A possible consequence of misassignment relative to the cutoff value is
a fuzzy design, with values of S near the cutoff point appearing both in the treatment
and control groups. Alternatively, the assignment may be based on additional variables
observed by the treatment administrator but unobserved by the program evaluator. So
relative to the sharp RD design, the fuzzy RD design selection depends on both
observables and nonobservables. In Figure 25.2 the fuzzy RD design is shown as a dashed
line.

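We can still exploit the discontinuity in the selection rule to identify the
treatment effect under assumption A1. If $E[u \mid S]$ is continuous at $\bar{S}$, then
$\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S] = \alpha [\lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S]]$.
Therefore, the treatment effect $\alpha$ is identified by

\[
\alpha = \frac{\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S]}{\lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S]},
\tag{25.60}
\]

where the denominator $\lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S] \neq 0$ because of the known
discontinuity of $E[D \mid S]$ at $\bar{S}$.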
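
A crude sample analogue of (25.60) replaces each one-sided limit by a mean over observations within a small window on the corresponding side of the cutoff. The sketch below is illustrative only: the function name `fuzzy_rd_ratio`, the window half-width `h`, and the data are assumptions, and in practice the one-sided limits would typically be estimated by local regression rather than simple means.

```python
def fuzzy_rd_ratio(s, d, y, cutoff, h):
    """Ratio of the jump in mean outcome to the jump in treatment probability at the cutoff."""
    def mean(values):
        return sum(values) / len(values)

    above = [(d_i, y_i) for s_i, d_i, y_i in zip(s, d, y) if cutoff < s_i <= cutoff + h]
    below = [(d_i, y_i) for s_i, d_i, y_i in zip(s, d, y) if cutoff - h <= s_i <= cutoff]
    if not above or not below:
        raise ValueError("window contains no observations on one side of the cutoff")

    jump_y = mean([y_i for _, y_i in above]) - mean([y_i for _, y_i in below])
    jump_d = mean([d_i for d_i, _ in above]) - mean([d_i for d_i, _ in below])
    if jump_d == 0:
        raise ValueError("no discontinuity in treatment probability within the window")
    return jump_y / jump_d


# Hypothetical fuzzy design: treatment probability jumps from 0.25 to 0.75 at S = 0,
# and y = 1 + 2 * D, so the true effect is 2.
s = [-0.2, -0.1, -0.1, -0.2, 0.1, 0.2, 0.1, 0.2]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [1.0 + 2.0 * d_i for d_i in d]
print(fuzzy_rd_ratio(s, d, y, cutoff=0.0, h=0.25))
```

The outcome jump of 1 divided by the treatment-probability jump of 0.5 recovers the effect of 2. In a sharp design the denominator equals one and the ratio reduces to the outcome jump alone.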

In  the  case  of  heterogeneous  treatment  responses  we  need  additional  assumptions. 

Assumption A2*. The average treatment effect function $E[\alpha_i \mid S]$ is continuous at $\bar{S}$.
Assumption A3. $D_i$ is independent of $\alpha_i$ conditional on S near $\bar{S}$:

\[
y_i = \beta + \alpha E[D_i \mid S_i] + k(S_i) + \varepsilon_i,
\tag{25.61}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$ and $k(S_i)$ is a specification of $E[u_i \mid S_i]$.


25.6.4.  A  Two-Stage  Estimator 

If $\mathrm{Cov}[D, u] \neq 0$, OLS regression will produce a biased estimate of $\alpha$. However, the
following can lead to a consistent estimator. Consider

\[
y_i = \beta + \alpha E[D_i \mid S_i] + k(S_i) + \varepsilon_i,
\tag{25.62}
\]

where $\varepsilon_i = y_i - E[y_i \mid S_i]$ and $k(S_i)$ is a specification of $E[u_i \mid S_i]$.

Stage 1: Specify propensity score function for a fuzzy RD design as

\[
E[D_i \mid S_i] = f(S_i) + \gamma\, 1[S_i > \bar{S}],
\tag{25.63}
\]

where $f(S_i)$ is some continuous function of S that is continuous at $\bar{S}$. By specifying
the functional form of f (or by estimating f semi- or nonparametrically) we can
estimate $\gamma$, the discontinuity in the propensity score function at $\bar{S}$.

Stage 2: The control function-augmented outcome equation is then estimated with $D_i$
replaced by the first-stage estimate of $E[D_i \mid S_i] = \Pr[D_i = 1 \mid S_i]$; this estimate will
be discontinuous in S whereas the included control function for k(S) would be
continuous in S at $\bar{S}$. Under correct specification of $f(S_i)$ and $k(S_i)$ the two-stage
procedure is consistent.

Figure 25.2: Regression Discontinuity Design; treatment assignment in sharp (solid) and
fuzzy (dashed) designs.


25.7.  Instrumental  Variable  Methods 

In recent years instrumental variable methods have been strongly advocated as an
alternative to MLE and other strongly parametric methods (Angrist, Imbens, and Rubin,
1996). Such an identification strategy is attractive in models with selection on
unobservables (see Section 25.3.4). In many applications such a model consists of a
linear equation for a continuous outcome variable whose conditional mean and
variance structure is specified, without any additional distributional assumptions. A
leading case has a continuous outcome dependent upon a vector of regressors x and a single
endogenous  treatment  (dummy)  variable  (D)  that  represents  the  decision  to  participate 
in  the  treatment.  This  equation  is  called  the  participation  or  selection  equation.  In  a 
more  general  setting,  one  may  have  a  limited  dependent  or  discrete  outcome  and  there 
may  be  multiple  treatment  variables. 

The  discussion  that  follows  overlaps  with  the  coverage  of  IV  estimation  in  several 
places  in  this  book  and  with  that  of  selection  models.  The  IV  approach  allows  us  to 
develop  another  “local”  variant  of  the  ATE  parameter. 


25.7.1.  Local  ATE  (LATE) 

We reconsider the simple linear formulation. The outcome equation is a linear function
of observable variables x and a participation indicator D:

\[
y_i = x_i'\beta + \alpha D_i + u_i,
\tag{25.64}
\]

and the participation decision depends on a single variable z, referred to as an
instrument,

\[
D_i^* = \gamma_0 + \gamma_1 z_i + v_i,
\tag{25.65}
\]

where $D_i^*$ is a latent variable with its observable counterpart $D_i$ generated by

\[
D_i =
\begin{cases}
0 & \text{if } D_i^* \le 0, \\
1 & \text{if } D_i^* > 0.
\end{cases}
\tag{25.66}
\]

There are two assumptions:

1. There is a variable z that appears in the equation for D that does not appear in the
equation for y. It may be continuous or discrete, and in a special case it is binary. The
exclusion of regressors x from the participation equation is a simplification. The
simultaneous presence of z in the participation equation and its exclusion from the outcome
equation is referred to as the exclusion restriction. This model structure is familiar from
Chapter 16 on selection models.

2. $\mathrm{Cov}[z, v] = \mathrm{Cov}[u, z] = \mathrm{Cov}[x, u] = 0$, and $\mathrm{Cov}[D, z] \neq 0$.

Together with the first assumption, this assumption implies, as previously emphasized,
that y depends on z only through D, and D depends on z in a nontrivial fashion. Hence
we use the notation D(z) to emphasize the dependence of D on z.

Under  these  assumptions  IV  estimation  of  (25.64)  yields  consistent  estimates  of 
(/3,  a).  Let  z!  =  z  +  <5,  8^0.  Then  noting  that  E[£)|x,  D  (z)]  =  Pr[ D  (z)  =  1]  and 
taking  expectations  we  obtain 

E[y|x,  D  (z)]  =  x'/3  +  aPr [D  (z)  =  1], 

E[y|x,  D  (z')]  =  x'/3  +  aPr[D  (z')  =  1], 
where,  after  subtraction,  we  have 


E[y|x,  z']  -  E[y|x,  z]  =  «  [Pr[D  (z')  =  1]  -  Pr [D  (z)  =  1]] . 


Solving  the  equation  for  a  yields  the  expression  for  the  local  average  treatment 
effect  (LATE),  analyzed  by  Imbens  and  Angrist  (1994): 


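\[
\begin{aligned}
\alpha_{\mathrm{LATE}} &= \frac{\mathrm{E}[y|\mathbf{x}, z'] - \mathrm{E}[y|\mathbf{x}, z]}{\Pr[D(z') = 1] - \Pr[D(z) = 1]} \\
&= \frac{\int_{R(\mathbf{x})} \left[ \mathrm{E}[y|\mathbf{x}, z'] - \mathrm{E}[y|\mathbf{x}, z] \right] dF(\mathbf{x}|\mathbf{x} \in R(\mathbf{x}))}{\int_{R(\mathbf{x})} \left[ \Pr[D(z') = 1] - \Pr[D(z) = 1] \right] dF(\mathbf{x}|\mathbf{x} \in R(\mathbf{x}))} \\
&= \frac{\mathrm{E}[y|z'] - \mathrm{E}[y|z]}{\Pr[D(z') = 1] - \Pr[D(z) = 1]},
\end{aligned}
\tag{25.67}
\]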


where the second line involves averaging over x, whose support is denoted by R(x).
This expression is well defined if \( \Pr[D(z') = 1] - \Pr[D(z) = 1] \neq 0 \). The sample
analogue of this expression is the ratio of the difference in mean outcomes at the two
values of the instrument to the change in the proportion treated owing to the change in z.
This estimator is an IV estimator. Using the results on the asymptotic normality of the
IV estimator, we can obtain confidence intervals for the LATE parameter.

The  qualifier  “local”  in  LATE  is  justified  because  it  measures  the  treatment  effect  on 
the  “compliers”  that  are  induced  to  participate  in  the  treatment  as  a  result  of  the  change 
in  z.  LATE  depends  on  the  particular  values  of  z  used  to  evaluate  the  treatment  and  on 
the  particular  instrument  chosen.  The  group  of  “movers”  may  not  be  representative  of 
the  whole  treated  population,  let  alone  the  whole  population.  Consequently,  the  LATE 
parameter  may  not  be  informative  about  the  consequences  of  large  policy  changes 
brought  about  by  changes  in  instruments  different  from  those  historically  observed. 

For a binary instrument the LATE and the IV estimates are equivalent, as shown in
Angrist et al. (1996, p. 447). If more than one instrument appears in the participation
equation, as when there exist overidentifying restrictions, the LATE parameter
estimated for each instrument will in general differ. However, a weighted average may be
constructed.
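
The equivalence for a binary instrument can be checked numerically. The following is a minimal sketch on simulated data, not the authors' computation: the data-generating values (intercept, instrument coefficient, error correlation, and a constant treatment effect of 2) are assumptions chosen only for illustration. It compares the Wald ratio of mean differences with the simple IV ratio of sample covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (all parameter values are assumptions).
z = rng.integers(0, 2, size=n)                # binary instrument
v = rng.normal(size=n)                        # participation-equation error
u = 0.5 * v + rng.normal(size=n)              # outcome error, correlated with v
d = (-0.2 + 1.0 * z + v > 0).astype(float)    # participation rule as in (25.65)-(25.66)
y = 1.0 + 2.0 * d + u                         # outcome with constant effect alpha = 2

# Wald form: difference in mean outcomes over difference in participation rates.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

# Simple IV form: sample Cov[y, z] / sample Cov[D, z].
iv = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

print(wald, iv)
```

The two numbers agree up to floating-point error because, with a binary z, each sample covariance equals the corresponding difference in group means times a common factor that cancels in the ratio.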

The foregoing analysis applies when the treatment effect does not vary with
individuals. If, however, the treatment effect is heterogeneous, then there is a potential for
confounding the variation induced by z: Is the observed variation due to \( z \)-differences
or \( \alpha \)-differences? Under heterogeneity the idiosyncratic component of the treatment
effect,

\[ u_{i,1} = u_{i,0} + D_i \left( \alpha_i(\mathbf{x}_i) - \alpha(\mathbf{x}_i) \right), \]

is a function of \( \alpha_i(\mathbf{x}_i) - \alpha(\mathbf{x}_i) \); see (25.27). Then the previous assumptions are not
enough to determine ATE or ATET. A solution to this difficulty is the addition of the
monotonicity assumption as an additional identifying condition. Essentially this says
that the instrument affects participation in a monotonic fashion, so that if on average
participation is more likely given Z = w than given Z = z, then anyone who would
participate given Z = z must also participate given Z = w.


25.7.2.  Relation  to  Other  Measures 

The IV estimator of \( \alpha \) is the same as what we would estimate by using a two-stage
least-squares procedure in which we first estimate the probability of receiving
treatment, E[D = 1|x, z], and then run a regression of the outcome y on x and the fitted
probability, assuming of course that the treatment effect is additive. Consider a special
case of the IV estimator in which x is a scalar and equals one, and z is a scalar dummy
variable that denotes eligibility to participate in the treatment; z = 1 implies eligibility
and z = 0 implies noneligibility.

We  can  partition  the  population  into  four  categories:  compliers  (C),  always-takers 
(A),  never-takers  (N),  and  defiers  (D).  Compliers  are  induced  to  receive  treatment  by 
being  eligible,  always-takers  will  receive  treatment  whether  or  not  they  are  eligible, 
never-takers  refuse  treatment  regardless  of  eligibility,  and  defiers  are  contrarians  who 
refuse  treatment  if  eligible  and  take  treatment  if  not.  Assume  that  there  are  no  defiers, 
so there are just three categories.

The Wald estimator of the treatment effect is defined by

\[ \mathrm{TE}_{\mathrm{Wald}} = \frac{\mathrm{E}[y_i|z_i = 1] - \mathrm{E}[y_i|z_i = 0]}{\mathrm{E}[D_i|z_i = 1] - \mathrm{E}[D_i|z_i = 0]}, \tag{25.68} \]


whose  numerator,  expressed  as  a  weighted  average  of  treatment  effects  on  the  three 
categories,  with  weights  equal  to  the  probability  of  being  in  each  category,  is 


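\[
\begin{aligned}
& \Pr[C] \{ \mathrm{E}[y_i|z_i = 1, C] - \mathrm{E}[y_i|z_i = 0, C] \} \\
& \quad + \Pr[A] \{ \mathrm{E}[y_i|z_i = 1, A] - \mathrm{E}[y_i|z_i = 0, A] \} \\
& \quad + \Pr[N] \{ \mathrm{E}[y_i|z_i = 1, N] - \mathrm{E}[y_i|z_i = 0, N] \} \\
& = \Pr[C] \{ \mathrm{E}[y_i|z_i = 1, C] - \mathrm{E}[y_i|z_i = 0, C] \}.
\end{aligned}
\]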


The  result  in  the  final  line  follows  because  the  terms  corresponding  to  always-takers 
and  never-takers  are  identically  zero.  The  denominator  in  (25.68)  is  the  probability  of 
compliance,  Pr[C].  Therefore, 


\[ \mathrm{TE}_{\mathrm{Wald}} = \mathrm{E}[y_{1i}|z_i = 1, C] - \mathrm{E}[y_{0i}|z_i = 0, C]. \tag{25.69} \]


If  we  compare  TEWald  with  the  LATE  measure,  we  find  that  LATE  is  a  measure  of 
the  effect  of  treatment  on  the  subgroup  of  those  at  the  margin  of  participating,  denoted 
as  compliers. 
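
The decomposition behind (25.69) can be illustrated by simulation. This is a hedged sketch rather than anything taken from the text: the population shares (50% compliers, 30% never-takers, 20% always-takers), the effect sizes, and the variable names are all assumptions made for the example. With no defiers, the Wald ratio should be close to the average effect among compliers, not the average effect among all treated units.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population shares; defiers are ruled out by construction.
kind = rng.choice(["C", "N", "A"], size=n, p=[0.5, 0.3, 0.2])
z = rng.integers(0, 2, size=n)                # eligibility, assigned at random

# Treatment received: always-takers 1, never-takers 0, compliers follow z.
d = np.where(kind == "A", 1, np.where(kind == "N", 0, z))

# Assumed heterogeneous effects: compliers gain 1, always-takers gain 3.
effect = np.where(kind == "C", 1.0, np.where(kind == "A", 3.0, 0.5))
y0 = rng.normal(size=n)                       # outcome without treatment
y = y0 + effect * d                           # observed outcome

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
complier_effect = effect[kind == "C"].mean()  # the LATE in this design
effect_on_treated = effect[d == 1].mean()     # ATET, which mixes in always-takers

print(wald, complier_effect, effect_on_treated)
```

In this design the always-taker and never-taker terms cancel in the numerator because their treatment status does not respond to z, which is the step used to obtain the final line of the decomposition.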

In  empirical  economic  applications  the  concept  of  a  marginal  impact  caused  by 
variation  in  a  continuous  variable,  measured  by  a  partial  derivative,  is  well  entrenched 
and  is  replaced  by  a  discrete  analogue  when  the  variation  in  the  causal  variables  is  dis¬ 
crete.  Thus  a  marginal  treatment  effect  (MTE)  measure  conditional  on  x  is  defined 
as 


\[ \mathrm{MTE} = \left. \frac{\partial \mathrm{E}[y|\mathbf{x}, Z]}{\partial \Pr[D = 1|\mathbf{x}, Z]} \right|_{Z = z}. \tag{25.70} \]


Heckman  and  Vytlacil  (2002)  show  that  ATE,  ATET,  and  LATE  are  all  averages 
of  MTE  taken  over  different  subsets  of  the  Z  support,  or  subpopulations.  ATE  is  the 
expected  value  of  MTE  over  the  full  support  of  z,  including  where  participation  rate  is 
zero  or  one.  ATET  excludes  the  support  of  z  where  participation  does  not  occur.  LATE 
is  the  average  of  MTE  over  an  interval  of  z  where  participation  rates  differ. 


25.7.3.  IV  Estimation  in  a  Model  with  Heterogeneous  Treatment  Effect 

We now consider a model that allows for selection on unobservables and a
heterogeneous treatment effect. The context is that of a linear model with an endogenous
treatment variable whose coefficient is random; see Bjorklund and Moffitt (1987). Such a
model, which can be motivated by the consideration that the treatment effect is not
constant across the treated, has been considered by Wooldridge (1997) and Heckman and
Vytlacil (1998).

We write the model as a simultaneous equations model with the outcome variable
\( y_1 \) that depends upon treatment variable \( y_2 \). For simplicity the treatment variable \( y_2 \) is
taken to be continuous. Given instrument z and exogenous variable \( \mathbf{x}_i \), the model is as
follows:

\[
\begin{aligned}
y_{1i} &= (\alpha + v_i) y_{2i} + \mathbf{x}_i'\boldsymbol{\beta}_1 + \varepsilon_i \\
&= \alpha y_{2i} + \mathbf{x}_i'\boldsymbol{\beta}_1 + \varepsilon_i + v_i y_{2i} \\
&= v_i \bar{y}_2 + \alpha y_{2i} + \mathbf{x}_i'\boldsymbol{\beta}_1 + w_i,
\end{aligned}
\tag{25.71}
\]

\[ y_{2i} = \gamma z_i + \mathbf{x}_i'\boldsymbol{\beta}_2 + \eta_i, \tag{25.72} \]

where \( w_i = \varepsilon_i + v_i (y_{2i} - \bar{y}_2) \). The marginal response of \( y_1 \) with respect to a change
in \( y_2 \) is \( (\alpha + v_i) \), which varies across individuals, thus permitting a heterogeneous
treatment effect.

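Suppose \( \mathrm{E}[\varepsilon_i|\mathbf{x}_i, y_{2i}] = \mathrm{E}[v_i|\mathbf{x}_i, y_{2i}] = 0 \). Then \( \mathrm{E}[\varepsilon_i + v_i y_{2i}|\mathbf{x}_i, y_{2i}] = 0 \), and
\( \mathrm{V}[\varepsilon_i + v_i y_{2i}|\mathbf{x}_i, y_{2i}] \) depends on \( \mathbf{x}_i \) and hence is heteroskedastic. Then the
least-squares estimator of \( (\alpha, \boldsymbol{\beta}_1) \) is consistent but not efficient. This follows from the
assumed exogeneity of \( y_2 \).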

We  next  consider  the  case  where  the  treatment  variable  is  endogenous.  The  follow¬ 
ing  assumptions  are  made: 

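\[ \mathrm{E}[\varepsilon_i|\mathbf{x}_i, z_i] = \mathrm{E}[v_i|\mathbf{x}_i, z_i] = \mathrm{E}[\eta_i|\mathbf{x}_i, z_i] = 0, \tag{25.73} \]

\[ \mathrm{E}[\varepsilon_i^2|\mathbf{x}_i, z_i] = \sigma_\varepsilon^2; \quad \mathrm{E}[v_i^2|\mathbf{x}_i, z_i] = \sigma_v^2; \quad \mathrm{E}[\eta_i^2|\mathbf{x}_i, z_i] = \sigma_\eta^2. \tag{25.74} \]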

Endogeneity is introduced by permitting correlation between \( v \) and \( \eta \). Specifically,
assume that \( \mathrm{E}[v_i|\eta_i] = \rho \eta_i \), which would hold if \( (v, \eta) \) were bivariate normal
distributed. Under these assumptions, z is a valid instrument, and x is exogenous. The
exclusion of z from the \( y_1 \) equation is an identifying restriction. Therefore instrumental
variable estimation of (25.71) with instruments (z, x) is a natural estimator. Note,
however, that the condition for consistent estimation is \( \mathrm{E}[w_i|\mathbf{x}_i, z_i] = 0 \). The first
component of \( w_i \), \( \varepsilon_i \), is uncorrelated with \( z_i \) by assumption; the second component of \( w_i \) is
\( v_i (y_{2i} - \bar{y}_2) \), which may at first sight seem to be correlated with \( z_i \), on which \( y_{2i} \)
depends. If so, the IV estimator would be inconsistent. However, it can be shown that
the IV estimator is consistent under the preceding assumptions. The key step in
the argument involves showing that \( \mathrm{E}[v_i y_{2i}|z_i] = \mathrm{E}[v_i y_{2i}] \), a result established in
Wooldridge (1997) by applying the law of iterated expectations; thus,

\[
\begin{aligned}
\mathrm{E}[v y_2|z] &= \mathrm{E}[\mathrm{E}[v y_2|z, \eta]|z] \\
&= \mathrm{E}[y_2 \mathrm{E}[v|z, \eta]|z] = \mathrm{E}[\rho \eta y_2|z] \\
&= \rho \mathrm{E}[\eta^2|z] = \rho \sigma_\eta^2 = \mathrm{E}[v y_2].
\end{aligned}
\tag{25.75}
\]
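
The consistency claim can be explored with a small simulation. The sketch below is an illustration under assumed values, not a result from the text: it drops the regressors x except for an intercept, sets \( \alpha = 1.5 \), \( \gamma = 1 \), \( \rho = 0.8 \), and gives the instrument a nonzero mean so that least squares is visibly biased. With an intercept and one instrument, both slope estimators reduce to ratios of sample covariances.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
alpha, gamma, rho = 1.5, 1.0, 0.8              # illustrative values (assumptions)

z = rng.normal(loc=1.0, scale=1.0, size=n)     # instrument, independent of the errors
eta = rng.normal(size=n)                       # treatment-equation error
v = rho * eta + rng.normal(scale=0.6, size=n)  # random coefficient with E[v | eta] = rho * eta
eps = rng.normal(size=n)                       # outcome error

y2 = gamma * z + eta                           # treatment, (25.72) without x
y1 = (alpha + v) * y2 + eps                    # outcome, (25.71) without x

# Just-identified IV and OLS slopes (intercept included) as covariance ratios.
iv_slope = np.cov(y1, z)[0, 1] / np.cov(y2, z)[0, 1]
ols_slope = np.cov(y1, y2)[0, 1] / np.var(y2, ddof=1)

print(iv_slope, ols_slope)
```

In this design the IV slope should be close to 1.5 while the OLS slope is pulled upward, because the intercept absorbs the constant \( \mathrm{E}[v y_2] = \rho \sigma_\eta^2 \) and the remaining part of \( v y_2 \) is uncorrelated with z but not with \( y_2 \).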


Although  the  IV  estimator  is  consistent  under  the  assumptions  given  here,  it  is  not 
efficient  because  of  the  heteroskedastic  error.  Hence  heteroskedastic-consistent  stan¬ 
dard  errors  should  be  used.  Finally,  we  have  not  tackled  the  issue  of  sensitivity  of  esti¬ 
mated  treatment  effects  to  the  choice  of  instruments  when  the  response  to  treatment  is 
heterogeneous.


25.7.4.  Endogenous  Treatment  in  Nonlinear  Models

Consider how the analyses of Sections 25.3 and 25.7 change if the outcome of a job
training program were employment rather than earnings, or were duration to job
placement. Alternatively, suppose that after training a significant proportion remains
unemployed and has zero earnings, so that the sample is a mixture of those with zero and
positive earnings and hence will be nonnormal. How should one extend the previous
methods to handle the complications of nonlinearity and nonnormality?

The  specification  and  estimation  of  nonlinear,  nonnormal  models  of  treatment  and 
outcome  with  selection  is  an  issue  that  occurs  frequently  in  microeconometrics.  As  in 
linear  models,  a  major  focus  in  such  models  is  on  the  effect  of  an  endogenous  treat¬ 
ment  variable  on  an  economic  outcome.  The  model  specification  comprises  an  out¬ 
come  equation  with  a  structural-causal  interpretation  and  other  equations  that  model 
the  generating  process  of  treatment  variables.  There  are  two  broad  approaches  to  this 
problem,  a  parametric  one  that  relies  on  likelihood-based  (including  Bayesian)  meth¬ 
ods  and  a  semiparametric  one  that  relies  on  GMM  or  linearized  IV  methods. 

The  typical  setup  is  illustrated  by  the  following  selected  examples.  In  labor  eco¬ 
nomics,  Bingley  and  Walker  (2001)  examine  the  effect  of  duration  of  husbands’  un¬ 
employment  on  wives’  discrete  labor  supply  choices.  Here  the  treatment  variable  is 
nonnegative  and  possibly  censored  or  truncated.  Pitt  and  Rosenzweig  (1990)  study  the 
effect  of  endogenous  health  status  of  infant  children  on  their  mothers’  main  daily  ac¬ 
tivity;  here  the  treatment  variable  is  discrete  and  the  outcome  is  continuous.  Carrasco 
(2001)  examines  the  effect  of  childbirth  on  labor  force  participation  of  women.  In 
treatment-outcome  models  related  to  fertility,  Jensen  (1999)  examines  the  effect  of 
contraceptive  use,  a  discrete  variable,  on  duration  between  births,  a  limited  dependent 
variable.  Olsen  and  Farkas  (1989)  examine  the  effect  of  childbirth  on  the  hazard  of 
dropping  out  of  school.  In  health  economics,  Kenkel  and  Terza  (2001)  examine  the 
effect  of  physician  advice  (discrete)  on  the  consumption  of  alcohol  (continuous  and 
nonnegative).  Gowrisankaran  and  Town  (1999)  study  the  effect  of  hospital  choice  on 
the  hazard  of  death  in  a  hospital.  In  health  economics  the  impact  of  health  insurance 
choice  on  health  care  utilization,  sometimes  measured  as  an  expenditure  variable  and 
sometimes  as  a  count  of  number  of  units  of  some  specific  type  of  service  such  as  doctor 
visits  or  hospital  admissions,  is  frequently  studied  using  the  framework  of  a  two-part 
model  (Deb  and  Trivedi  1997).  Terza  (1998)  and  van  Ophem  (2000)  model  the  effect 
of  household  vehicle  ownership  on  counts  of  trips.  Many  other  examples  can  be  cited. 

These  models  share  many  statistical  features.  First,  both  treatment  and  outcome  pro¬ 
cesses  are  nonnormal  and  nonlinear:  multinomial,  count,  discrete,  or  censored.  Second, 
in  each  model  the  treatment  is  endogenous.  Finally,  investigators  often  have  good  a  pri¬ 
ori  reasons  for  choosing  particular  parametric  marginal  models  for  both  treatments  and 
outcomes.  However,  the  transition  from  given  marginal  distributions  to  a  joint  model 
for  treatment  and  outcome  is  an  essential  step  that  is  potentially  problematic  when 
nonnormal  multivariate  distributions  are  involved.  Often  the  marginal  models  have  no 
(or  very  restrictive)  tractable  multivariate  counterparts  (e.g.,  in  models  of  counts  and 
durations).  In  others,  treatment  and  outcome  are  from  different  statistical  families  (e.g., 
treatment being a multinomial and the outcome being a hazard rate) and so analytically
tractable multivariate distributions often do not exist. Because of the specialized nature
of applications in this area, this topic is not pursued any further here.

25.8.  Example:  The  Effect  of  Training  on  Earnings 

The  National  Supported  Work  (NSW)  demonstration  project,  conducted  in  the  1970s, 
measured  the  impact  of  training  on  earnings  by  a  randomized  experiment  that  assigned 
some  individuals  to  receive  training  (a  treatment  group)  and  others  to  receive  no  train¬ 
ing  (a  control  group).  The  effect  of  training  could  then  be  measured  by  direct  compar¬ 
ison  of  sample  means  of  posttreatment  earnings  for  the  treatment  and  control  groups. 

As  was  discussed  in  Chapter  3,  randomized  experiments  are  relatively  rare  in  the 
social  sciences.  More  often  an  observational  sample  is  used  with  some  individuals 
observed  to  receive  a  treatment  while  others  do  not.  Comparison  of  the  treated  with  the 
nontreated  must  then  control  for  differences  in  observed  characteristics,  and  possibly 
in  unobserved  characteristics. 

To  determine  the  adequacy  of  standard  microeconometric  methods  for  observational 
data,  Lalonde  (1986)  contrasted  outcomes  for  the  NSW  treated  group  with  those  for 
control  groups  drawn  from  two  national  surveys.  He  obtained  results  that  differed  sub¬ 
stantially  from  the  experimental  results  that  contrasted  the  NSW  treated  and  control 
groups,  and  he  concluded  that  the  observational  methods  were  unreliable. 

Dehejia  and  Wahba  (1999,  2002)  reanalyzed  a  subset  of  the  Lalonde  data  using  al¬ 
ternative  matching  methods,  which  they  argued  led  to  conclusions  from  observational 
data  that  were  considerably  closer  to  those  from  experimental  data.  In  this  section  we 
use  their  data  from  Dehejia  and  Wahba  (1999)  to  illustrate  the  application  of  methods 
introduced  in  Sections  25.2  to  25.5  that  control  only  for  selection  on  observables. 

25.8.1.  Dehejia  and  Wahba  Data 

The  treated  sample  is  one  of  185  males  who  received  training  during  1976-1977.  The 
control  group  consists  of  2,490  male  household  heads  under  the  age  of  55  who  are 
not  retired,  drawn  from  the  PSID.  Dehejia  and  Wahba  (1999)  call  these  two  samples 
the  RE74  subsample  (of  the  NSW  treated)  and  the  PSID-1  sample  (of  nontreated). 
The  treatment  indicator  variable  D  is  defined  as  D  =  1  if  training  is  received  (so  the 
observation  is  in  the  treated  sample)  and  D  =  0  if  no  training  was  received  (and  the 
observation  is  in  the  control  sample). 

Summary  statistics  for  key  variables  are  given  in  Table  25.3.  The  treated  group 
differs  considerably  from  the  control  group,  being  disproportionately  black  (84%)  with 
less  than  a  high  school  degree  (71%)  and  unemployed  in  the  pre-treatment  year  1975 
(71%).  Estimates  of  the  effect  of  training  should  control  for  these  differences. 

25.8.2.  Control  Function  Approach 

Various  estimates  of  the  effect  of  training  on  earnings  are  given  in  Table  25.4. 

Table 25.3. Training Impact: Sample Means in Treated and Control Samples^a

Variable     Definition                            Treated    Control
AGE          Age in years                            25.82      34.85
EDUC         Education in years                      10.35      12.12
NODEGREE     1 if EDUC < 12                           0.71       0.31
BLACK        1 if race is black                       0.84       0.25
HISP         1 if Hispanic                            0.06       0.03
MARR         1 if married                             0.19       0.87
U74          1 if unemployed in 1974                  0.60       0.10
U75          1 if unemployed in 1975                  0.71       0.09
RE74         Real earnings in 1974 (in 1982 $)       2,096     19,429
RE75         Real earnings in 1975 (in 1982 $)       1,532     19,063
RE78         Real earnings in 1978 (in 1982 $)       6,349     21,554
D            1 if received training (treatment)       1.00       0.00
Sample size                                            185      2,490

a Data are the same as in table 1 of Dehejia and Wahba (1999). The treated group is the RE74
subsample of the NSW. The control group is the PSID-1 sample of male household heads under
55 years and not yet retired. Treatment occurred in 1976-1977.

The outcome of interest is posttreatment earnings, RE78. One possible measure of
the effect of training is the mean difference in RE78 between NSW treated and PSID
control individuals, leading to the estimate $6,349 - $21,554 = -$15,205. This is
called a treatment-control comparison estimator as it mimics the analysis in an
experimental setting. It can equivalently be computed as the coefficient of the
treatment indicator D in OLS regression of RE78 on an intercept and D, using a combined
treatment-control sample.
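
The equivalence between the difference in means and the OLS coefficient on D can be verified directly. The sketch below first reproduces the arithmetic from the Table 25.3 means and then checks the identity on simulated earnings; the simulated values are placeholders and are not the NSW or PSID data.

```python
import numpy as np

# Treatment-control comparison from the RE78 means in Table 25.3.
print(6349 - 21554)

rng = np.random.default_rng(3)
d = np.concatenate([np.ones(185), np.zeros(2490)])   # same group sizes as the example

# Simulated earnings (assumed values, NOT the actual data).
y = np.where(d == 1, 6000.0, 21000.0) + rng.normal(scale=5000.0, size=d.size)

# OLS of y on an intercept and D.
X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

diff_means = y[d == 1].mean() - y[d == 0].mean()
print(coef[1], diff_means)
```

The slope coincides with the difference in group means because a regression on an intercept and a single dummy simply fits the two group averages.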

The  large  treatment  estimate  is  misleading  as  it  mostly  reflects  the  difference  in  the 
types  of  individuals  in  the  two  samples  -  the  control  sample  individuals  are  not  good 
controls.  This  difference  can  be  controlled  for  by  including  pretreatment  characteristics 
as  regressors,  and  estimating  by  OLS 

\[ \mathrm{RE78}_i = \mathbf{x}_i'\boldsymbol{\beta} + \alpha D_i + u_i, \quad i = 1, \ldots, 2675. \tag{25.76} \]

This leads to a much smaller estimated treatment effect \( \hat{\alpha} \) = $218 when, following
Dehejia and Wahba, the regressors x are specified to be an intercept, AGE, AGESQ,
EDUC, NODEGREE, BLACK, HISP, RE74, and RE75. This approach is called the
control function estimator in Section 25.3.3.


25.8.3.  Differences  in  Differences 

A second approach is a before-after comparison, which looks at the difference
between posttreatment earnings RE78 and pretreatment earnings RE75. Using mean
earnings for the treated group leads to the difference estimate $6,349 - $1,532 =
$4,817.

This estimate may be misleading as it reflects all changes over this time period,
such as an improved economy, and not just training. The difference-in-differences
estimator, considered in Section 25.5, additionally calculates a similar quantity
for the control group, $21,554 - $19,063 = $2,491, and uses this as a measure of
nontreatment related changes over time in earnings, so that the change over time solely
due to treatment is $4,817 - $2,491 = $2,326.

Table 25.4. Training Impact: Various Estimates of Treatment Effect

Method                          Definition                               Estimate   St. Error^a
Treatment-control comparison    mean RE78 (D=1) - mean RE78 (D=0)         -15,205       656
Control function estimator      alpha from OLS regression (25.76)             218       768
Before-after comparison         mean RE78 (D=1) - mean RE75 (D=1)           4,817       625
Differences-in-differences      alpha from OLS regression (25.77)           2,326       749
Propensity score                See Section 25.8.4                            995         -

a Standard errors for the first four estimates are computed using heteroskedastic-consistent
standard errors from the appropriate OLS regression.

The  DID  estimator  can  be  shown  to  be  equivalent  to  the  estimate  of  a  in  the  OLS 
regression 

\[ \mathrm{RE}_{it} = \phi + \delta\, \mathrm{D78}_{it} + \gamma D_i + \alpha\, \mathrm{D78}_{it} \times D_i + u_{it}, \quad i = 1, \ldots, 2675, \; t = 75, 78. \tag{25.77} \]

Here \( \mathrm{RE}_{i,75} \) denotes earnings in the pretreatment period and \( \mathrm{RE}_{i,78} \) denotes earnings
in the posttreatment period, so the regression is one with 5,350 earnings observations.
The indicator variable \( \mathrm{D78}_{it} \) equals one in the posttreatment period, the indicator
variable \( D_i \) equals one if the individual is in the treated sample, and the interaction term
\( \mathrm{D78}_{it} \times D_i \) equals one for treated individuals in the posttreatment period.

More generally, the intercept \( \phi \) in (25.77) can be replaced by \( \mathbf{x}_{it}'\boldsymbol{\beta} \). This makes no
difference in this example where regressors are time-invariant so that \( \mathbf{x}_{it} = \mathbf{x}_i \). The
method can be applied to repeated cross-section data (see Section 22.6.2) as it does
not require that individuals in the treated and control groups be observed in both 1975
and 1978.
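
The DID calculation and its regression form can be sketched as follows. The first line uses the sample means reported in Table 25.3; the regression check uses simulated earnings with assumed means and variances (not the actual data) to show that the interaction coefficient in a regression of the form (25.77) equals the difference of the two before-after differences.

```python
import numpy as np

# DID from the sample means in Table 25.3: (treated change) - (control change).
did_from_means = (6349 - 1532) - (21554 - 19063)
print(did_from_means)

rng = np.random.default_rng(4)
d = np.concatenate([np.ones(185), np.zeros(2490)])

# Simulated pre- and post-period earnings (assumed values, NOT the actual data).
re75 = np.where(d == 1, 1500.0, 19000.0) + rng.normal(scale=3000.0, size=d.size)
re78 = re75 + 2500.0 + 2300.0 * d + rng.normal(scale=3000.0, size=d.size)

# Stack the two periods: 2 x 2675 = 5350 rows.
earn = np.concatenate([re75, re78])
d78 = np.concatenate([np.zeros(d.size), np.ones(d.size)])   # posttreatment indicator
di = np.concatenate([d, d])                                  # treated-sample indicator
X = np.column_stack([np.ones_like(earn), d78, di, d78 * di])
coef, *_ = np.linalg.lstsq(X, earn, rcond=None)

did_ols = coef[3]
did_direct = (re78[d == 1].mean() - re75[d == 1].mean()) - (
    re78[d == 0].mean() - re75[d == 0].mean()
)
print(did_ols, did_direct)
```

The regression is saturated in the two indicators, so its fitted values are the four cell means and the interaction coefficient reproduces the double difference exactly.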


25.8.4.  Simple  Propensity  Score  Estimate 

A third approach compares the outcome RE78 for a treated individual with a
counterfactual prediction of RE78 if the same treated individual had not in fact received the
treatment. The initial treatment-control estimate of -$15,205 is an oversimplified
example that uses as counterfactual the average of RE78 in the control group ($21,554).
Better counterfactuals can be generated by specifying a regression model. For
example, the regression (25.76) specifies E[RE78|x] to equal \( \mathbf{x}'\boldsymbol{\beta} + \alpha \), if treated, with
counterfactual \( \mathbf{x}'\boldsymbol{\beta} \), if not treated. This places restrictions on both the effect of regressors
x and on the effect of treatment, which, conditional on x, is assumed to be constant
across individuals.

The treatment effects literature emphasizes counterfactuals that do not rely on
such strong assumptions. An obvious approach is to compare treated and untreated
individuals with the same value of x, but in practice such matching on regressors
is not possible if several regressors are felt to be relevant and these regressors take a
number of different values.

Figure 25.3: Training impact: post-treatment earnings plotted against propensity score by
treatment status. Only observations with common support for the propensity score are
included. Observations with earnings over $20,000 are excluded from the scatter plot, for
readability, though they are included in the nonparametric regression.


Instead, it can be sufficient, given assumptions detailed in Sections 25.3 and 25.4,
to match on the propensity score, defined as the conditional probability of treatment
Pr[D = 1|x]. For this example we estimate using only data for the initial year 1975
the logit model

\[ \Pr[D_i = 1|\mathbf{x}_i] = \Lambda(\mathbf{x}_i'\boldsymbol{\beta}), \quad i = 1, \ldots, 2675, \tag{25.78} \]

where, from Section 14.2, \( \Lambda(z) = e^z / (1 + e^z) \), and following Dehejia and Wahba
(1999) the regressors chosen are AGE, AGESQ, EDUC, EDUCSQ, NODEGREE,
BLACK, HISP, MARR, RE74, RE75, RE74SQ, RE75SQ, and U74*BLACK.
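
As a rough illustration of how a logit propensity score of the form (25.78) can be fitted, the sketch below implements Newton-Raphson maximum likelihood with NumPy on simulated data. The function name `fit_logit`, the single regressor, and the coefficient values are assumptions for the example; this is not the specification or the software used in the text.

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Logit MLE by Newton-Raphson; X must already contain a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # Lambda(x'beta)
        grad = X.T @ (d - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # minus the Hessian
        beta = beta + np.linalg.solve(hess, grad)     # Newton step
    return beta

rng = np.random.default_rng(5)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-1.0, 0.8])                     # illustrative values only
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, d)
pscore = 1.0 / (1.0 + np.exp(-X @ beta_hat))          # fitted propensity scores
print(beta_hat, pscore.min(), pscore.max())
```

In applied work a packaged logit routine would normally be used; the hand-written loop is shown only to make the link between (25.78) and the fitted score explicit.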

Figure  25.3  plots  posttreatment  earnings  RE78  against  the  propensity  score,  sep¬ 
arately  for  the  treated  and  control  samples.  Considering  just  the  propensity  score  (x 
axis)  it  is  clear  that  most  observations  in  the  control  sample  have  very  low  propen¬ 
sity  score,  an  expected  result  given  the  Table  25.3  data  that  treated  individuals  were 
disproportionately  black,  unemployed,  low-education  individuals. 

Turning  to  the  posttreatment  outcome  RE78  (y  axis),  we  see  that  the  treatment  effect 
is  estimated  as  the  difference  between  a  given  treated  individual  ( D  =  1)  and  a  control 
sample  individual  ( D  =  0)  with  the  same  (predicted)  propensity  score.  Each  panel 
in  Figure  25.3  includes  a  fitted  nonparametric  regression  of  RE78  on  the  propensity 
score.  The  treatment  effect  is  less  than  one  thousand  dollars  over  much  of  the  range 
of  propensity  score,  though  it  is  considerably  larger  and  positive  for  propensity  score 
around  0.80. 

There are many ways to implement this approach of comparing individuals with
similar propensity score and then averaging over all treated individuals. One strategy
is to match a treated individual with the control-sample individual who has the closest
propensity score. This approach was labeled as the nearest-neighbor matching in
Section 25.4.4. A simpler strategy is to stratify data by propensity score, denoted p(x), and
let the counterfactual be the within-strata average of RE78 for the control group. For
example, if a treated observation has propensity score p(x) = 0.35 then the
counterfactual may be the average of RE78 for control group observations with 0.30 < p(x) <
0.40. The total effect is then \( \sum_s w_s (\overline{\mathrm{RE78}}_{s,D=1} - \overline{\mathrm{RE78}}_{s,D=0}) \), where \( \overline{\mathrm{RE78}}_{s,D=1} \) and
\( \overline{\mathrm{RE78}}_{s,D=0} \) denote the stratum s averages of RE78 for, respectively, the treated and
control observations, and the weights \( w_s \) equal the fraction of treated observations in each
stratum. A simple stratification scheme uses, say, 10 equally spaced strata with 0.0 <
p(x) < 0.1, 0.1 < p(x) < 0.2, and so on. This was referred to as stratification
matching in Section 25.4.4. This procedure should be restricted to cases where the propensity
scores for the treated and control samples overlap, see Section 25.4.3. Here the
propensity score ranges from 0.0005 to 0.9420 for the treated sample and from 0.0000 to
0.9371 for the control sample, leading to dropping of 1,423 control group
individuals and 8 treated individuals. The resulting estimated total effect is $995 given in
Table 25.4.
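
The stratification estimator described above can be sketched in a few lines. The function below is an illustrative implementation, not the one used to produce the $995 estimate: the name `stratified_atet` is hypothetical, strata are half-open intervals [lower, upper) of equal width, and strata lacking either treated or control observations are skipped, which is a simple stand-in for the common support restriction.

```python
import numpy as np

def stratified_atet(y, d, pscore, n_strata=10):
    """Sum over strata of w_s * (treated mean - control mean), with w_s the treated share."""
    y, d, pscore = (np.asarray(a) for a in (y, d, pscore))
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    s = np.digitize(pscore, edges[1:-1])          # stratum index 0 .. n_strata - 1
    total, n_used = 0.0, 0
    for k in range(n_strata):
        treated = (s == k) & (d == 1)
        control = (s == k) & (d == 0)
        if treated.any() and control.any():       # both groups needed for a comparison
            total += treated.sum() * (y[treated].mean() - y[control].mean())
            n_used += treated.sum()
    return total / n_used

# Tiny made-up example: two strata, each with one treated unit.
y = np.array([10.0, 4.0, 6.0, 20.0, 12.0])
d = np.array([1, 0, 0, 1, 0])
p = np.array([0.15, 0.12, 0.18, 0.75, 0.71])
print(stratified_atet(y, d, p))  # 6.5
```

In the toy data the within-stratum differences are 10 - 5 = 5 and 20 - 12 = 8, each stratum holds half of the treated units, and the weighted average is 6.5.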


25.8.5.  Matching  Using  Propensity  Scores 

As  mentioned  in  Section  25.4,  other  matching  strategies  include  radius  and  kernel 
matching,  which  are  also  relatively  easy  to  implement.  The  remainder  of  this  chapter 
details  these  and  other  approaches,  with  emphasis  on  propensity  score  methods. 

Fitted  Propensity  Score 

The  fitted  propensity  score  is  obtained  using  two  different  logit  specifications,  from 
Dehejia  and  Wahba  (1999)  and  Dehejia  and  Wahba  (2002),  respectively.  The  specifi¬ 
cations  for  propensity  scores  are  detailed  at  the  bottom  of  Table  25.6.  In  the  only  de¬ 
parture  from  Dehejia  and  Wahba  (1999,  2002),  a  constant  term  is  included  in  our  logit 
models.  The  estimated  coefficients,  not  presented  to  save  space,  show  an  expected  sign 
pattern. 

Matching  Algorithms  and  Balancing 

An  important  practical  issue  is  the  choice  of  an  appropriate  matching  algorithm  based 
on  propensity  scores  that  ensures  that  balancing  condition  (25.9)  is  met.  Dehejia  and 
Wahba  (2002,  p.  161)  provide  an  algorithm  that  starts  with  a  parsimonious  logit  model 
to estimate p(x). The algorithm works as follows. The data are sorted according to
p(x). The sample observations are stratified such that within a stratum the p(x) for
treated and control units are close. For example, initially a rough grid with equal ranges
may  be  used.  Within  each  stratum  the  equality  of  means  between  treated  and  control 
units  should  be  tested  for  each  covariate.  If  there  is  no  statistically  significant  differ¬ 
ence,  then  the  regressors  are  balanced  between  the  treated  and  control  groups  and  one 
can  stop.  If,  for  some  stratum,  there  is  no  balance,  then  for  the  unbalanced  stratum  a 
finer  grid  is  used  to  achieve  balance.  If  there  are  many  unbalanced  strata,  then  the  orig¬ 
inal  logit  model  is  reestimated  with  an  improved  specification  that  includes  interaction 
and higher order terms among the regressors.

Table 25.5. Training Impact: Distribution of Propensity Score^a for Treated and Control
Units Using Dehejia and Wahba's (1999) Specification

Minimum p(x)    Treated    Untreated    Total
0.000364              9          960      969
0.10                 10           56       66
0.20                 14           33       47
0.40                 24           22       46
0.60                 33            7       40
0.80                 95            8      103
Total               185         1086     1271

a From the second row, for example, the propensity score lies between
0.10 and 0.20 for 10 treated and 56 untreated individuals.
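
The within-stratum balance check in the algorithm described above can be sketched as follows. This is a simplified illustration and not the Becker and Ichino (2002) software: the function name `unbalanced_cells`, the use of an unequal-variance t statistic, and the 1.96 cutoff are assumptions made for the example.

```python
import numpy as np

def unbalanced_cells(X, d, pscore, edges, t_crit=1.96):
    """Return (stratum, covariate) index pairs where treated and control means differ."""
    X, d, pscore = np.asarray(X, float), np.asarray(d), np.asarray(pscore)
    s = np.digitize(pscore, edges[1:-1])          # stratum index for each unit
    flagged = []
    for k in range(len(edges) - 1):
        t_rows = X[(s == k) & (d == 1)]
        c_rows = X[(s == k) & (d == 0)]
        if len(t_rows) < 2 or len(c_rows) < 2:
            continue                              # too few units to test in this stratum
        diff = t_rows.mean(axis=0) - c_rows.mean(axis=0)
        se = np.sqrt(t_rows.var(axis=0, ddof=1) / len(t_rows)
                     + c_rows.var(axis=0, ddof=1) / len(c_rows))
        t_stat = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)
        flagged += [(k, int(j)) for j in np.flatnonzero(np.abs(t_stat) > t_crit)]
    return flagged

# Made-up data: one stratum, covariate 0 balanced, covariate 1 clearly unbalanced.
X = np.array([[1, 10], [2, 11], [3, 12], [4, 13],
              [1, 0], [2, 1], [3, 2], [4, 3]], dtype=float)
d = np.array([1, 1, 1, 1, 0, 0, 0, 0])
pscore = np.full(8, 0.05)
flagged = unbalanced_cells(X, d, pscore, np.linspace(0.0, 1.0, 11))
print(flagged)
```

A stratum that appears in the returned list would be split on a finer grid, and if many strata are flagged the logit specification would be revisited, following the steps in the text.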


Using  the  software  of  Becker  and  Ichino  (2002),  Dehejia  and  Wahba’s  (2002)  algo¬ 
rithm  is  used  to  compute  the  propensity  scores.  In  all  of  the  cases  noted,  the  propen¬ 
sity  score  computation  has  been  restricted  to  the  common  support  region  by  testing 
the  balancing  property  using  those  observations  whose  propensity  scores  lie  in  the 
intersection  of  the  supports  of  the  propensity  score  of  the  treated  and  the  control  units. 
This  restriction  reduces  the  original  sample  significantly.  The  size  of  the  control  group 
drops  from  2,490  units  to  1,086  for  the  Dehejia  and  Wahba  (2002)  specification. 

Table  25.5  displays  the  number  of  treated  and  control  units  in  different  blocks  after 
the  balancing  is  carried  out  by  the  procedure  just  outlined.  The  reported  results  differ 
from  those  of  Dehejia  and  Wahba  (2002)  because  the  latter  exclude  control  units  from 
NSW-PSID  composite  samples  not  on  the  basis  of  common  support  region  but  on 
the  basis  of  whether  the  estimated  propensity  score  of  a  sample  unit  is  less  than  the 
minimum of the estimated propensity score for the treated units. The table shows that
the ratio of treated units to control units is very low in the first blocks compared
with the remaining blocks.
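
The block tabulation can be sketched in a few lines. This is a minimal illustration, not the Becker and Ichino (2002) software: the function name `block_counts`, the half-open block convention, and the six made-up observations are all assumptions introduced here.

```python
import numpy as np


def block_counts(pscore, treated, lower_bounds):
    """Count treated and untreated units in each propensity-score block.

    Block k is the half-open interval [lower_bounds[k], lower_bounds[k+1]);
    the last block is open-ended above (an assumption of this sketch).
    """
    pscore = np.asarray(pscore, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    edges = np.append(np.asarray(lower_bounds, dtype=float), np.inf)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_block = (pscore >= lo) & (pscore < hi)
        n_treated = int(np.sum(in_block & treated))
        n_control = int(np.sum(in_block & ~treated))
        rows.append((float(lo), n_treated, n_control, n_treated + n_control))
    return rows


# Hypothetical estimated scores and treatment indicators, for illustration only.
pscore = [0.05, 0.15, 0.15, 0.45, 0.90, 0.95]
treated = [0, 0, 1, 1, 1, 0]
rows = block_counts(pscore, treated, [0.0, 0.10, 0.20, 0.40, 0.60, 0.80])
for lo, n_t, n_c, n in rows:
    print(f"{lo:5.2f}  treated={n_t}  untreated={n_c}  total={n}")
```

A block with many untreated and few treated units, like the first row of Table 25.5, signals that most comparison units resemble the treated only weakly.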

A  similar  exercise  for  the  Dehejia  and  Wahba  (1999)  specification,  not  tabulated 
for  brevity,  leads  to  similar  results.  The  control  group  has  1,146  observations.  The 
boundary values for blocking $\hat{p}(\mathbf{x})$ are then 0.0006526, 0.05, 0.10, 0.20, 0.40, 0.60,
and  0.80. 


ATET  Estimates  by  Matching  Methods 

A selection of results for various matching methods is summarized in Table 25.6. The
nearest-neighbor estimate of ATET for the Dehejia and Wahba (2002) specification is
$2,385, and for the Dehejia and Wahba (1999) specification it is approximately $560.
The performance of stratification and kernel matching is also mixed, with estimates of
ATET ranging from $1,309 to $2,156.

For comparison, Dehejia and Wahba's (2002) ATET estimates are reproduced in
Table 25.7. We also note that the benchmark estimate of the treatment effect is $1,794.
It is obtained by regressing RE78 on D for Dehejia and Wahba's (2002) version of
the NSW sample of both participants and nonparticipants. It is clear that the ATET
estimates reported in Table 25.6 differ significantly from those of Dehejia and Wahba
(2002), as well as from the benchmark actual experimental estimate. For the Dehejia
and Wahba (2002) specification, the nearest-neighbor estimator is very close to the
benchmark estimate and is even better than the results of Dehejia and Wahba (2002)
in terms of reduced bias.

Table 25.6. Training Impact: Estimates of ATET

Matching                  Number    Number in             Standard    % of
Procedure                 Treated   Control      ATET     Error       $1794^e

Dehejia and Wahba (2002) specification^a
  Nearest neighbor          185        53        2385     1209^c        133
  Radius, r = 0.001          54       517       -7815     1118^d       -436
  Radius, r = 0.0001         24        92       -9333     2282^d       -520
  Radius, r = 0.00001        15        19       -2200     2986^d       -123
  Stratification            185      1086        1452     1041^c         81
  Kernel                    185      1058        1309      975^c         73

Dehejia and Wahba (1999) specification^b
  Nearest neighbor          185        57         560     1098^c         31
  Radius, r = 0.001          57       583       -9358      997^d       -522
  Radius, r = 0.0001         27        76       -7847     2066^d       -437
  Radius, r = 0.00001        16        13         223     4551^d         12
  Stratification            185      1146        2156      814^c        120
  Kernel                    185      1146        1518      890^c         85

a Logit Model: Pr[treat = 1] = F(CONSTANT, AGE, AGE^2, EDU, EDU^2, MARRIED, NODEGREE, BLACK,
  HISPANIC, RE74, RE74^2, RE75, U74, U75, U74*HISPANIC).
b Logit Model: Pr[treat = 1] = F(CONSTANT, AGE, AGE^2, EDU, EDU^2, MARRIED, NODEGREE, BLACK,
  HISPANIC, RE74, RE74^2, RE75, RE75^2, RE74*RE75, U74*BLACK).
c Bootstrapped standard errors with 200 replications.
d Analytical standard errors.
e ATET/1794 x 100.

For  stratification  and  kernel  estimates,  the  bias  is  larger.  For  the  radius  matching 
estimator,  this  bias  is  worse,  and  gives  negative  estimates  of  the  treatment  effect  as 
opposed  to  the  positive  estimates  that  Dehejia  and  Wahba  (2002)  found  using  caliper 
matching.  The  difference  between  our  radius  matching  and  the  caliper  matching  of 
Dehejia  and  Wahba  (2002)  is  that  in  the  latter  scheme,  when  a  given  treated  unit  does 
not  have  a  match  within  the  given  caliper,  matching  is  then  done  with  the  nearest 
comparison  unit  outside  of  the  given  caliper.  In  our  case,  if  such  a  situation  arises,  we 
ignore  treated  units  that  have  no  match  in  the  prespecified  radius.  This  illustrates  the 
sensitivity  of  the  matching  estimators  to  assumptions. 
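
The contrast between the two matching rules can be made concrete with a small sketch. It is an illustration under stated assumptions, not the estimator used to produce Table 25.6: the function `atet_radius`, its `fallback_nearest` switch, and the five observations are hypothetical. With `fallback_nearest=False`, treated units without a control inside the radius are dropped, as in our radius matching; with `fallback_nearest=True`, they are matched to the nearest control outside the radius, mimicking the caliper scheme described above.

```python
import numpy as np


def atet_radius(y, d, p, r, fallback_nearest=False):
    """ATET by radius matching on an estimated propensity score p.

    Returns (estimate, number of treated units actually used).
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=int)
    p = np.asarray(p, dtype=float)
    y_c, p_c = y[d == 0], p[d == 0]
    gaps = []
    for y_i, p_i in zip(y[d == 1], p[d == 1]):
        dist = np.abs(p_c - p_i)
        inside = dist <= r
        if inside.any():
            # Average outcome of all controls within the radius.
            gaps.append(y_i - y_c[inside].mean())
        elif fallback_nearest:
            # No control within the radius: use the single nearest control.
            gaps.append(y_i - y_c[dist.argmin()])
        # Otherwise the treated unit is dropped from the estimate.
    if not gaps:
        return float("nan"), 0
    return float(np.mean(gaps)), len(gaps)


# Made-up data: the second treated unit has no control within r = 0.05.
y = [10.0, 20.0, 8.0, 6.0, 5.0]
d = [1, 1, 0, 0, 0]
p = [0.30, 0.90, 0.31, 0.29, 0.50]

strict = atet_radius(y, d, p, r=0.05)
relaxed = atet_radius(y, d, p, r=0.05, fallback_nearest=True)
print(strict, relaxed)
```

In this toy example the two rules use different numbers of treated units and give different estimates, which is the mechanism behind the shrinking "Number Treated" column for small radii in Table 25.6.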

Table 25.7. Training Evaluation: Dehejia and Wahba's (2002) Estimates of ATET

Matching Procedure       ATET    Standard Error
Nearest neighbor         1890         1202
Radius, r = 0.001        1824         1187
Radius, r = 0.0001       1973         1191
Radius, r = 0.00005      1928         1196
Radius, r = 0.00001      1893         1198

The robustness of ATET estimates across specifications can be evaluated in terms
of the ratio of ATET to the benchmark estimate, given in the last column of Table
25.6. With the exception of the stratification matching estimator, the ratio varies widely
over the two specifications. For example, the nearest-neighbor estimator is 133% of the
benchmark estimator in the Dehejia and Wahba (2002) specification, but only 31% in
the Dehejia and Wahba (1999) specification. Similarly, except for the kernel estimator,
the ATET estimates are sensitive to the propensity score used.
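
The last column of Table 25.6 is simple arithmetic, and recomputing it makes the comparison across specifications explicit. The dictionary layout and labels below are our own; the ATET values are copied from Table 25.6 and the benchmark is the $1,794 experimental estimate.

```python
BENCHMARK = 1794  # experimental benchmark estimate, in dollars

# ATET estimates from Table 25.6: (DW 2002 specification, DW 1999 specification)
atet = {
    "nearest neighbor": (2385, 560),
    "stratification": (1452, 2156),
    "kernel": (1309, 1518),
}

# Percentage of the benchmark, rounded as in the table.
pct = {
    method: tuple(round(100 * value / BENCHMARK) for value in pair)
    for method, pair in atet.items()
}

# Absolute gap between the two specifications, in percentage points.
gap = {method: abs(a - b) for method, (a, b) in pct.items()}

for method in atet:
    print(method, pct[method], gap[method])
```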

Whether matching methods work well depends on the suitability of the propensity
score model for the treatment and control groups (Dehejia and Wahba, 2002).
However,  there  is  clearly  an  interaction  between  the  methods  and  the  propensity  score 
model. 


25.9.  Bibliographic  Notes 

Early economic applications of matching and differences-in-differences methods to program
evaluation include Ashenfelter (1978) and Ashenfelter and Card (1985). Treatment evaluation
is currently a very active and fast-moving area of econometrics research.

25.2  Angrist  et  al.  (1996)  make  useful  connections  between  the  concepts  and  terminology  in 
the  medical  and  the  econometrics  literature. 

25.3  Heckman  and  Robb  (1985)  consider  the  estimation  of  program  impacts  in  a  variety  of  data 
settings,  in  the  presence  of  selection.  See  also  Bjorklund  and  Moffitt  (1987).  Heckman  and 
Hotz  (1989)  also  argue  strongly  that  one  needs  to  subject  the  results  to  several  specification 
tests  to  assess  their  robustness  and  to  evaluate  the  impact  of  selection  bias.  For  example, 
they  suggest  the  use  of  multiple  comparison  groups  to  evaluate  the  sensitivity  of  the  results 
based  on  a  single  control  group.  Most  of  this  earlier  work  is  parametric  in  approach.  More 
recently  nonparametric  methods  have  been  used  also. 

25.4  Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) study and apply matching
estimators. The important result concerning conditioning on the propensity score is
given in Rosenbaum and Rubin (1983, theorem 2). Efficient estimation of ATE using
estimated  propensity  scores  is  analyzed  in  Hirano,  Imbens,  and  Ridder  (2003).  Dehejia 
and  Wahba  (2002)  apply  propensity  score  matching  methods  to  a  variant  of  the  Lalonde 
(1986)  data  set.  The  experimental  data  are  matched  with  observations  from  the  CPS  and 
the  PSID.  Smith  and  Todd  (2004)  reanalyze  the  data  used  by  Dehejia  and  Wahba  using 
a  number  of  different  variants  of  propensity  score  estimators.  They  highlight  the  biases 
associated  with  alternative  propensity  score  estimators  and  emphasize  the  importance  of 
high-quality  data  in  bias  minimization.  Becker  and  Ichino  (2002)  provide  an  overview  of 
some  propensity  score  matching  estimators.  They  also  provide  a  set  of  STATA  programs, 
with  illustration,  that  can  be  used  for  estimating  ATET.  The  February  2004  issue  of  the 
Quarterly  Journal  of  Economics  includes  a  symposium  on  the  econometrics  of  matching. 

25.6  Hahn,  Todd,  and  Van  der  Klaauw  (2001)  analyze  identification  of  treatment  effects  in  the 
RD  model  under  weak  assumptions. 
25.7  Imbens and Angrist (1994) analyze the properties of the LATE estimator. Angrist et al.
(1996) discuss the use of IV methods and make a connection with the LATE measure of
treatment impact. The article is followed by a lively discussion that gives a spectrum of
views on the IV estimator as well as literature connections; see also Heckman (1997).
Angrist (2001) discusses some simple strategies for dealing with endogenous dummies in
nonlinear outcome models with nonnormal outcomes. The paper is followed by discussion
and comments that analyze the pros and cons of the linearized IV approach. There is a
lack of consensus on the most promising among the competing approaches. Heckman, Tobias,
and Vytlacil (2003) develop estimators for treatment effects within a latent variable
framework. Vella and Verbeek (1999) compare the IV approach with a control function
approach that includes a selection bias correction term.


- Exercises - 

25-1  (Adapted from Heckman, 1996) Consider the treatment-outcome model
$y = \mathbf{x}'\boldsymbol{\beta} + \alpha d + \varepsilon$, where $d$ is a binary indicator variable taking
the value 1 if treatment is assigned randomly and 0 if treatment is not assigned (also randomly).

(a)  Is randomized treatment a sufficient condition for identification of $\alpha$?

(b)  Is randomized treatment a sufficient condition for identification of $\alpha$ and $\boldsymbol{\beta}$?

25-2  In the previous problem randomization refers to treatment. Here we consider
randomized eligibility for receiving the treatment. Now $e = 1$ means that an individual
is randomly made eligible and $e = 0$ means randomly made ineligible.
Show that in this case, given $\Pr[d = 1|\mathbf{x}] \neq 0$, the treatment effect is given by
$(E[y|e = 1, \mathbf{x}] - E[y|e = 0, \mathbf{x}]) / \Pr[d = 1|\mathbf{x}]$.

25-3  Consider the nonlinear treatment outcome model $E[y|\mathbf{x}, d] = \exp(\mathbf{x}'\boldsymbol{\beta} + \alpha d)$,
where $d$ is a binary treatment indicator. Suppose that we have available consistent
estimates of $(\boldsymbol{\beta}, \alpha)$ and an estimated covariance matrix $\hat{V}[\hat{\boldsymbol{\beta}}, \hat{\alpha}]$. Assume
that the estimator is asymptotically normal. Outline a bootstrap or a Monte Carlo
algorithm for estimating the ATE parameter and its asymptotic variance given
$(\mathbf{x}_i, d_i)$, $i = 1, \ldots, N$.

25-4  Consider the nonlinear treatment outcome model $E[\ln y|\mathbf{x}, d] = \mathbf{x}'\boldsymbol{\beta} + \alpha d$,
where $d$ is a binary treatment indicator. Suppose that we have available consistent
estimates of $(\boldsymbol{\beta}, \alpha)$ and an estimated covariance matrix $\hat{V}[\hat{\boldsymbol{\beta}}, \hat{\alpha}]$. Suppose
we are interested in estimating the ATE in terms of $y$ rather than $\ln y$. Suggest
an estimation method and discuss its consistency property.

25-5  In  this  chapter  the  empirical  illustration  used  the  PSID  control  group  and  the 
NSW  treatment  group.  Dehejia  and  Wahba  (2002)  used  two  control  groups. 
There  is  another  control  group  available  based  on  the  CPS.  In  this  exercise 
you  will  be  asked  to  replicate  some  of  the  calculations  reported  here  using  the 
CPS  control  group  in  place  of  the  PSID  sample. 

(a)  Generate  a  table  similar  to  Table  25.3.  Compare  the  NSW  group  with  the 
CPS  controls  in  terms  of  age,  ethnic  composition,  educational  attainment, 
and  pretreatment  earnings. 

(b)  The  differences  between  the  treatment  and  control  groups  can  be  viewed 
using  the  estimated  propensity  score,  as  was  done  in  Section  25.8.  Using 
the  approach  of  Section  25.8.4  estimate  the  propensity  score  for  the 
NSW-CPS composite sample, incorporating the covariates linearly and
with higher order terms, as in Dehejia and Wahba (2002). Ignoring those
comparison units whose propensity scores are less than the minimum of
the treated units, compare the two sets of propensity scores using a histogram.
Comment on the goodness of match with comparison units in different
propensity score intervals (“bins”).

(c)  Using the matching methods described and implemented in Sections 25.8.4
and 25.8.5 (especially nearest-neighbor, stratification or interval matching,
kernel matching, and radius matching), construct a table similar to
Table 25.6. Comment on the estimates of ATET and compare them with
those based on the PSID comparison group.
Measurement Error Models


26.1.  Introduction 

Problems of measurement error pervade all econometrics. In microeconometrics,
common sources of measurement error are an incorrect response to a survey question,
incorrect coding of a correct response, and the use of a correctly measured
variable as a proxy for another theoretically valid but unobserved variable (e.g.,
using observed income as a proxy for “normal income”). Questions that seek sensitive
information may elicit partial or incorrect responses. More generally, measurement
error arises whenever unobservables (or latent variables) are replaced by
proxy variables.

Here  are  some  examples.  Consider  the  problem  of  testing  for  the  presence  of  gender 
bias  in  a  study  of  earnings.  The  obvious  approach  is  to  regress  a  measure  of  earnings 
on  a  categorical  gender  variable  while  controlling  for  qualifications,  age,  experience, 
and  so  forth.  However,  the  most  relevant  variable  may  be  an  individual’s  on-the-job 
productivity,  which  may  not  be  directly  observed  and  a  proxy  may  be  used  instead. 
Therefore, the impact of measurement error on inferences about gender discrimination
is an important issue. Studies of individual demand for goods and services
feature concepts such as “economic cost” or “full price of a service.” However, these
are rarely directly measured in published data and must be constructed by the econometrician
prior to model estimation. Inevitably their measurement is subject to error.

There  are  virtually  no  models  discussed  in  this  book  that  are  protected  from  the 
problem  of  measurement  errors.  Binary  outcome  endogenous  or  exogenous  variables 
are  potentially  subject  to  classification  errors;  transition  and  count  data  collected  from 
retrospective surveys are affected by recall errors; data on relatively unambiguous variables
such as hourly earnings and household expenditure are distorted by deliberate
exaggerations and/or reporting errors. Unlike aggregate data where aggregation may
result in some cancellation of measurement errors, for individual-level data measurement
errors persist.

In  the  first  part  of  this  chapter  we  study  the  consequences  of  measurement  errors 
and  estimation  strategies  for  remedying  the  consequences.  Both  linear  and  nonlinear 
models are considered. Although it is more realistic to acknowledge that the problem
usually  occurs  in  combination  with  others,  it  is  convenient  for  exposition  to  suppose 
that  the  only  problem  confronting  the  econometrician  is  measurement  error. 

Broadly speaking the consequence of errors of measurement is a failure to identify
the parameter of interest. The issue of fixing the problem is complex. One may
consider simply omitting the relevant variable in the model or substituting a proxy for
the true measure. There are at least two important reasons for not doing so except in
extreme cases. First, if the variable is of central interest, then omission leads to serious
omitted variable bias, so one is substituting one type of problem for another,
and identification is still not possible. Second, in a linear regression, using a proxy for
the latent variable will have smaller asymptotic bias than simply omitting the variable
from the model, provided the measurement errors are random and independent of the
true regressor (McCallum, 1972). Ignoring the variable provides inferior estimates.
However, using the proxy still gives inconsistent estimates even though the biases
are smaller.

The  essential  insight  underlying  the  solution  of  the  measurement  error  problem 
is  that  to  recover  the  parameter  of  the  latent  variable  and  to  identify  the  model,  it 
is  necessary  to  have  extraneous  information  in  the  form  of  additional  assumptions 
about  the  measurement  error  or  obtain  additional  data  and  to  use  this  information  after 
invoking  plausible  assumptions.  This  is  a  popular  approach.  However,  when  additional 
data  are  unavailable,  an  econometric  model  makes  a  good  alternative. 

Measurement  errors  have  potentially  very  serious  consequences  since  in  many  cases 
they  lead  to  regression  parameters  becoming  unidentified.  For  example,  Card  (2001) 
reviews  empirical  evidence  on  the  coefficient  of  schooling  on  earnings  and  finds  that 
the  typical  downward  bias  is  of  the  order  of  25-35%.  The  precise  consequences  of 
measurement  errors  may  depend  on  the  functional  form  of  the  model,  how  the  errors 
enter the model (e.g., additively or multiplicatively), and the data structure under consideration.
The solution of the problem resulting from measurement errors typically
requires  introduction  of  additional  information  into  the  model,  either  in  the  form  of 
additional  data  or  additional  assumptions. 

It is convenient to organize the discussion of measurement error models into separate
sections on linear and nonlinear models, and then to consider special cases.
Sections 26.2 and 26.3 are devoted to linear regression. Section 26.4 covers nonlinear
regression. Section 26.5 discusses some Monte Carlo examples. Essential insights
provided by linear models provide a useful basis for understanding the results for nonlinear
models. In all cases clearer results are usually available for specific models.


26.2.  Measurement  Error  in  Linear  Regression 

Measurement error in the regressors, also called errors-in-variables, is an important
topic  as  it  leads  to  inconsistency  of  the  OLS  estimator  even  if  the  measurement  error 
has  zero  mean.  Measurement  error  in  the  regressors  is  often  said  to  lead  to  bias,  but  we 
use  the  stronger  term  inconsistency  as  the  bias  does  not  disappear  as  the  sample  size 
goes  to  infinity. 
Measurement error models have a broad scope and they cover situations in which
the measurement error affects the right-hand-side variables (“regressors”), or the
left-hand-side variable (“outcome”), or both. Hausman (2001) refers to them as “problems
from the right” and “problems from the left.” In the former case, usually referred to as
the classic errors-in-variables model, the relationship of interest is between the outcome
y and covariates (W, X*), where W is measured without error and X* is not observed
but a proxy for it, denoted X, is available. The question of interest is whether an
estimated relation between y and (W, X) provides a satisfactory basis for inference
regarding X*.

In the statistical literature it is conventional to distinguish between the functional
and structural approaches to measurement error models. If X* denotes the true unobserved
covariates, then the functional approach regards these as unknown fixed constants
(parameters). In the structural approach they are treated as random variables.
Carroll, Ruppert, and Stefanski (1995) further distinguish between functional modeling,
in which only minimal assumptions are made about the X*s, regardless of whether
they are fixed or random, and structural modeling, in which parametric assumptions are
made regarding the distribution of the X*s. Functional measurement error models are
examples of models with infinitely many nuisance parameters for which the maximum
likelihood method has well-known deficiencies (discussed in the panel data chapters).
This distinction is less common in the econometrics literature.

The  magnitude  of  the  inconsistency  can  be  substantial  in  applications.  There  is  a 
particularly  extensive  discussion  of  measurement  error,  and  ways  to  control  for  it,  in 
econometric  studies  of  the  determinants  of  individual  earnings. 


26.2.1.  Classical  Measurement  Error  Model 

The standard measurement error model has a continuous dependent variable y that is a
linear function of K true regressors x*. An additive measurement error in y may cause
no problems if it is uncorrelated with the regressors because it can be absorbed into
the error of the equation. If x* were observed then the parameters could be consistently
estimated by OLS regression of y on x*,

$$y_i = \mathbf{x}_i^{*\prime}\boldsymbol{\beta} + u_i,$$

where the $u_i$ are iid $[0, \sigma^2]$. Instead, the observed data are $\mathbf{x}_i \neq \mathbf{x}_i^*$, and y is regressed
on x rather than on x*. The relationship between the true and observed regressors is
postulated to be

$$\mathbf{x}_i = \mathbf{x}_i^* + \mathbf{v}_i, \quad i = 1, \ldots, N, \tag{26.1}$$

where the additive measurement errors are assumed to be distributed as

$$\mathbf{v}_i \sim [\mathbf{0}, \boldsymbol{\Sigma}_{vv}]. \tag{26.2}$$

The unobserved true regressors are assumed to have mean zero, so that variables are
measured as deviations from the mean, and to have variance matrix

$$V[\mathbf{x}_i^*] = \boldsymbol{\Sigma}_{x^*x^*}. \tag{26.3}$$

Note that x is an unbiased estimate of x*, since the measurement error v is assumed to
have mean zero. The measurement error is assumed to be independent of both x* and
the regression error u,

$$E[\mathbf{v}_i|\mathbf{x}_i^*] = E[\mathbf{v}_i|u_i] = \mathbf{0}. \tag{26.4}$$


26.2.2.  Inconsistency  of  OLS 

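To consider the consequences of measurement error it is helpful to write the assumed
dgp for the classical measurement error model in matrix notation as

$$\mathbf{y} = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}, \tag{26.5}$$

$$\mathbf{X} = \mathbf{X}^* + \mathbf{V},$$

where u, the equation error, obeys the conditions $E[\mathbf{u}|\mathbf{X}^*] = \mathbf{0}$ and $E[\mathbf{u}\mathbf{u}'|\mathbf{X}^*] = \sigma^2\mathbf{I}_N$.
Substituting the second equation into the first yields

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + (\mathbf{u} - \mathbf{V}\boldsymbol{\beta}). \tag{26.6}$$

An OLS regression of y on X will lead to an inconsistent estimate of $\boldsymbol{\beta}$, since the error
term $(\mathbf{u} - \mathbf{V}\boldsymbol{\beta})$ is correlated with the regressor X through the measurement error V.
Formally, we have

$$\begin{aligned}
\operatorname{plim} N^{-1}\mathbf{X}'(\mathbf{u} - \mathbf{V}\boldsymbol{\beta}) &= \operatorname{plim} N^{-1}(\mathbf{X}^* + \mathbf{V})'(\mathbf{u} - \mathbf{V}\boldsymbol{\beta}) \\
&= -\boldsymbol{\Sigma}_{vv}\boldsymbol{\beta} \\
&\neq \mathbf{0},
\end{aligned}$$

using $N^{-1}\mathbf{V}'\mathbf{V} = N^{-1}\sum_i \mathbf{v}_i\mathbf{v}_i'$ and $\mathbf{v}_i$ iid $[\mathbf{0}, \boldsymbol{\Sigma}_{vv}]$. This is the essential source of
inconsistency. Now

$$\begin{aligned}
\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X} &= \operatorname{plim} N^{-1}(\mathbf{X}^* + \mathbf{V})'(\mathbf{X}^* + \mathbf{V}) \\
&= \boldsymbol{\Sigma}_{x^*x^*} + \boldsymbol{\Sigma}_{vv},
\end{aligned}$$

where we have used the iid property of x* with mean zero and $V[\mathbf{x}^*] = \boldsymbol{\Sigma}_{x^*x^*}$. Also,

$$\begin{aligned}
\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{y} &= \operatorname{plim} N^{-1}(\mathbf{X}^* + \mathbf{V})'(\mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}) \\
&= \boldsymbol{\Sigma}_{x^*x^*}\boldsymbol{\beta} \\
&\neq \mathbf{0},
\end{aligned}$$

so that, applying Slutsky's theorem (Appendix A, Theorem A.3), we get

$$\begin{aligned}
\operatorname{plim} \hat{\boldsymbol{\beta}} &= (\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X})^{-1} \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{y} \\
&= (\boldsymbol{\Sigma}_{x^*x^*} + \boldsymbol{\Sigma}_{vv})^{-1}\boldsymbol{\Sigma}_{x^*x^*}\boldsymbol{\beta} \\
&= \boldsymbol{\beta} - (\boldsymbol{\Sigma}_{x^*x^*} + \boldsymbol{\Sigma}_{vv})^{-1}\boldsymbol{\Sigma}_{vv}\boldsymbol{\beta}.
\end{aligned} \tag{26.7}$$

Clearly, OLS is inconsistent as long as there are measurement errors and $\boldsymbol{\Sigma}_{vv} \neq \mathbf{0}$.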
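
A quick simulation can be used to check the probability limit in (26.7). The sketch below is illustrative only: the parameter values, the normal distributions, and the sample size are assumptions of ours, chosen so that the sampling error is small relative to the inconsistency.

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 200_000

# Hypothetical population values for the illustration.
beta = np.array([1.0, -0.5])
Sigma_xx_star = np.array([[1.0, 0.3], [0.3, 2.0]])  # V[x*]
Sigma_vv = np.array([[0.5, 0.0], [0.0, 0.25]])      # V[v]

# Classical measurement error dgp: y = X* beta + u, X = X* + V.
X_star = rng.multivariate_normal(np.zeros(2), Sigma_xx_star, size=N)
V = rng.multivariate_normal(np.zeros(2), Sigma_vv, size=N)
u = rng.normal(0.0, 1.0, size=N)
y = X_star @ beta + u
X = X_star + V

# OLS of y on the mismeasured regressors.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Probability limit implied by (26.7).
plim_ols = beta - np.linalg.solve(Sigma_xx_star + Sigma_vv, Sigma_vv @ beta)

print("OLS estimate:", beta_ols)
print("plim (26.7): ", plim_ols)
```

In large samples the OLS estimate settles near the value implied by (26.7) rather than near the true coefficient vector.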

For later reference note that if we have available a consistent estimate of $\boldsymbol{\Sigma}_{vv}$,
denoted $\mathbf{S}_{vv}$, and if $(\mathbf{X}'\mathbf{X} - \mathbf{S}_{vv})$ is positive definite, then the adjusted least-squares
estimator $\hat{\boldsymbol{\beta}}_a = (\mathbf{X}'\mathbf{X} - \mathbf{S}_{vv})^{-1}\mathbf{X}'\mathbf{y}$ can be computed. This formula can also be used to
study the impact of hypothetical values of measurement error variances on the least-
squares estimator.
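
A minimal sketch of the adjusted least-squares idea follows. Two assumptions are ours and should be read as such: the function name `adjusted_ls` is hypothetical, and the measurement error variance estimate is multiplied by N so that it is on the same scale as X'X (equivalently, the correction is applied to N^{-1}X'X). For illustration the true error variance is treated as known.

```python
import numpy as np


def adjusted_ls(X, y, Sigma_vv_hat):
    """Adjusted least squares: subtract the measurement error moment from X'X.

    Sigma_vv_hat is a per-observation variance matrix; it is scaled by N here
    (an assumption of this sketch) to be commensurate with X'X.
    """
    N = X.shape[0]
    A = X.T @ X - N * Sigma_vv_hat
    # Cholesky fails (LinAlgError) unless A is positive definite.
    np.linalg.cholesky(A)
    return np.linalg.solve(A, X.T @ y)


rng = np.random.default_rng(777)
N = 200_000
beta = np.array([1.0, -0.5])
Sigma_xx_star = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_vv = np.array([[0.5, 0.0], [0.0, 0.25]])

X_star = rng.multivariate_normal(np.zeros(2), Sigma_xx_star, size=N)
X = X_star + rng.multivariate_normal(np.zeros(2), Sigma_vv, size=N)
y = X_star @ beta + rng.normal(size=N)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_adj = adjusted_ls(X, y, Sigma_vv)  # Sigma_vv treated as known here

print("OLS:     ", beta_ols)
print("adjusted:", beta_adj)
```

Replacing `Sigma_vv` by a range of hypothetical values gives a simple sensitivity analysis of the least-squares estimate to assumed measurement error variances.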


26.2.3.  Measurement  Error  with  a  Scalar  Regressor 

A special case of this model that routinely features in textbooks involves the case of
a single true or unobserved regressor $x^*$ with variance $\sigma_{x^*}^2$, observed value $x$, zero-
mean measurement error $v$, and associated variance $\sigma_v^2$. That is, the regression is
$y = \beta x^* + u$, where $E[u|x^*] = 0$, $V[u|x^*] = \sigma_u^2$, and $\mathrm{Cov}[v, u] = 0$, but in estimating the
regression $x^*$ is replaced by the observed variable $x$.

In this case, (26.7) specializes to

$$\begin{aligned}
\operatorname{plim} \hat{\beta} &= \frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2}\,\beta \\
&= \frac{1}{1 + \sigma_v^2/\sigma_{x^*}^2}\,\beta \\
&= \beta[1 - s/(1 + s)],
\end{aligned} \tag{26.8}$$

where $s = \sigma_v^2/\sigma_{x^*}^2$ is often referred to as the noise-to-signal ratio and the entire
term $(1 + s)^{-1}$ is referred to as the reliability ratio. Asymptotically $\hat{\beta}$ is
biased toward zero to an extent that depends directly on the noise-to-signal ratio. This
bias is also called attenuation bias. The terminology is intuitive since it suggests that
a researcher's estimate of the marginal impact of change in $x^*$ on $y$ is attenuated by
the presence of measurement error in $x^*$.

Note  also  that 


$$V[y|x] = \sigma_u^2 + \frac{\beta^2\sigma_v^2\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2} > \sigma_u^2.$$


This  implies  that  measurement  errors  not  only  cause  attenuation  bias  but  they  also 
inflate  the  equation  error  variance.  Unambiguously,  a  reduction  in  the  variance  of  the 
measurement  error  will  reduce  the  residual  variance  of  the  equation. 
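
The variance inflation can be traced in two steps. The following is a sketch that assumes the conditional mean of $x^*$ given $x$ is linear (as holds, for example, under joint normality of $x^*$, $v$, and $u$); the symbols $\sigma_u^2$, $\sigma_v^2$, and $\sigma_{x^*}^2$ denote the variances of the equation error, the measurement error, and the true regressor.

```latex
\begin{align*}
V[x^* \mid x] &= \sigma_{x^*}^2 - \frac{\sigma_{x^*}^4}{\sigma_{x^*}^2 + \sigma_v^2}
               = \frac{\sigma_{x^*}^2 \sigma_v^2}{\sigma_{x^*}^2 + \sigma_v^2}, \\
V[y \mid x]   &= \beta^2 V[x^* \mid x] + \sigma_u^2
               = \sigma_u^2 + \frac{\beta^2 \sigma_{x^*}^2 \sigma_v^2}{\sigma_{x^*}^2 + \sigma_v^2}, \\
\frac{\partial V[y \mid x]}{\partial \sigma_v^2}
              &= \frac{\beta^2 \sigma_{x^*}^4}{(\sigma_{x^*}^2 + \sigma_v^2)^2} \;\geq\; 0 .
\end{align*}
```

The last line shows why a smaller measurement error variance cannot increase the residual variance of the equation.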

Had an intercept term been included in the bivariate regression just presented, this
would bias upward the least-squares estimator of the intercept, $\bar{y} - \hat{\beta}\bar{x}$, where $(\bar{y}, \bar{x})$
are sample averages that are still consistent estimates of the respective population
means. Cragg (1994) suggests the term “contamination bias” for this effect of measurement
error on another regression parameter in the equation.

As an example, consider regression of log hourly wage on years of schooling. Suppose
years of schooling $x^*$ are measured with error, and assume that the standard deviation
of true years of schooling is 2 and the standard deviation of the measurement
error is 1, so that $\sigma_{x^*}^2 = 4$, $\sigma_v^2 = 1$, and $\sigma_x^2 = 5$. Then $\operatorname{plim} \hat{\beta} = 0.8 \times \beta$. For example,
an OLS estimated slope coefficient of 0.04 means that one more year of school is
actually associated with a 5% higher wage rather than a 4% higher wage.
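
The arithmetic of the schooling example can be reproduced directly. The helper name `reliability_ratio` is ours; the inputs are the standard deviations assumed in the example.

```python
def reliability_ratio(var_true, var_error):
    """Reliability ratio 1/(1+s), with s the noise-to-signal ratio."""
    return var_true / (var_true + var_error)


sd_true, sd_error = 2.0, 1.0            # values assumed in the example
lam = reliability_ratio(sd_true**2, sd_error**2)

beta_ols = 0.04                          # estimated slope with mismeasured schooling
beta_implied = beta_ols / lam            # slope implied for true schooling

print(f"reliability ratio = {lam:.2f}")
print(f"implied return to schooling = {beta_implied:.3f}")
```

Dividing the OLS slope by the reliability ratio undoes the attenuation, which is how the 4% estimate translates into a 5% return.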
26.2.4.  Extensions


In  extensions  and  generalizations  of  this  simple  but  elegant  result,  researchers  often 
ask  if  attenuation  bias  is  a  general  feature  of  measurement  error  models,  and  what  if 
anything  is  attenuated.  Although  the  result  does  not  necessarily  carry  over  to  more 
general  models,  it  does  provide  a  useful  benchmark.  Hausman  (2001)  has  called  the 
attenuation  bias  caused  by  measurement  error  the  “Iron  Law  of  Econometrics.” 

If  the  measurement  error  is  assumed  to  be  uncorrelated  with  the  true  unobserved 
value, the measurement error is said to be “classical.” Although convenient, this assumption
may not hold. Indeed in some cases it cannot hold. For example, if x is a
binary  0/1  variable,  the  measurement  error  will  be  a  classification  error.  If,  owing  to 
misclassification,  a  0  is  measured  as  a  1,  and  vice  versa,  then  the  measurement  error 
must  be  correlated  with  the  true  value. 

When there is more than one regressor, let $\mathbf{X}^* = [\mathbf{x}^* \ \mathbf{Z}]$, where as in the preceding
case we assume that only one regressor is observed with measurement error, that is,
$x = x^* + v$. Then the expression for the least-squares estimator of the coefficient of $x$
becomes

$$\operatorname{plim} \hat{\beta}_{x|z} = \frac{\sigma_{x^*}^2(1 - R_{x^*,z}^2)}{\sigma_{x^*}^2(1 - R_{x^*,z}^2) + \sigma_v^2}\,\beta, \tag{26.9}$$

where $R_{x^*,z}^2$ denotes the $R^2$ in the auxiliary regression of $x^*$ on $\mathbf{Z}$. The formula (26.9)
is essentially the same as (26.8), provided we reinterpret the variance of $x^*$ to mean the
variance after controlling for or removing the linear influence of $\mathbf{Z}$ on $x^*$. Once again
the inconsistency of the least-squares estimator is toward zero, and $\beta$ is now multiplied
by a smaller factor than in the single-regressor case, so the attenuation is more severe
whenever $R_{x^*,z}^2 > 0$. The coefficients of the regressors measured
without error are also inconsistent, in a direction that depends on $\boldsymbol{\Sigma}_{x^*x^*}$ (Levi,
1973). This effect can once again be thought of as contamination bias. The attenuation
bias that is demonstrated in these special cases depends critically on the assumption of
additive measurement errors.
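
A small simulation illustrates both effects: the stronger attenuation of the coefficient on the mismeasured regressor and the contamination of the coefficient on the correctly measured one. All numerical values below are hypothetical, and both regressors are given unit variance to keep the algebra transparent.

```python
import numpy as np

rng = np.random.default_rng(2024)
N = 400_000

beta, gamma = 1.0, 1.0   # true coefficients on x* and z (illustrative values)
var_v = 1.0              # measurement error variance
rho_xz = 0.6             # Cor[x*, z]; both have unit variance

z = rng.normal(size=N)
x_star = rho_xz * z + np.sqrt(1.0 - rho_xz**2) * rng.normal(size=N)
x = x_star + rng.normal(scale=np.sqrt(var_v), size=N)
y = beta * x_star + gamma * z + rng.normal(size=N)

# OLS of y on the mismeasured x and the correctly measured z (no intercept:
# all variables have mean zero by construction).
W = np.column_stack([x, z])
b_x, b_z = np.linalg.solve(W.T @ W, W.T @ y)

# Attenuation factors: multiple-regressor case versus single-regressor case.
r2 = rho_xz**2                         # R^2 from regressing x* on z
var_resid = 1.0 * (1.0 - r2)           # variance of x* net of z
factor_multi = var_resid / (var_resid + var_v)
factor_single = 1.0 / (1.0 + var_v)

print("coefficient on x:", b_x, " predicted:", factor_multi * beta)
print("coefficient on z:", b_z, " true value:", gamma)
print("single-regressor factor:", factor_single)
```

With positive correlation between the true regressor and z, part of the effect of the mismeasured variable is loaded onto z, so its coefficient is pushed away from the true value.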

When more than one regressor is measured with error, general results on the direction
of the inconsistency are no longer available, though in any given problem they
can be determined given knowledge of $\boldsymbol{\Sigma}_{x^*x^*}$ and $\boldsymbol{\Sigma}_{vv}$. Most studies consider measurement
error in only one regressor, in which case the inconsistency is toward zero. The
intuition from the foregoing examples is that if the measurement errors on different
regressors are independent, then each source will contribute to the attenuation bias of
its “own” coefficient, and all will contribute to the inflation bias of the conditional
variance. Cragg (1994) analyzes a multiple regression model with measurement errors
and shows the interactions among biases from different sources.


26.2.5.  Measurement Error in Linear Panel Models

The  effects  of  measurement  error  in  regressors  are  compounded  when  panel  data  are 
used. 
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Assume  a  pooled  panel  model  yit  =  fix*  +  uit,  where  we  observe  x,t  =  x*  +  %, 
and  a  scalar  regressor  is  assumed  for  simplicity.  The  preceding  results  still  hold  if  we 
estimate  a  single  cross  section.  However,  if  we  estimate  using  more  than  one  year  of 
data  for  each  individual  we  need  to  adapt  the  previous  results,  since  the  regressor  x* 
will  most  likely  be  positively  correlated,  rather  than  independent  over  t  for  given  i .  For 
example,  if  we  do  the  first-differences  regression 

A  yit  =  fiAx*  +  A  uit 

-  fiAxn  +  A uit  -  fiAvu 


(see  Section  21.6)  and  define  p  =  Cor[x;* ,  ■**,_,],  then 
/  1  i  1 

plim  fi  =  fi>  + 


(  x  N  \  /  1  N  \ 

^plim  —  ^(A xit)2J  ^plim  —  '^JAxl,Ault  -  /3AxitAvit)J 


2^7 


2(1  -  p)a2x,  +  2a 
Pav 

(1  -  p)al«  +  a „2  ’ 


using  V[Au(f]  =  2V[u,r]  and  V[Av*J  =  2(1  -  p)V[x*]. 

The inconsistency is larger than in the cross-section case if $\rho > 0$. Moreover, as
$\rho \to 1$, as can happen with panel data, the inconsistency becomes very large. This
inconsistency can be decreased by using differences that are $m > 1$ lags apart because
$\mathrm{Cor}[x^*_{it}, x^*_{i,t-m}]$ will be decreasing in $m$.
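
The size of this effect can be made concrete with a small calculation. The sketch below evaluates the ratio plim(beta_hat)/beta, which equals 1 - sigma_v^2 / ((1 - rho) sigma_x*^2 + sigma_v^2) for first differences and sigma_x*^2 / (sigma_x*^2 + sigma_v^2) for a single cross section. It assumes serially uncorrelated classical measurement error; the function names and the variance values are illustrative choices, not taken from the text.

```python
def fd_attenuation(rho, var_xstar, var_v):
    """Return plim(beta_hat) / beta for OLS on first differences.

    Assumes classical measurement error that is serially uncorrelated,
    so V[dv] = 2 * var_v and V[dx*] = 2 * (1 - rho) * var_xstar.
    """
    return 1.0 - var_v / ((1.0 - rho) * var_xstar + var_v)


def cross_section_attenuation(var_xstar, var_v):
    """Return plim(beta_hat) / beta for OLS on a single cross section."""
    return var_xstar / (var_xstar + var_v)


if __name__ == "__main__":
    var_xstar, var_v = 1.0, 0.25  # illustrative values, not from the text
    print("cross section:", cross_section_attenuation(var_xstar, var_v))
    for rho in (0.0, 0.5, 0.9, 0.99):
        print("rho =", rho, "first differences:",
              fd_attenuation(rho, var_xstar, var_v))
```

With rho = 0 the two ratios coincide, and the first-differences ratio falls toward zero as rho approaches one.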


26.3.  Identification  Strategies 

It  is  conventional  to  say  that  without  additional  assumptions  the  errors-in-variables 
model  is  not  identified.  This  statement  can  be  interpreted  as  follows  in  the  context  of 
the special case of the bivariate model. An estimated value of $\beta$, or more precisely its
probability limit, is consistent with infinitely many different combinations of $\beta$ and
$s$, the noise-to-signal ratio. If, however, additional assumptions or information can be
brought  to  bear  on  the  problem,  it  may  be  possible  to  rule  out  some  combinations  of 
the  underlying  parameters  that  are  consistent  with  the  observed  data  distribution.  If 
the  additional  restrictions  are  just  sufficient  to  obtain  a  unique  solution,  the  model  is 
said  to  be  exactly  identified.  If  the  additional  restrictions  are  more  than  sufficient  to 
uniquely  identify  the  model  parameters,  the  model  is  said  to  be  overidentified. 

A general identification strategy for the measurement error model is to obtain
bounds rather than point estimates of the parameters of interest if there is no further a
priori information or data. If additional data and/or information about measurement error
are available then additional identification strategies, such as instrumental variables
estimation or identification through moment restrictions, become feasible. Additional
information about the measurement error is a broad concept that includes one of the
oldest identification strategies, one using instrumental variables that link the true
unobserved variables to their observable counterparts. For example, additional information
may yield a consistent estimator for the attenuation factor, $\sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2)$,
making it possible to adjust the inconsistent estimate for the bias. Finally, replicated
data or validation data may be available, and these can yield useful information
about the moments of measurement error. These possibilities are analyzed in the
following.


26.3.1.  Setting  Bounds  on  Regression  Parameters 

Reconsider the multiple regression problem of Section 26.2. The model given there
is subject to the requirement that the variances $\Sigma_{x^*x^*}$, $\Sigma_{vv}$, and $\sigma^2$ must be positive
semidefinite. This together with the orthogonality conditions of estimation can be used
to place some bounds on the region in which the coefficients must lie. Klepper and
Leamer (1984) and Wansbeek and Meijer (2000) consider the problem in some generality.
A more accessible special case of the bounds approach is the reverse regression
approach considered next.


Reverse  Regression 

In a simple bivariate regression model with variables $(y, x)$, direct regression refers
to the regression of $y$ on $x$, whereas reverse regression refers to the regression of
$x$ on $y$. In the general multivariate regression case with $K$ covariates, there is only
one direct regression but there are $K$ reverse regressions. Each reverse regression
has a mismeasured exogenous variable on the left-hand side and the remaining exogenous
variables and $y$ on the right-hand side. In the bivariate regression case with
measurement errors, it is easy to show that the estimated slope coefficients from the
direct and reverse regressions place lower and upper bounds on the value of the true
slope coefficient. This is a potentially useful result in analyzing the effects of measurement
errors. Leamer (1978) provides an excellent discussion of the logic of reverse
regression.

First, we consider the logic of reverse regression by reference to a simple bivariate
regression model with measurement errors:

$$y = \beta x^* + u, \tag{26.10}$$

$$x = x^* + v,$$

where $u$ is the regression error and $v$ is the measurement error that accounts for the
difference in the observed variable $x$ and the error-free measure $x^*$ that enters the
regression. We will assume that $u \sim \mathcal{N}[0, \sigma_u^2]$ and $v \sim \mathcal{N}[0, \sigma_v^2]$.

Following the structural approach of Solari (1969) (and Leamer, 1978), treat $x^*$
as unknown parameters in the likelihood function. The joint likelihood given data
$(y, x)$ is

$$L(x^*, \beta, \sigma_u^2, \sigma_v^2) \propto (\sigma_u^2)^{-N/2} \exp\left[-\frac{1}{2\sigma_u^2}(y - \beta x^*)'(y - \beta x^*)\right] \times (\sigma_v^2)^{-N/2} \exp\left[-\frac{1}{2\sigma_v^2}(x^* - x)'(x^* - x)\right]. \tag{26.11}$$


This function is not defined at points that satisfy the conditions $\sigma_v^2 = 0$ and $x^* = x$,
or the conditions $\sigma_u^2 = 0$ and $y = \beta x^*$. If we simply maximize the well-defined parts
of this likelihood subject to the constraints we get two scalar regression parameters,
$\beta_D = y'x/x'x$ for the direct regression and $\beta_R = y'y/x'y$ for the reverse regression.
To aid intuition, notice that if $x$ is measured without error then $y$ is stochastic and $x$
is not, so direct regression has a meaningful conditional expectation interpretation,
and if only $x$ is stochastic (measured with error), then the conditional expectation
$E[x|y]$ is meaningful, because the two-equation system then reduces to $x = (1/\beta)y - u/\beta + v$.
That is, the reverse regression of $x$ on $y$ produces the least-squares estimate of $1/\beta$,
and $\beta_R$ is its reciprocal. It is straightforward to verify that

$$r^2_{xy}\,\beta_R = \beta_D, \tag{26.12}$$

$$\beta_D < \beta < \beta_R,$$

where $r^2_{xy}$ is the simple squared correlation between $x$ and $y$; the bounds indicate that
$\beta_D$ is a downward biased estimate of $\beta$ and $\beta_R$ is an upward biased estimate. Note
that these bounds can be very broad in microeconomic data where $r^2_{xy} < 0.5$ is almost
always the case and even $r^2_{xy} < 0.1$ is quite common.
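
A short simulation can illustrate the bounds. The sketch below takes the direct estimator to be the slope from regressing y on x and the reverse estimator to be the reciprocal of the slope from regressing x on y, both computed from deviations about sample means. The function name, sample size, and parameter values are hypothetical choices made only for illustration.

```python
import random


def direct_and_reverse(n=20000, beta=1.0, sd_v=0.7, sd_u=1.0, seed=0):
    """Simulate y = beta*x_star + u, x = x_star + v and return
    (beta_D, beta_R, r2) computed from deviations about sample means."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        x_star = rng.gauss(0.0, 1.0)
        x.append(x_star + rng.gauss(0.0, sd_v))      # mismeasured regressor
        y.append(beta * x_star + rng.gauss(0.0, sd_u))
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta_d = sxy / sxx           # slope from regressing y on x
    beta_r = syy / sxy           # reciprocal of the slope from regressing x on y
    r2 = sxy ** 2 / (sxx * syy)  # squared sample correlation of x and y
    return beta_d, beta_r, r2


if __name__ == "__main__":
    beta_d, beta_r, r2 = direct_and_reverse()
    print("direct:", beta_d, "reverse:", beta_r, "r2 * reverse:", r2 * beta_r)
```

The identity r2 * beta_R = beta_D holds exactly in any sample, so a low squared correlation translates directly into a wide interval between the two estimates.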

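Leamer (1978) considers the model in which $(y, x)$ has a bivariate normal distribution
with mean $(\beta\bar{x}^*, \bar{x}^*)$ and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_u^2 + \beta^2\sigma_{x^*}^2 & \beta\sigma_{x^*}^2 \\ \beta\sigma_{x^*}^2 & \sigma_{x^*}^2 + \sigma_v^2 \end{bmatrix}. \tag{26.13}$$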


He shows (Leamer, 1978, pp. 239-240) that the likelihood function for this model
attains its maximum at any value of $\beta$ between the direct regression estimator $\beta_D$ and
the reverse regression estimator $\beta_R$.

The foregoing analysis suggests that even though $\beta$ is not identified, consistent
bounds can be placed on its value. This is a potentially useful application of bounds
identification. The result can be extended in a straightforward manner to the case of
multiple regression in which only one regressor is measured with error (Bollinger,
2003). Klepper and Leamer (1984) consider an extension to the multiple regression
case of $K$ regressors, all of which are measured with error. There is one direct regression
and $K$ reverse regressions. After estimation each reverse fitted regression is
renormalized with a unit coefficient for $y$ on the left-hand side. Then $\beta_D$ is the estimated
vector from the direct regression, and $\beta_{R,j}$ ($j = 1, \ldots, K$) is the vector from
the $j$th reverse regression. By the results of Klepper and Leamer (1984), if the direct
and reverse regression coefficient vectors are all in the same orthant then the set of
feasible values of $\beta$ is the convex hull of the direct and reverse regressions; that is,
$\beta \in \{\beta \mid \beta = \lambda_D\beta_D + \lambda_1\beta_{R,1} + \cdots + \lambda_K\beta_{R,K}\}$, where the $\lambda$-weights are nonnegative
and sum to one. The smallest coefficient in the direct and reverse regression vectors is
the lower bound, and the largest coefficient is the upper bound. These bounds do not
exist if the coefficient changes its sign.

In addition to the work of Klepper and Leamer (1984), there are several studies
that  use  these  ideas  in  applied  contexts.  Greene  (1983)  and  Goldberger  (1984)  apply 
reverse  regression  to  measurement  of  salary  discrimination.  Bollinger  (2003)  analyzes 
measurement  of  the  black-white  wage  gap  in  a  model  of  wages  and  human  capital. 
Bollinger  (1996)  applies  the  bounds  approach  to  the  case  of  regression  on  a  categorical 
dummy  variable  in  which  observation  categories  are  misclassified. 

26.3.2.  Identification  Using  Instrumental  Variables 

One solution to the identification problem is to introduce one or more moment restrictions
that constitute further identifying information. A moment restriction typically
states that there is available an instrumental variable that is correlated with, or causally
related to, the variable that is measured with error. Moreover, this variable is uncorrelated
with, or causally unconnected with, the outcome that is being modeled. Adding
this restriction to the original model helps in principle to solve the identification
problem.

Historically, the IV estimator was suggested as a potential solution for the measurement
error problem in linear models (Reiersøl, 1941; Durbin, 1954). The IV approach
is similarly motivated when one or more variables on the right-hand side are endogenous
and hence correlated with the regression error. The linear simultaneous equation
model and the linear measurement error model are isomorphic and hence the use of
IV-type estimators in the context of measurement errors is natural.

Reconsidering the linear IV model of Sections 4.8 and 6.4, where $y = X\beta + u$
and $E[u|X] \neq 0$, we can use the 2SLS estimator if a valid set of instruments $Z$,
with $\dim[Z] \geq \dim[X]$, is available.

One can test for the presence of measurement error using a Hausman test of endogeneity
of regressors; see Section 8.3. Several variants of the test are available, and
one variant was given in Section 8.4.

A major problem in implementing the IV estimator lies in the practical difficulty
of finding valid instruments. Good instruments have two properties: zero correlation
with equation errors (for consistency) and high correlation with variables being instrumented
(for efficiency). Such instruments are not typically easy to find. Although
ideally one should explicitly derive valid instruments from detailed specification of
relationships between regressors and covariates, in practice ad hoc methods are common.
Unlike the full system specification approach, the ad hoc method is simpler and
less demanding. Notice that the conditions for the validity of instruments do not create
an automatic procedure for selecting one. These technical conditions could be satisfied
by a variable that is causally unconnected with the phenomenon under study. One has
to think of a variable that correlates strongly with the regressor(s) and is uncorrelated
with the equation error. A number of interesting applications of this idea are available
in the literature; see, for example, Angrist (1990). If selected, the use of such an
instrumental variable may be controversial and puzzling.

We consider several possible instruments for the cross-section regression of earnings
on schooling example. First, if data are available on siblings then the schooling
level of a sibling may be used as an instrument, since the education levels of siblings
are likely to be correlated. Consistency of the IV estimate then requires no correlation
between the measurement error $v$ and any measurement error in schooling of the
sibling. Second, more generally other variables related to schooling such as parents’
educational level or income may be used. Casting a broader net, however, runs the risk
of leading to instruments that are only weakly correlated with $x$, leading to imprecision
and possible poor finite-sample properties of the IV estimator. Third, more than one
question on schooling level may have been asked in the survey, or schooling level may
be available from surveys in other years if data are from a panel study. Such instruments
are likely to be highly correlated with $x$, but the assumption of no correlation between
measurement errors in $x$ and $z$ may be more difficult to believe in this example.
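
The role of the no-correlation assumption can be illustrated by simulation. The sketch below treats a second report z of the same variable as the instrument for x and compares OLS with the simple IV estimator sum(z*y)/sum(z*x). All variables have zero population mean, so no intercept is used. The `shared` parameter, which makes the two measurement errors correlated, and all numerical values are hypothetical.

```python
import random


def ols_and_iv(n=20000, beta=1.0, sd_v=0.7, sd_w=0.7, shared=0.0, seed=1):
    """Return (beta_OLS, beta_IV) when a second report z instruments for x.

    x = x_star + v is the regressor and z = x_star + shared*v + w is the
    second report; shared != 0 makes the two measurement errors correlated.
    """
    rng = random.Random(seed)
    sxx = sxy = szx = szy = 0.0
    for _ in range(n):
        x_star = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, sd_v)
        w = rng.gauss(0.0, sd_w)
        x = x_star + v
        z = x_star + shared * v + w
        y = beta * x_star + rng.gauss(0.0, 1.0)
        sxx += x * x
        sxy += x * y
        szx += z * x
        szy += z * y
    return sxy / sxx, szy / szx


if __name__ == "__main__":
    print("independent errors:", ols_and_iv())
    print("correlated errors: ", ols_and_iv(shared=0.8))
```

With independent errors the IV estimate is close to the true slope while OLS is attenuated; once the errors share a common component the IV estimate is attenuated as well.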

Lagged variables are frequently used as instruments, but these too will have measurement
errors, so the approach is minimally satisfactory only if serial correlation in
measurement error is not a problem.

The effect of measurement error can be large in the panel context. Since panel data
provide measures of $x^*_{it}$ in multiple periods, instrumental variables estimation can be
used to provide consistent parameter estimates assuming uncorrelated measurement
errors across the time periods. See Hsiao (1986, pp. 63-65).

26.3.3.  Identification  via  Additional  Moment  Restrictions 

Distributional assumptions about the equation and measurement errors $(u, v)$ can secure
identification. There is one important case in which the identification is aided
instead by information or assumption about the distribution of the unobserved true
value of the mismeasured variable. The assumption of joint multivariate normality of
$(y, x, x^*)$, together with the assumption that the measurement error $v$ and equation
error $u$ are, respectively, iid $\mathcal{N}[0, \sigma_v^2]$ and iid $\mathcal{N}[0, \sigma_u^2]$, is not sufficient to identify
the measurement error model. However, the assumption that the first four moments of
$(x^*, u, v)$ exist and that the third moments of each and the third cross-moments are not
zero, indicating a departure from normality, is sufficient to secure identification, as we
now demonstrate.

Let us reconsider the model (26.10)

$$y = \beta x^* + u,$$

$$x = x^* + v,$$

whose reduced form $y = \beta x + \varepsilon$, where $\varepsilon = u - \beta v$, is to be estimated by an instrumental
variables procedure. However, we now add a new piece of information: that the
distribution of $x^*$ is nonnormal in the sense that it is both skewed and has nonnormal
(excess) kurtosis (Cragg, 1997; Dagenais and Dagenais, 1997; Wansbeek and Meijer,
2000). These assumptions imply the following six conditions:


$$E[(xy)x] = \beta E[x^{*3}], \qquad E[(xy)\varepsilon] = 0,$$

$$E[(x^2)x] = E[x^{*3}] + E[v^3], \qquad E[(x^2)\varepsilon] = -\beta E[v^3],$$

$$E[(y^2)x] = \beta^2 E[x^{*3}], \qquad E[(y^2)\varepsilon] = E[u^3].$$

These follow by expanding each product, taking $x^*$, $u$, and $v$ to be mutually independent
with zero means. The first row implies that the product variable $x_i y_i$ is a valid instrument
if $E[x^{*3}] \neq 0$. The second row implies that $x_i^2$ is a valid instrument if $E[x^{*3}] \neq 0$, but
$E[v^3] = 0$; that is, if $x^*$ is nonnormal but $v$ has a symmetric distribution. Indeed, the greater
the skewness the better is the instrument. However, because $x^*$ is unobservable, any
inferences about it will need to be based on $x$ itself. The last row implies that $y_i^2$ is a
valid instrument if the third moment of $x^*$ is nonzero but the third moment of $u$ is zero.

Given these moment conditions, the IV approach can be applied to consistently
estimate the model parameters. This example illustrates how additional moment assumptions
can help generate useful instruments even when no data other than $(y_i, x_i)$
are available.
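
As a rough numerical check, the sketch below simulates a skewed true regressor (a centered exponential, whose third moment is 2) with symmetric normal errors u and v, and then uses x*y, x**2, and y**2 in turn as the instrument for x in the simple IV estimator sum(z*y)/sum(z*x). The distributions, parameter values, and function name are assumptions made for illustration; with symmetric errors all three instruments should be valid.

```python
import random


def higher_moment_iv(n=50000, beta=1.0, sd_v=0.5, sd_u=1.0, seed=2):
    """Return OLS and IV estimates of beta using x*y, x**2, y**2 as instruments."""
    rng = random.Random(seed)
    sxx = sxy = 0.0
    num = {"xy": 0.0, "x2": 0.0, "y2": 0.0}
    den = {"xy": 0.0, "x2": 0.0, "y2": 0.0}
    for _ in range(n):
        x_star = rng.expovariate(1.0) - 1.0       # mean 0, variance 1, skewed
        x = x_star + rng.gauss(0.0, sd_v)         # symmetric measurement error
        y = beta * x_star + rng.gauss(0.0, sd_u)  # symmetric equation error
        sxx += x * x
        sxy += x * y
        for name, z in (("xy", x * y), ("x2", x * x), ("y2", y * y)):
            num[name] += z * y                    # sample moment of z*y
            den[name] += z * x                    # sample moment of z*x
    est = {name: num[name] / den[name] for name in num}
    est["ols"] = sxy / sxx
    return est


if __name__ == "__main__":
    for name, value in higher_moment_iv().items():
        print(name, round(value, 3))
```

If x_star were drawn from a symmetric distribution instead, the denominators would be close to zero and the estimates would be erratic, which is the sense in which greater skewness gives a better instrument.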


26.3.4.  Replicated  Data 

An alternative solution is possible if the measurement error variances can be estimated.
The basic idea here is that we can adjust the sample second-moment matrix $X'X$ of the
regressors by an amount that depends on the variances and covariances of measurement
errors. Notice that we do not try to adjust the observations themselves. Instead,
the sample moments are adjusted because the estimator is a function of those sample
moments. This key idea generalizes to more complex models also.

When the measurement error variance $\Sigma_{vv}$ is known, a consistent estimate of $\beta$ can
be obtained using

$$\hat{\beta} = (X'X - N\Sigma_{vv})^{-1}X'y, \tag{26.14}$$

where $N$ is the sample size. This is consistent since

$$\mathrm{plim}\,\hat{\beta} = (\mathrm{plim}\,N^{-1}X'X - \Sigma_{vv})^{-1}\,\mathrm{plim}\,N^{-1}X'y = (\Sigma_{x^*x^*} + \Sigma_{vv} - \Sigma_{vv})^{-1}\Sigma_{x^*x^*}\beta = \beta,$$

where $\mathrm{plim}\,N^{-1}X'y = \Sigma_{x^*x^*}\beta$ is obtained using $X = X^* + V$ and $y = X\beta +
(u - V\beta)$. For a detailed account of ways to estimate $\Sigma_{vv}$ in a substantive application,
see Krashinsky (2004).

Data replication is a situation in which an unbiased estimate of the unobserved $X^*$
is available. Suppose that the measurement error is additive and we have an observable
$X$:

$$X = X^* + V.$$

If $X$ is an unbiased estimate of $X^*$, then $E[V|X^*] = 0$. If data are replicated, this
simply means that we have at least two measurements available on $X$. It also means that
with multiple measurements we can obtain estimates of the moments of $V$, assuming
the measurement errors for multiple measures are uncorrelated.

Suppose there are two scalar measurements (replicates) $X_{(1)}$ and $X_{(2)}$, such that
$X_{(j)} = X^* + V_{(j)}$, $j = 1, 2$. Then $V[V_{(j)}] = E[X_{(j)}^2] - E[X_{(1)}X_{(2)}]$, which can be estimated
by the sample average $N^{-1}\sum_i [X_{(j)i}^2 - X_{(1)i}X_{(2)i}]$. Then the regression parameters
can be estimated using Equation (26.14).
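
The two steps can be sketched for a scalar regressor as follows: estimate the measurement error variance from the two replicates, then subtract N times that estimate from the sample second moment before forming the slope. The data-generating values and the function name are hypothetical, and the variables have zero population mean so no intercept is included.

```python
import random


def replicate_corrected_slope(n=20000, beta=1.0, sd_v=0.7, seed=3):
    """Return (sigma2_v_hat, beta_OLS, beta_corrected) from two replicates."""
    rng = random.Random(seed)
    s11 = s12 = s1y = 0.0
    for _ in range(n):
        x_star = rng.gauss(0.0, 1.0)
        x1 = x_star + rng.gauss(0.0, sd_v)  # first replicate, used as regressor
        x2 = x_star + rng.gauss(0.0, sd_v)  # second replicate, independent error
        y = beta * x_star + rng.gauss(0.0, 1.0)
        s11 += x1 * x1
        s12 += x1 * x2
        s1y += x1 * y
    # Sample analogue of E[x1^2] - E[x1*x2].
    sigma2_v_hat = (s11 - s12) / n
    beta_ols = s1y / s11
    # Scalar version of (X'X - N*Sigma_vv)^{-1} X'y.
    beta_corrected = s1y / (s11 - n * sigma2_v_hat)
    return sigma2_v_hat, beta_ols, beta_corrected


if __name__ == "__main__":
    print(replicate_corrected_slope())
```

In this scalar case the corrected estimator simplifies to sum(x1*y)/sum(x1*x2), which is the IV estimator that uses the second replicate as an instrument for the first.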

For example, suppose we wish to predict grade point average (GPA) in the first
year of college using performance on the SAT exam taken in high school. It is known
that observed SAT scores for a given person vary across different takes of the exam.
Let $x^*$ denote the true SAT score, and let $x_1$ and $x_2$ denote the observed SAT score
on two separate SAT exams. Then $x_1 = x^* + v_1$, $x_2 = x^* + v_2$, and it is assumed that
$v_1$ and $v_2$ are independent with equal variance $\sigma_v^2$. It follows that $\mathrm{Cov}[x_1, x_2] = \sigma_{x^*}^2$,
$V[x_1] = V[x_2] = \sigma_{x^*}^2 + \sigma_v^2$, and $\mathrm{Cor}[x_1, x_2] = \sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2)$. Studies find the tests
to have a reliability of 0.9, which means that the correlation from one test to the next
is 0.9. Thus $\sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2) = 0.9$. It follows from
(26.8) that $\mathrm{plim}\,\hat{\beta} = 0.9 \times \beta$, so that because of measurement error SAT scores are
a stronger predictor of first-year college GPA than OLS regression suggests.

26.3.5.  Validation  Data 

Sometimes a validation sample is also collected as an additional check on the original
responses. Although the validation sample pertains to the population of interest,
it may come from a different independent source. For example, patients may respond
to a questionnaire about medical services received, and providers of services may respond
to a validation survey. Another example is that of employees who may provide
some information about an event, and the information may be validated by the same
information obtained from the employers. A leading example in economics is the PSID
validation study of Bound et al. (1994).

Let $X$ be an $N \times K$ matrix of observations on regressors measured with error, and
let $X_v$ be an $M \times K$ matrix of validation data. We can use validation data by regressing
the columns of $X_v$ on $X$, and generating “predicted” values $X[X'X]^{-1}X'X_v$ that
replace the error-contaminated matrix $X$. For nonlinear models more complex procedures
are used; see Lee and Sepanski (1995).
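
A scalar sketch of this idea follows. It assumes, as an illustration only, a validation subsample in which both the error-ridden x and an accurate measure are observed; the accurate measure is regressed on x there, and the fitted slope is used to form predicted values that replace x in the main sample. The sample sizes, parameter values, and function name are hypothetical.

```python
import random


def calibrated_slope(n_main=20000, n_valid=5000, beta=1.0, sd_v=0.7, seed=5):
    """Return (beta_OLS, beta_calibrated) using a hypothetical validation sample."""
    rng = random.Random(seed)
    # Validation sample: both the error-ridden x and the accurate x_star are seen.
    sxx_v = sxt_v = 0.0
    for _ in range(n_valid):
        x_star = rng.gauss(0.0, 1.0)
        x = x_star + rng.gauss(0.0, sd_v)
        sxx_v += x * x
        sxt_v += x * x_star
    lam = sxt_v / sxx_v  # slope from regressing the accurate measure on x
    # Main sample: only (y, x) are observed; replace x by its prediction lam * x.
    sxx = sxy = spp = spy = 0.0
    for _ in range(n_main):
        x_star = rng.gauss(0.0, 1.0)
        x = x_star + rng.gauss(0.0, sd_v)
        y = beta * x_star + rng.gauss(0.0, 1.0)
        pred = lam * x
        sxx += x * x
        sxy += x * y
        spp += pred * pred
        spy += pred * y
    return sxy / sxx, spy / spp


if __name__ == "__main__":
    print(calibrated_slope())
```

The calibrated slope equals the OLS slope divided by the estimated calibration slope, so its sampling variability reflects estimation error from both samples.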

The use of generated regressors that are substituted into the regression of interest
can be a practically useful strategy if the predictions come from a well-fitting regression.
Generated  regressors  are  estimates  of  the  true  values  and  hence  subject  to  estimation 
uncertainty.  As  such  this  uncertainty  should  be  taken  into  account  in  estimating  the 
sampling  variance  of  the  regression  coefficients.  The  relevant  theory  was  covered  in 
Section  6.8. 


26.4.  Measurement  Errors  in  Nonlinear  Models 

Nonlinear models, as should by now be abundantly clear, comprise a bewildering array
of models. Obtaining general results, such as attenuation bias, that apply to a broad
class of models poses a major challenge. Not unusually, general results are obtained
under simplifying assumptions, whereas more specific results can pay more attention
to complexity and specificity of particular data situations. Therefore, it is not surprising
that the development of this topic in the literature has produced many procedures and
approaches that are specific to particular models. For example, in dealing with binary
outcome models with left-hand-side measurement error it is natural to focus on the
problem of misclassification; in dealing with count models also with left-hand-side
measurement error it is equally natural to focus on the issues of under- and overreporting.
Motivated by this difficulty, Hsiao (1992) recommends shifting attention from
providing solutions for a general model to a specific type of question. In covering
model-specific results, there is a danger of being compendious and of losing sight of
general results. We therefore begin with some selected general results.


26.4.1.  Identification  through  Instrumental  Variables 

A general technique in the linear errors-in-variables model is the instrumental variables
method. For the nonlinear (in regressors) regression model, Y. Amemiya (1985)
showed that the IV estimator is generally inconsistent, being consistent only under the
assumption of a shrinking error variance-covariance matrix.

A simple exposition of the aforementioned point is based on the regression equation

$$y = \beta_0 + \beta_1 f(x^*) + \varepsilon, \tag{26.15}$$

where $f(x^*)$ is a smooth, differentiable, and bounded function of an error-free scalar
regressor $x^*$. The observed variable is $x = x^* + v$, where $v$ is a measurement error.
Substituting for $x^*$ and taking a Taylor expansion of $f(x - v)$ around $x$ yields

$$y = \beta_0 + \beta_1 f(x) + \varepsilon - \beta_1 f^{(1)}(x)v + \beta_1 \sum_{j=2}^{\infty} f^{(j)}(x)(-v)^j/j!, \tag{26.16}$$

where $f^{(j)}(\cdot)$ denotes the $j$th derivative of $f(\cdot)$. Consider the quadratic case $f(x) =
x^2 + \gamma x$, so $f^{(1)}(x) = 2x + \gamma$, $f^{(2)}(x) = 2$, and $f^{(j)}(x) = 0$, $j > 2$. Then

$$y = \beta_0 + \beta_1(x^2 + \gamma x) + \varepsilon - \beta_1(2x + \gamma)v + \beta_1 2v^2/2$$

$$= \beta_0 + \beta_1 x^2 + \beta_1\gamma x + (\varepsilon - 2\beta_1 xv - \beta_1\gamma v + \beta_1 v^2), \tag{26.17}$$

so valid instrumental variables should be correlated with $x^2$ and $x$, but uncorrelated
with $u = (\varepsilon - 2\beta_1 xv - \beta_1\gamma v + \beta_1 v^2)$. Clearly it is not enough that $v$ and $\varepsilon$ are individually
uncorrelated with the instruments. This means that the instrumental variable for
$f(x)$ has to satisfy more stringent properties than in the linear case.
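
The quadratic case can be checked numerically. The sketch below builds y from the error-free regressor, rewrites it in terms of the observed x with composite error eps - 2*b1*x*v - b1*gamma*v + b1*v**2, and confirms that the two expressions agree observation by observation. It also reports the sample mean of the composite error and its covariance with x**2. The parameter values and the function name are hypothetical.

```python
import random


def quadratic_expansion_check(n=50000, b0=0.5, b1=1.0, gamma=1.0, sd_v=0.5, seed=4):
    """Return (max identity gap, mean of composite error, Cov[x**2, composite error])."""
    rng = random.Random(seed)
    max_gap = 0.0
    su = sx2 = sx2u = 0.0
    for _ in range(n):
        x_star = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, sd_v)
        eps = rng.gauss(0.0, 1.0)
        x = x_star + v
        y = b0 + b1 * (x_star ** 2 + gamma * x_star) + eps
        # Composite error after rewriting the model in terms of observed x.
        u = eps - 2.0 * b1 * x * v - b1 * gamma * v + b1 * v ** 2
        y_rewritten = b0 + b1 * x ** 2 + b1 * gamma * x + u
        max_gap = max(max_gap, abs(y - y_rewritten))
        su += u
        sx2 += x * x
        sx2u += x * x * u
    mean_u = su / n
    cov_x2_u = sx2u / n - (sx2 / n) * mean_u
    return max_gap, mean_u, cov_x2_u


if __name__ == "__main__":
    print(quadratic_expansion_check())
```

The composite error has a nonzero mean and is correlated with x**2 through the product x*v, so an instrument has to be uncorrelated with products such as x*v and v**2, not just with v and eps separately.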

More generally, Y. Amemiya has shown, using Taylor approximation, that the instrumental
variable does not yield consistent estimates for nonlinear errors-in-variables
models because the residual term involves both measurement error and an observed
error-contaminated variable. Therefore it is not possible to find an instrumental variable
that is highly correlated with the observed variable but uncorrelated with the residual
term. Furthermore, from a practical viewpoint, it is not easy to verify the validity of
an instrumental variable in estimation because of limited information about the latent
variable ($x^*$) and measurement error.

26.4.2.  Identification  Using  Replicated  Data 

Faced  with  the  difficulty  of  implementing  an  IV-type  estimation  method,  there  are  two 
alternatives. 

The  first  is  to  make  very  strong  distributional  assumptions  about  the  conditional 
distribution  of  the  unobserved  x*  given  the  observed  x.  Such  assumptions,  augmented 
by  other  technical  conditions,  make  it  possible  to  identify  the  parameters  of  the  model. 
This  approach  has  been  followed  by  Y.  Amemiya  (1985)  and  Hsiao  (1989),  among 
others. 

A second approach is to consider the possibility of having a large number of measurements
of each unobserved $x^*$, denoted $x_{(j)}$. Then the average of the replicated
measures for each $x^*$ is substituted for the unobserved regressor. Consistent estimation
of the nonlinear regression then follows because the covariance matrix of measurement
errors shrinks to zero as the number of replicates grows; see Y. Amemiya
(1985). Unfortunately, such a scenario is rarely encountered in econometrics.

Since  there  does  not  exist  common  structural  information  in  nonlinear  measurement 
error  models  that  can  be  used  to  identify  and  estimate  regression  models,  we  consider 
some  specific  nonlinear  regression  models. 

Hausman, Newey, and Powell (1995) analyze polynomial Engel curves using Consumer
Expenditure Survey data. Their polynomial function is linear in parameters.
They prove that, under regularity conditions, both an instrumental variable and an additional
measurement can be used to obtain consistent and asymptotically normally
distributed estimates. In this application, an adjacent quarter is treated as a replication
and an instrumental variable. They further propose that a general nonlinear function
can be approximated by a polynomial function. However, they admit that the IV
method cannot be implemented in this case and an additional measure of true regressors
is needed.

Li (2002) proposes a general two-stage approach to the nonlinear errors-in-variables
problem, which relies on repeated measurements. In the first stage, based on empirical
characteristic functions and the inverse Fourier transform, a nonparametric estimator
is obtained for the conditional density of the latent variables. With this estimator
available, a semiparametric nonlinear least-squares estimator is constructed using a
minimum distance criterion. He establishes the estimator’s consistency. This estimator
is also robust in the sense that it does not require any knowledge of the functional
form of the latent variables. Li’s approach can be applied to any nonlinear errors-in-variables
situation if replicated measurements are available. However, the asymptotic
distribution of the estimator has not been established.

26.4.3.  Measurement  Errors  in  Dependent  Variables 

In a linear regression model the measurement errors in the dependent variable inflate
the standard errors of regression parameters but do not lead to inconsistency of the
estimator. In a nonlinear model there are additional consequences.

One class of applications has considered misclassification of responses in qualitative
choice models. This has generated a literature on reporting errors.


Discrete  Choice  Models 

Poterba and Summers (1995), in a study of the effects of unemployment insurance on
the duration of unemployment using the CPS data, generalize a probabilistic model to
allow for misclassification in labor market status transition. Specifically, they focus on
potential classification errors in three classes: employed, unemployed, and not in the
labor force. They develop a multinomial logit model with a special feature of the data
set: that all of the individuals are assumed to correctly report as unemployed in the first
survey month. Their results show that unemployment insurance increases unemployment
spells and that correction for labor market status misclassification strengthens the
apparent effect of unemployment insurance on spell durations. However, their model
is based on an assumption that the probability of reporting errors is fixed and uncorrelated
with individual characteristics, which, as the authors agree, is “unlikely to hold
in practice.” Although the authors claim that the parameter estimates are consistent,
Hausman, Abrevaya, and Scott-Morton (1998) argue that the standard errors are inconsistently
estimated because the procedure ignores the sampling variability of the estimated
error probability and the non-block-diagonal form of the information matrix.

Hausman et al. (1998) propose a parametric method for estimating a binary choice
model with misclassification. However, their parametric method requires knowledge
of the error distribution. They emphasize that parameter estimates may be inconsistent
if the distribution does not have the assumed parametric form. They further
introduce a two-stage semiparametric method. The key condition in the model for
identification is that the expected value of the observed dependent variable is an increasing
function of the underlying index, which they show is weaker than the condition
for identification of a parametric model. Compared to the approach of Poterba and
Summers (1995), theirs is robust in the sense that the misclassification probability is
allowed to be a function of individual characteristics. Using the CPS and PSID, they show
that serious misclassification exists in a job-change variable.
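
A minimal sketch of a parametric misclassification likelihood is given below. It assumes a probit index with constant misclassification rates, written as Pr(observed y = 1 | x) = a0 + (1 - a0 - a1) * Phi(b0 + b1*x), where a0 is the probability of recording a one when the truth is zero and a1 the probability of recording a zero when the truth is one. This form, the function names, and the scalar regressor are assumptions made for illustration, not necessarily the authors' exact specification.

```python
import math


def std_normal_cdf(t):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def prob_observed_one(index, a0, a1):
    """Pr(observed y = 1 | index) with misclassification rates a0 and a1.

    a0 = Pr(observed 1 | true 0), a1 = Pr(observed 0 | true 1).
    The probability is increasing in the index whenever a0 + a1 < 1.
    """
    return a0 + (1.0 - a0 - a1) * std_normal_cdf(index)


def log_likelihood(params, data):
    """Log-likelihood for data given as (x, observed_y) pairs."""
    b0, b1, a0, a1 = params
    total = 0.0
    for x, y_obs in data:
        p = prob_observed_one(b0 + b1 * x, a0, a1)
        total += math.log(p) if y_obs == 1 else math.log(1.0 - p)
    return total


if __name__ == "__main__":
    sample = [(-1.0, 0), (0.0, 1), (0.5, 0), (2.0, 1)]  # made-up observations
    print(log_likelihood((0.0, 1.0, 0.05, 0.10), sample))
```

With a0 = a1 = 0 the expression reduces to the ordinary probit likelihood; with positive rates the observed response probability is bounded between a0 and 1 - a1, which is what lets the misclassification rates be estimated along with the index coefficients under the assumed distribution.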

Klein and Sherman (1997) develop an “Orbit model” (with features of the ordered
choice model and the Tobit model) for the estimation of projected demand for a potential
new video product. They find evidence that potential consumers exaggerate demand.
The Orbit model is a two-stage procedure with the first stage estimating the
parameters of a standard Tobit model for actual future demand and the second stage
estimating the mapping function between current projected demand and actual future
demand. They further establish consistency and asymptotic normality of Orbit estimators.
However, the identification of the model requires the assumption that a projected
zero demand will be exactly zero demand in the future as well. This may be a strong
assumption.

Hsiao  and  Sun  (1999)  use  market  survey  data  on  the  demand  for  an  advanced  elec¬ 
tronic  device.  They  argue  that  respondents  may  report  biased  demands.  They  propose 
a  randomized  response  model  and  a  one-sided  response  bias  model  for  overreporting, 
in which different parametric probabilities are assigned to the truth and alternative
choices (including the truth) with logit or probit density function for the truly
revealed preference. They find that “there is a substantial response bias in the data and
the  revised  market  take  rates  and  price  elasticities  appear  more  reasonable  than  the 
estimates  obtained  based  on  the  assumption  that  the  respondents  truly  indicate  their 
preference.” 


Count  Regression 

In  the  nonlinear  count  regression  context,  Cameron  and  Trivedi  (1998)  suggest  an 
approach  for  modeling  count  data  subject  to  probabilistic  underrecording.  The  ap¬ 
proach  generates  compound  Poisson  and  negative  binomial  count  models  by  allowing 
for  a  binary  recording  outcome.  Specifically,  for  each  single  occurrence  of  an  event, 
a  Bernoulli  trial  is  used  to  determine  whether  the  event  is  recorded.  Given  a  positive 
probability  that  an  event  may  not  be  recorded,  the  distribution  of  the  recorded  events 
has  a  smaller  mean  and  variance  than  the  distribution  of  the  actual  events.  They  fur¬ 
ther  investigate  estimation  of  the  models  by  ML,  quasi-generalized  pseudo  maximum 
likelihood,  and  moment-based  methods.  Based  on  a  Monte  Carlo  study,  they  find  that 
the  performance  of  the  ML  estimator  is  good  for  samples  of  size  500  or  more. 
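
The recording mechanism just described can be illustrated with a minimal simulation sketch. It assumes Poisson actual counts with mean `mu` and a constant recording probability `p`; both values, and all function and variable names, are illustrative assumptions rather than quantities taken from Cameron and Trivedi (1998). Each actual event is kept through an independent Bernoulli trial, so the recorded count is a binomial draw given the actual count.

```python
import numpy as np

def simulate_underrecording(mu=4.0, p=0.7, n=200_000, seed=0):
    """Draw actual Poisson counts and record each event with probability p."""
    rng = np.random.default_rng(seed)
    actual = rng.poisson(mu, size=n)
    # The sum of `actual` independent Bernoulli(p) trials is Binomial(actual, p).
    recorded = rng.binomial(actual, p)
    return actual, recorded

actual, recorded = simulate_underrecording()
print("actual:   mean %.3f, variance %.3f" % (actual.mean(), actual.var()))
print("recorded: mean %.3f, variance %.3f" % (recorded.mean(), recorded.var()))
```

In this sketch the recorded counts have both a smaller mean and a smaller variance than the actual counts, in line with the statement in the text.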

Jordan et al. (1997) give an application of the errors-in-variables method in the
Poisson  regression  model.  In  a  study  of  death  from  stomach  cancer  in  five  Japanese 
counties,  they  notice  that  a  covariate  (e.g.,  plasma  lycopene  level)  is  unknown  and 
is  estimated  from  a  randomly  chosen  collective  and,  therefore,  is  subject  to  sampling 
error.  With  the  assumption  that  the  measurement  error  is  distributed  normally,  they 
implement  a  Bayesian  technique  by  obtaining  the  posterior  distributions  of  the  param¬ 
eters  using  Gibbs  sampling.  The  results  indicate  that  the  corrected  model  gives  more 
accurate  estimates  of  the  parameters  even  when  the  original  sample  is  small. 


26.4.4.  Poisson  Regression  with  Measurement  Errors  in  Covariates 

We  now  consider  in  greater  detail  one  specific  example  of  a  nonlinear  regression  model 
with  additive  measurement  errors  in  covariates.  This  example  illustrates  both  the  con¬ 
sequences  of  such  measurement  errors  and  also  feasible  estimation  strategies. 

Guo and Li (2002) have shown that measurement errors in covariates in general
lead to overdispersion in the observed data. They also show using Monte Carlo
simulations  that  biases  will  occur  if  the  overdispersion  caused  by  measurement  er¬ 
rors  is  incorrectly  modeled  as  arising  from  unobserved  heterogeneity.  Therefore,  one 
should  not  conclude  from  the  presence  of  overdispersion  that  a  model  with  unobserved 
heterogeneity  is  warranted. 

Stefanski (1989) and Nakamura (1990) propose a corrected score estimator that
is consistent when measurement errors are present. In particular, Nakamura (1990) gives
a closed form for the corrected score function when the measurement errors are normally
distributed and replicated data are also available. Guo and Li (2002) generalize
Nakamura’s (1990) approach to measurement errors that need not be normally distributed.


Measurement Errors and Overdispersion

In this section, we consider the Poisson regression model in which the discrete random
variable $y$ follows the Poisson distribution with parameter
$\mu = \exp(\mathbf{x}^{*\prime}\boldsymbol{\beta})$, where $\boldsymbol{\beta}$
is a $K \times 1$ parameter. As is well known, the Poisson regression model has an
equidispersion property that

$$\mathrm{E}[y|\mathbf{x}^*] = \mathrm{V}[y|\mathbf{x}^*]. \tag{26.18}$$

If the measurement errors are additive, then

$$\mathbf{x} = \mathbf{x}^* + \boldsymbol{\varepsilon},$$

where $\boldsymbol{\varepsilon}$ are assumed to be independent of the unobserved latent
variable $\mathbf{x}^*$, with mean zero and variance-covariance matrix
$\boldsymbol{\Sigma}_\varepsilon$. This notation covers the case where all or
some of the explanatory variables are measured with errors.

Measurement  errors  increase  dispersion  (see  Chesher,  1991).  This  applies  to  the 
Poisson  regression,  in  the  sense  that  although  (26.18)  holds  for  the  conditional  mean 
and  variance  of  y  given  x*,  conditioning  on  x  changes  the  result.  Instead,  we  get 
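$\mathrm{E}[y|\mathbf{x}] < \mathrm{V}[y|\mathbf{x}]$, in part because $\mathrm{E}[y|\mathbf{x}^*] \neq \mathrm{E}[y|\mathbf{x}]$ and $\mathrm{V}[y|\mathbf{x}^*] \neq \mathrm{V}[y|\mathbf{x}]$.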

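If $g(\mathbf{x}^*|\mathbf{x})$ denotes the conditional density of $\mathbf{x}^*$ given $\mathbf{x}$, then Guo and Li show that

$$
\begin{aligned}
\mathrm{E}[y|\mathbf{x}] &= \int \mathrm{E}[y|\mathbf{x}^*]\, g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^* \\
&= \int \mathrm{E}[y^2|\mathbf{x}^*]\, g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^* - \int \left(\mathrm{E}[y|\mathbf{x}^*]\right)^2 g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^*,
\end{aligned}
\tag{26.19}
$$

and using (26.18) the conditional variance of $y$ given $\mathbf{x}$ is given by

$$
\mathrm{V}[y|\mathbf{x}] = \int \mathrm{E}[y^2|\mathbf{x}^*]\, g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^* - \left[\int \mathrm{E}[y|\mathbf{x}^*]\, g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^*\right]^2.
\tag{26.20}
$$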


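A comparison of (26.19) and (26.20) shows that the first term in the second line of
(26.19) is the same as the first term in (26.20). Using this Guo and Li show that

$$
\mathrm{E}[y|\mathbf{x}] - \mathrm{V}[y|\mathbf{x}] = \left[\int \mathrm{E}[y|\mathbf{x}^*]\, g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^*\right]^2 - \int \left(\mathrm{E}[y|\mathbf{x}^*]\right)^2 g(\mathbf{x}^*|\mathbf{x})\, d\mathbf{x}^* < 0,
\tag{26.21}
$$

which is interpreted to mean that measurement errors lead to overdispersion.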
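
The overdispersion result can be checked numerically. The following is a minimal sketch under assumed values that do not come from Guo and Li: a scalar regressor $x^* \sim \mathcal{N}[0,1]$, slope 1 with no intercept, and measurement error with standard deviation 0.5. Conditioning is approximated by keeping observations whose conditioning variable falls in a narrow bin around 1; the helper name `mean_var_in_bin` is hypothetical.

```python
import numpy as np

def mean_var_in_bin(y, cond, center=1.0, half_width=0.05):
    """Sample mean and variance of y over observations whose conditioning
    variable lies within half_width of center."""
    keep = np.abs(cond - center) < half_width
    return y[keep].mean(), y[keep].var()

rng = np.random.default_rng(0)
n = 400_000
beta = 1.0                                # assumed slope, no intercept
x_star = rng.normal(0.0, 1.0, size=n)     # true (latent) regressor
eps = rng.normal(0.0, 0.5, size=n)        # additive measurement error
x = x_star + eps                          # observed regressor
y = rng.poisson(np.exp(beta * x_star))    # Poisson given the true regressor

m_true, v_true = mean_var_in_bin(y, x_star)
m_obs, v_obs = mean_var_in_bin(y, x)
print("given x* near 1: mean %.3f, variance %.3f" % (m_true, v_true))
print("given x  near 1: mean %.3f, variance %.3f" % (m_obs, v_obs))
```

Given $x^*$ the sample mean and variance are approximately equal, as (26.18) requires, whereas given the mismeasured $x$ the variance exceeds the mean.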


Estimation of Errors-in-Variables Model

When  x  are  contaminated  by  measurement  errors  ML  estimation  or  NLS  based  on  the 
observables  (y,  x)  does  not  provide  consistent  estimates.  Replacement  of  covariate  x* 
by  x  in  estimation  is  referred  to  as  a  “naive”  model. 

There  are  two  issues  to  consider.  First,  why  does  ML  give  inconsistent  estimates 
when  measurement  errors  are  present?  Second,  is  consistent  estimation  possible?  The 
answer  to  the  second  question  is  “yes”  if  we  adopt,  following  Stefanski  (1989)  and 
Nakamura  (1990),  the  method  of  corrected  score  estimation  for  the  generalized  linear 
models. 

The idea underlying the corrected score estimator is that the conditional distribution
of  the  corrected  estimate  with  respect  to  x,  given  the  true  independent  variables  x* 
and  the  dependent  variables  y,  is  centered  around  the  ML  estimate,  which  provides  a 
consistent  estimate  of  the  true  value  of  the  parameter  of  interest. 


Inconsistent  and  Consistent  Estimators 


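Suppose that $N$ observations $(y_i, \mathbf{x}_i^*)$, $i = 1, \ldots, N$, are generated from a Poisson
distribution with probability mass function

$$\Pr[Y_i = y_i|\mathbf{x}_i^*] = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},$$

where $\mu_i(\boldsymbol{\beta}_0) = \exp(\mathbf{x}_i^{*\prime}\boldsymbol{\beta}_0)$. Given observations $(y_i, \mathbf{x}_i^*)$, $i = 1, \ldots, N$, the MLE $\hat{\boldsymbol{\beta}}$ is
consistent since the probability limit of the average log-likelihood function

$$
\begin{aligned}
\operatorname{plim} N^{-1} \ln L(\boldsymbol{\beta}) &= \operatorname{plim} N^{-1} \sum_i \left\{-e^{\mathbf{x}_i^{*\prime}\boldsymbol{\beta}} + y_i \mathbf{x}_i^{*\prime}\boldsymbol{\beta} - \ln y_i!\right\} \\
&= \mathrm{E}_{y,\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}} + y\mathbf{x}^{*\prime}\boldsymbol{\beta} - \ln y!\right]
\end{aligned}
\tag{26.22}
$$

is maximized at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$.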

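Suppose we observe $\mathbf{x}_i$ rather than $\mathbf{x}_i^*$, where $\mathbf{x}_i = \mathbf{x}_i^* + \boldsymbol{\varepsilon}_i$ and $\boldsymbol{\varepsilon}_i \sim \mathcal{N}[\mathbf{0}, \boldsymbol{\Sigma}_\varepsilon]$
independent of $\mathbf{x}_i^*$. Then $y_i|\mathbf{x}_i$ is not Poisson distributed. If one nevertheless uses the
“naive” Poisson model, the resulting estimator $\hat{\boldsymbol{\beta}}$ maximizes

$$Q(\boldsymbol{\beta}) = N^{-1} \sum_i \left\{-e^{\mathbf{x}_i'\boldsymbol{\beta}} + y_i \mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\right\}. \tag{26.23}$$

This misspecified log-likelihood function converges to

$$\operatorname{plim} Q(\boldsymbol{\beta}) = \mathrm{E}_{y,\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}} + y\mathbf{x}^{*\prime}\boldsymbol{\beta} - \ln y!\right] + \mathrm{E}_{\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}\right]\left(\mathrm{E}_{\boldsymbol{\varepsilon}}\left[e^{\boldsymbol{\varepsilon}'\boldsymbol{\beta}}\right] - 1\right), \tag{26.24}$$

which in general is not maximized at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$. So the naive estimator $\hat{\boldsymbol{\beta}}$ is inconsistent for $\boldsymbol{\beta}_0$.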

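A suitably modified objective function yields consistent estimates. Equations
(26.22) and (26.24) imply that

$$\left\{\operatorname{plim} Q(\boldsymbol{\beta}) - \mathrm{E}_{\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}\right]\left(\mathrm{E}_{\boldsymbol{\varepsilon}}\left[e^{\boldsymbol{\varepsilon}'\boldsymbol{\beta}}\right] - 1\right)\right\} = \operatorname{plim} N^{-1} \ln L(\boldsymbol{\beta}).$$

This suggests maximizing the objective function

$$Q^{+}(\boldsymbol{\beta}) = N^{-1} \sum_i \left\{-e^{\mathbf{x}_i'\boldsymbol{\beta}} + y_i \mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\right\} - \mathrm{E}_{\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}\right]\left(\mathrm{E}_{\boldsymbol{\varepsilon}}\left[e^{\boldsymbol{\varepsilon}'\boldsymbol{\beta}}\right] - 1\right),$$

since $Q^{+}(\boldsymbol{\beta})$ converges to $\operatorname{plim} N^{-1} \ln L(\boldsymbol{\beta})$. Now, given independence of $\mathbf{x}^*$ and $\boldsymbol{\varepsilon}$,

$$\mathrm{E}_{\mathbf{x}^*}\left[-e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}\right]\mathrm{E}_{\boldsymbol{\varepsilon}}\left[e^{\boldsymbol{\varepsilon}'\boldsymbol{\beta}}\right] = \mathrm{E}_{\mathbf{x}^*,\boldsymbol{\varepsilon}}\left[-e^{(\mathbf{x}^*+\boldsymbol{\varepsilon})'\boldsymbol{\beta}}\right] = -\mathrm{E}_{\mathbf{x}}\left[e^{\mathbf{x}'\boldsymbol{\beta}}\right],$$

which is consistently estimated by $-N^{-1}\sum_i e^{\mathbf{x}_i'\boldsymbol{\beta}}$. It follows after some cancellation
that maximizing $Q^{+}(\boldsymbol{\beta})$ is equivalent to maximizing

$$Q^{++}(\boldsymbol{\beta}) = N^{-1} \sum_i \left\{y_i \mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\right\} - \mathrm{E}_{\mathbf{x}^*}\left[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}\right]. \tag{26.25}$$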
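
This yields a consistent estimate of $\boldsymbol{\beta}_0$. Implementation requires a suitable estimate
of $\mathrm{E}_{\mathbf{x}^*}[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}]$, which is possible if replicated data are available. If the distribution of
explanatory variables is specified up to unknown parameters, then these unknown
parameters can be estimated by the replicated measurements. Therefore, $\mathrm{E}_{\mathbf{x}^*}[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}]$ can
be estimated.

The estimator $\hat{\boldsymbol{\beta}}_C$ that maximizes (26.25) is termed the corrected score
estimator by Guo and Li (2002) because it is the root of the corrected score function
$\sum_i \left(y_i\mathbf{x}_i - \mathrm{E}_{\mathbf{x}^*}[\mathbf{x}^* e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}}]\right) = \mathbf{0}$. Guo and Li also establish the asymptotic normality of
this estimator. The estimated asymptotic covariance matrix is $\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_C] = N^{-1}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$,
where

$$\mathbf{A} = \mathrm{E}_{\mathbf{x}^*}\left[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}_C}\mathbf{x}^*\mathbf{x}^{*\prime}\right],$$

$$\mathbf{B} = N^{-1} \sum_i \left(y_i\mathbf{x}_i - \mathrm{E}_{\mathbf{x}^*}\left[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}_C}\mathbf{x}^*\right]\right)\left(y_i\mathbf{x}_i - \mathrm{E}_{\mathbf{x}^*}\left[e^{\mathbf{x}^{*\prime}\boldsymbol{\beta}_C}\mathbf{x}^*\right]\right)'.$$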

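Nakamura (1990) made the stronger assumption that the measurement errors $\boldsymbol{\varepsilon}$ are
normally distributed as $\mathcal{N}[\mathbf{0}, \boldsymbol{\Omega}]$. Then

$$\exp(\mathbf{x}^{*\prime}\boldsymbol{\beta}) = \mathrm{E}_{\mathbf{x}|\mathbf{x}^*}\left[\exp\left(\mathbf{x}'\boldsymbol{\beta} - \boldsymbol{\beta}'\boldsymbol{\Omega}\boldsymbol{\beta}/2\right)\right].$$

By the law of iterated expectations

$$\mathrm{E}_{\mathbf{x}^*}\left[\exp(\mathbf{x}^{*\prime}\boldsymbol{\beta})\right] = \mathrm{E}_{\mathbf{x}}\left[\exp\left(\mathbf{x}'\boldsymbol{\beta} - \boldsymbol{\beta}'\boldsymbol{\Omega}\boldsymbol{\beta}/2\right)\right],$$

which can be consistently estimated by $N^{-1}\sum_i \exp\left(\mathbf{x}_i'\boldsymbol{\beta} - \boldsymbol{\beta}'\boldsymbol{\Omega}\boldsymbol{\beta}/2\right)$.
Consequently, the objective function in (26.25) becomes

$$Q^{++}(\boldsymbol{\beta}) = N^{-1} \sum_i \left[y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i! - \exp\left(\mathbf{x}_i'\boldsymbol{\beta} - \boldsymbol{\beta}'\boldsymbol{\Omega}\boldsymbol{\beta}/2\right)\right].$$

This is the corrected log-likelihood function given in Nakamura (1990). Maximization
with respect to $\boldsymbol{\beta}$ yields a consistent estimate of $\boldsymbol{\beta}_0$.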
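
To make the comparison between the naive and the corrected objective concrete, here is a minimal sketch for an intercept plus one mismeasured scalar regressor, with the error variance `omega` treated as known. The design values (`a0`, `b0`, `omega`, normal $x^*$), the Newton solver, and all names are illustrative assumptions; this is not the authors' or Nakamura's code. Setting `omega = 0` in the same objective gives the naive Poisson log-likelihood up to the $\ln y_i!$ term.

```python
import numpy as np

def grad_hess(theta, y, x, omega):
    """Gradient and Hessian of mean{ y*(a + b*x) - exp(a + b*x - b^2*omega/2) }.
    omega = 0 gives the naive Poisson objective (dropping ln y!)."""
    a, b = theta
    z = x - b * omega                       # derivative of the index w.r.t. b
    m = np.exp(a + b * x - 0.5 * b * b * omega)
    g = np.array([np.mean(y - m), np.mean(y * x - m * z)])
    h = -np.array([[np.mean(m), np.mean(m * z)],
                   [np.mean(m * z), np.mean(m * (z * z - omega))]])
    return g, h

def newton(theta0, y, x, omega, iters=100, tol=1e-10):
    """Newton-Raphson iterations for the maximizer of the objective above."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        g, h = grad_hess(theta, y, x, omega)
        step = np.linalg.solve(h, g)
        theta -= step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(0)
n, a0, b0, omega = 50_000, 0.5, 0.8, 0.25   # assumed design values
x_star = rng.normal(size=n)                 # true regressor
x = x_star + rng.normal(scale=np.sqrt(omega), size=n)   # observed regressor
y = rng.poisson(np.exp(a0 + b0 * x_star))

naive = newton([np.log(y.mean()), 0.0], y, x, omega=0.0)
corrected = newton(naive, y, x, omega=omega)
print("naive     (intercept, slope):", np.round(naive, 3))
print("corrected (intercept, slope):", np.round(corrected, 3))
```

In this design the naive slope is attenuated toward zero, while the corrected estimate is close to the assumed value `b0`. In practice `omega` would have to be estimated, for example from replicated measurements.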

Nakamura’s approach reminds one of the estimation of the linear regression with
measurement errors (see (26.14)) given an estimate of the covariance matrix of
measurement errors. As in that case, to maximize Nakamura’s corrected log-likelihood
function one requires knowledge of $\boldsymbol{\Omega}$, the covariance matrix of measurement errors.
This can come from replicated data. However, if the covariates are predominantly
discrete, then the normality of measurement error is not a sensible assumption. In such
cases the estimator of Guo and Li is more attractive.

For the case of multivariate $\mathbf{x}^*$, the computation of $\mathrm{E}[\exp(\mathbf{x}^{*\prime}\boldsymbol{\beta})]$ is not
straightforward, even if the distribution of $\mathbf{x}^*$ is known, because multiple integrals are
involved. Simulation-based methods (Li, 2002) provide one possible approach to this
problem.

Implementation of several other nonlinear errors-in-variables models also requires
replicated observations; for example, see Hsiao (1992) and Hausman, Newey, and
Powell (1995). Panel data could provide replicated observations at the level of an
individual. For example, consider the case of a scalar regressor $x^*$ for which two
replications of $x$ are available, because $x_{ij} = x_i^* + \varepsilon_{ij}$ for $i = 1, \ldots, N$ and $j = 1, 2$. Then a
moment-based consistent estimator of $\sigma_\varepsilon^2$ is $\hat{\sigma}_\varepsilon^2 = \sum_i (x_{i1}^2 + x_{i2}^2 - 2x_{i1}x_{i2})/2N$. Thus
both the mean and variance of $x^*$ can be estimated.
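
A minimal sketch of the replicate-based moment estimators follows. The error-variance formula is the one given in the text; the estimators of the mean and variance of $x^*$ are one simple choice (an assumption of this sketch) that relies on the two errors being mean zero, independent of each other, and independent of $x^*$. The function name `replicate_moments` and the simulated values are hypothetical.

```python
import numpy as np

def replicate_moments(x1, x2):
    """Moment estimates from two replicates x_ij = x_i* + e_ij, j = 1, 2."""
    n = len(x1)
    # Estimator from the text: sum_i (x_i1 - x_i2)^2 / (2N).
    sigma2_eps = np.sum(x1**2 + x2**2 - 2.0 * x1 * x2) / (2.0 * n)
    avg = (x1 + x2) / 2.0
    mean_xstar = avg.mean()
    # The replicate average has variance var(x*) + sigma2_eps / 2.
    var_xstar = avg.var() - sigma2_eps / 2.0
    return sigma2_eps, mean_xstar, var_xstar

rng = np.random.default_rng(0)
n = 100_000
x_star = rng.normal(2.0, 1.5, size=n)        # assumed latent regressor
x1 = x_star + rng.normal(0.0, 0.7, size=n)   # first replicate
x2 = x_star + rng.normal(0.0, 0.7, size=n)   # second replicate
s2, mu, v = replicate_moments(x1, x2)
print("sigma2_eps %.3f, mean of x* %.3f, variance of x* %.3f" % (s2, mu, v))
```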


26.5. Attenuation Bias Simulation Examples

Analytical  results  for  the  linear  model  are  given  in  Section  26.2,  but  results  are  much 
more  difficult  to  obtain  in  nonlinear  models.  Here  we  present  two  simulation  examples, 
one  for  the  logit  model  and  one  for  a  linear-in-logs  model,  that  illustrate  attenuation 
bias  in  nonlinear  regression  with  measurement  error  in  the  regressor. 

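In the first example, the dgp is the logit model with

$$y^* = \alpha^* + \beta^* x^* + \varepsilon, \qquad x^* \sim \mathcal{U}[0, 1], \qquad \varepsilon \sim \text{logistic},$$

$$y = \begin{cases} 0 & \text{if } y^* \le 0, \\ 1 & \text{if } y^* > 0. \end{cases}$$

The complication is that $x^*$ is measured with error, so that

$$x = x^* + v, \qquad v \sim \mathcal{N}[0, \sigma_v^2].$$

Since $x^* \sim \mathcal{U}[0, 1]$ it has variance $\sigma_{x^*}^2 = 1/12$, and the noise-to-signal ratio is $s =
12\sigma_v^2$. A logit regression of $y$ on $x$ rather than of $y$ on $x^*$ is estimated.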

To  conduct  a  simulation  exercise  we  carry  out  a  logit  regression  of  y  on  x,  for  six 
different  values  of  the  noise-to-signal  ratio  including  the  value  of  zero,  which  bench¬ 
marks  the  model.  The  sample  size  is  fixed  at  1,000,  and  100  simulation  replications 
are  used. 
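
The design just described can be sketched as follows. The true intercept and slope (0.8 and 1.8) and the error variance $\sigma_v^2 = 0.25$ are assumptions chosen for illustration, since the passage does not state the values used; the Newton-Raphson logit routine `fit_logit` and the helper `average_slopes` are likewise hypothetical names, not the authors' code.

```python
import numpy as np

def fit_logit(y, x, iters=25):
    """Logit MLE of y on (1, x) by Newton-Raphson; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)
        info = (X * (p * (1.0 - p))[:, None]).T @ X
        b = b + np.linalg.solve(info, grad)
    return b

def average_slopes(sigma2_v, alpha=0.8, beta=1.8, n=1000, reps=100, seed=0):
    """Average slope from logit of y on x* and of y on x = x* + v."""
    rng = np.random.default_rng(seed)
    slope_true, slope_naive = [], []
    for _ in range(reps):
        x_star = rng.uniform(0.0, 1.0, size=n)
        y_latent = alpha + beta * x_star + rng.logistic(size=n)
        y = (y_latent > 0).astype(float)
        x = x_star + rng.normal(0.0, np.sqrt(sigma2_v), size=n)
        slope_true.append(fit_logit(y, x_star)[1])
        slope_naive.append(fit_logit(y, x)[1])
    return np.mean(slope_true), np.mean(slope_naive)

sigma2_v = 0.25                       # assumed measurement error variance
b_true, b_naive = average_slopes(sigma2_v)
print("noise-to-signal ratio s = 12 * sigma2_v =", 12 * sigma2_v)
print("average slope using x*: %.3f" % b_true)
print("average slope using x : %.3f" % b_naive)
```

The slope from the regression on the mismeasured $x$ is pulled toward zero relative to the slope from the regression on $x^*$, which is the attenuation the simulation is designed to show.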

Table 26.1 shows the average values of $(\hat{\alpha}, \hat{\beta})$ in 100 replications, where $\hat{\alpha}$ and $\hat{\beta}$
are the estimated intercept and slope from logit regression of $y$ on $x$, rather than the
correct logit regression of $y$ on $x^*$, for sample size $N = 1{,}000$ and for six different
values of $\sigma_v^2$ leading to six different noise-to-signal ratios $s$. The first column with
$s = 0$ benchmarks the model. Recall that for OLS linear regression in the same setup
the multiplicative bias in the slope coefficient is $1/(1 + s)$, or 0.96, 0.8, 0.5, 0.2, and
0.1, respectively. Here the biases have a similar direction, except that for logit regression
they are considerably larger.

Table 26.1. Attenuation Bias in a Logit Regression with Measurement Error

Noise/Signal             0       0.04    0.25    1       4       9
Average $\hat{\alpha}$   0.785   1.062   1.406   1.548   1.570   1.596
Average $\hat{\beta}$    1.799   1.224   0.446   0.125   0.037   0.012

The second example is a bivariate linear-in-logs multiplicative model with $\alpha =
4$, $\beta = 0.4$, and additive measurement errors in both variables. In this case the setup is
as follows:

$$y^* = 4x^{*0.4}u, \qquad u \sim \mathcal{N}[10, 0.0001],$$

$$x^* = 100 + \mathcal{U}[0, 1],$$

$$y = y^* + \varepsilon_y, \qquad \varepsilon_y \sim \mathcal{N}[0, \sigma_y^2],$$

$$x = x^* + \varepsilon_x, \qquad \varepsilon_x \sim \mathcal{N}[0, \sigma_x^2].$$

In the simulation the sample size is 1,000, and the number of replications is 100.
We vary the value of the variance of $x^*$ from experiment to experiment, resulting in
the following values of $\sigma_x^2/\sigma_{x^*}^2$: 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 1,000, and 5,000.

Table 26.2. Attenuation Bias in a Nonlinear Regression with Additive Measurement Error

$\sigma_x^2/\sigma_{x^*}^2$   0.00025   0.0025   0.025   0.25    2.5     25
Average $\hat{\beta}$         0.393     0.383    0.341   0.217   0.063   0.020

Table 26.2 gives the average values of the slope coefficient across
different experiments in which the noise-to-signal ratio varies. Once again the
attenuation bias is obvious.

Both  examples  produce  results  that  are  consistent  with  the  hypothesis  underlying 
the  “Iron  Law  of  Econometrics.” 

26.6.  Bibliographic  Notes 

Wansbeek and Meijer (2000) is the most up-to-date and comprehensive work on
measurement errors written from an econometric perspective. It covers in depth most of the
topics in this chapter, with emphasis on linear models. The authors also include several
chapters linking measurement error models with factor models, latent variable models, and
structural equation models. In discussing results the authors eschew the phrase "it can be
shown" in favor of deriving them in detail. Again from the econometric perspective, Hausman
(2000) provides a survey of the recent results obtained in his and his collaborators’ research.
See Bound, Brown, and Mathiowetz (2001) for a survey of measurement error issues in labor markets.

The  topic  of  measurement  errors  is  well  established  in  the  statistics  literature.  Fuller  (1987) 
is  a  useful  reference;  see,  in  particular,  his  treatment  of  the  orthogonal  regression  approach 
that  is  applicable  when  the  noise-to-signal  ratio  is  known.  Although  our  analysis  of  the  linear 
model  is  very  standard  in  the  econometrics  literature,  the  reader  should  also  be  aware  of  the 
alternative  Berkson  error  model,  in  which  the  unobserved  true  variable  is  assumed  constant 
but  the  imperfectly  measured  variable  is  subject  to  error,  and  the  nonclassical  measurement 
error  model  discussed  in  Angrist  and  Krueger  (1999).  Madansky  (1959)  provides  a  survey  of 
numerous  early  results  and  approaches.  See  also  Stefanski  (2000). 

26.2 Panel data models with measurement errors are analyzed in Biørn (1992).

26.3 The intriguing topic of reverse regression is analyzed by Goldberger (1984) and Greene
(1983) in their commentary on Conway and Roberts (1983). Leamer (1978) provides
an insightful discussion of reverse regression from a Bayesian perspective. Hahn and
Hausman (2002) use the reverse regression idea to construct a specification test for the
validity of the IV approach to the measurement error problem. The concern is that the
available instruments may be weak, leading to poor estimates. The Hahn-Hausman idea
is to carry out IV estimation of the direct regression in which the mismeasured variable
appears on the right-hand side of the equation. The reverse regression has the same
mismeasured variable on the left-hand side. This regression is also estimated by instrumental
variables, using the same instruments as the direct regression.

26.4  The  literature  on  measurement  errors  in  nonlinear  models  is  more  diffuse.  Y.  Amemiya 
(1985)  is  especially  useful  to  econometricians.  From  a  statistical  viewpoint,  Carroll  et  al. 
(1995)  consider  nonlinear  models,  especially  in  the  generalized  linear  class,  with  additive 
measurement  errors  in  regressors,  using  a  variety  of  methods,  including  a  number  that 
can  be  used  if  replicated  data  are  available.  Li,  Trivedi,  and  Guo  (2003)  develop  and 
apply  a  measurement  error  variable  model  in  which  the  counted  response  variable  has 
measurement  error. 


- Exercises - 

26-1  Consider  the  attenuation  bias  result  for  the  slope  parameter  of  the  bivariate 
errors-in-variables  model  (Equation  (26.9)  in  Section  26.2.3).  Extend  the  model 
to  include  an  intercept  term. 

(a)  Derive  a  parallel  result  for  the  measurement  error  bias  of  the  intercept  term. 

(b)  Derive  a  parallel  identification-by-bounds  result  for  the  least-squares  inter¬ 
cept  estimate,  similar  to  Equation  (26.12)  in  Section  26.3.1. 

26-2 (Adapted from Bollinger, 2003) Consider a linear multiple regression model
with scalar regressor $x$ that is measured with error and a vector of other
regressors $\mathbf{z}$ that are free of measurement error.

(a)  Maintaining  the  assumptions  regarding  measurement  errors  in  the  bivari¬ 
ate  errors-in-variables  model,  extend  the  attenuation  bias  result  and  the 
identification-by-bounds  result  to  this  case. 

(b)  Check  that  the  new  results  specialize  to  those  for  the  bivariate  case. 

26-3 (Adapted from Wansbeek and Meijer, 2000) Consider the quadratic regression
model $y = \alpha + \beta x^* + \gamma x^{*2} + \varepsilon$, where the regressor $x^* = x + v$, with $x$ observed
and $v$ a measurement error. Assume that $(x^*, \varepsilon, v)$ are mutually uncorrelated and
normally distributed and that all variables have zero mean.

(a) Compare the bias of the least-squares estimator of $\beta$ and $\gamma$.

(b) Is the model identified? Compare the latter result with that from the bivariate
linear errors-in-variables model.

26-4 The literature on intergenerational mobility uses the following model (Solon,
1992; Zimmerman, 1992):

$$Y_i^{\text{son}} = \alpha + \beta Y_i^{\text{father}} + \varepsilon_i^{\text{son}}, \tag{26.26}$$

with $\varepsilon_i \sim$ iid $\mathcal{N}[0, \sigma^2]$. Here $Y$ is a measure of permanent status (such as
permanent income) and $\beta$ measures the degree of regression toward the mean in
economic status. Suppose that permanent status is not observed. Instead, current
status $Y_{it}$ is observed with $Y_{it} = Y_i + \gamma X_{it} + w_{it}$, so that $Y_{it}$ is composed
of an individual fixed effect $Y_i$, referred to as the permanent status, a systematic
factor $X_{it}$, and a transitory error component $w_{it}$. Let $\hat{\gamma}$ denote the fitted
least-squares coefficient, and let

$$Y_{it} - \hat{\gamma}X_{it} = Y_i + (\gamma - \hat{\gamma})X_{it} + w_{it} = Y_i + v_{it}.$$

(a) Let $\bar{Y}_i^{\text{father}} = T^{-1}\sum_{t=1}^{T} Y_{it}^{\text{father}}$ denote an average of father’s status used as
the independent variable, a proxy, for the unobserved permanent status in
(26.26). Let $\hat{\beta}_{\text{avg}}$ denote the corresponding regression coefficient. Show that
$\operatorname{plim} \hat{\beta}_{\text{avg}} = \beta P_Y$, where $P_Y = \sigma_Y^2/(\sigma_Y^2 + T^{-1}\sigma_v^2)$.

(b) Assume that the transitory component of father’s earnings follows an
autoregressive scheme, $v_{it}^{\text{father}} = \rho v_{i,t-1}^{\text{father}} + \xi_{it}$, where $\xi_{it} \sim \mathcal{N}[0, \sigma_\xi^2]$, $t = 1, \ldots, T$.
Show that now $\operatorname{plim} \hat{\beta}_{\text{avg}} = \beta P_Y$, where $P_Y = \sigma_Y^2/(\sigma_Y^2 + T^{-1}V)$ and
$V = \sigma_\xi^2[T(1 - \rho^2)]^{-1}\left[1 + 2\rho\{T - (1 - \rho^T)/(1 - \rho)\}/T(1 - \rho)\right]$.


Missing Data and Imputation


27.1.  Introduction 

The  problem  of  missing  data  in  survey  data  is  one  of  long  standing,  arising  from 
nonresponse  or  partial  response  to  survey  questions.  Reasons  for  nonresponse  include 
unwillingness  to  provide  the  information  asked  for,  difficulty  of  recall  of  events  that 
occurred  in  the  past,  and  not  knowing  the  correct  response.  Imputation  is  the  process 
of  estimating  or  predicting  the  missing  observations. 

In this chapter we deal with the regression setup with data vector $(y_i, \mathbf{x}_i)$, $i =
1, \ldots, N$. For some of the observations some elements of $\mathbf{x}_i$ or of both $(y_i, \mathbf{x}_i)$ are
missing.  A  number  of  questions  are  considered.  When  can  we  proceed  with  an  anal¬ 
ysis  of  only  the  complete  observations,  and  when  should  we  attempt  to  fill  the  gaps 
left  by  the  missing  observations?  What  methods  of  imputation  are  available?  When 
imputed  values  for  missing  observations  are  obtained,  how  should  estimation  and  in¬ 
ference  then  proceed? 

If  a  data  set  has  missing  observations,  and  if  these  gaps  can  be  filled  by  a  statistically 
sound  procedure,  then  benefit  comes  from  a  larger  and  possibly  more  representative 
sample  and,  under  ideal  circumstances,  more  precise  inference.  The  cost  of  estimating 
missing  data  comes  from  having  to  make  (possibly  wrong)  assumptions  to  support  a 
procedure  for  generating  proxies  for  the  missing  observations,  and  from  the  approxi¬ 
mation  error  inherent  in  any  such  procedure.  Further,  statistical  inference  that  follows 
data  augmentation  after  imputed  values  replace  missing  data  is  more  complicated  be¬ 
cause  such  inference  must  take  into  account  the  approximation  errors  introduced  by 
imputation. 

Gaps  in  data  as  the  result  of  survey  nonresponse  and  attrition  from  panels  occur 
frequently.  Imputation  of  missing  values  may  be  done  by  agencies  for  creating  and 
maintaining  the  public-use  survey  databases  or  by  those  who  use  the  data  for  model¬ 
ing.  In  the  former  case  the  agency  may  have  more  extensive  information,  including 
confidential  information,  that  can  be  harnessed  in  the  imputation  process.  In  the  latter 
case  the  modeler  may  have  a  specific  modeling  framework  that  can  be  exploited  in  the 
imputation  process.  In  both  cases  model-based  imputation  procedures  are  feasible. 

Figure 27.1: Missing data: examples of missing regressors. Panel B: special pattern of missing data on $x_1$ and $x_2$. Panel C: general pattern of missing data.


An interesting example of missing data arises in the context of the Survey of Consumer
Finances (Kennickell, 1998). Because of the sensitivity of the issue of consumer
finances  the  survey  exhibits  many  gaps  in  information  on  income  and  wealth.  Analysts 
at  the  U.S.  Federal  Reserve  have  developed  and  implemented  complex  imputation  al¬ 
gorithms  for  continuous  and  discrete  variables  using  both  publicly  available  survey  in¬ 
formation  on  income  and  wealth  as  well  as  confidential  information  from  census  data. 

Figure 27.1 shows some potential patterns of missing data on the regressors. The
data set has a scalar dependent variable $y$ and three regressors, $x_1$, $x_2$, and $x_3$, for each
observation, then stacked as $(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$. In panel A, there are complete data on
$(\mathbf{y}, \mathbf{x}_2, \mathbf{x}_3)$ but a block of observations on $\mathbf{x}_1$ is missing. In panel B there are complete
data on $(\mathbf{y}, \mathbf{x}_3)$ but there are missing blocks of data on $(\mathbf{x}_1, \mathbf{x}_2)$ such that $\mathbf{x}_1$ and $\mathbf{x}_2$
are never simultaneously observed. In panel C there are missing observations on all
three regressors, with no particular pattern of missingness.

The simplest way of handling missing data is to delete them and analyze only the
reduced sample of “complete” observations. For example, in the case of panel A, the
complete sample would be the subset of $(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ formed by all available data on
$\mathbf{x}_1$ and the corresponding observations on $(\mathbf{y}, \mathbf{x}_2, \mathbf{x}_3)$. In the case of panel B, however,
following this approach would leave no usable observations, unless one excluded
$(\mathbf{x}_1, \mathbf{x}_2)$ from the analysis. In panel C the complete data set is formed after deleting any
observation that contains a missing data point on any of the three regressors.

The procedure just described is called listwise deletion. It is widely followed and is
often a default option in statistical software. It is not necessarily innocuous; the
consequences depend on the missing data mechanism, and the conclusions drawn from such
studies might be seriously flawed. Of course, in general throwing away data means
throwing away information, and that reduces efficiency in estimation. Hence, provided
the gaps attributed to missing data can be filled without creating distortion, imputation
seems worth trying as an alternative to listwise deletion. This chapter will study
alternative approaches and their limitations.
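
Listwise deletion is simple to express in code. The sketch below assumes the data are held in a NumPy array with columns $(y, x_1, x_2, x_3)$ and that `np.nan` marks a missing entry; the toy numbers and the function name `listwise_delete` are hypothetical.

```python
import numpy as np

def listwise_delete(data):
    """Return the rows with no missing (NaN) entry and the row mask used."""
    complete = ~np.isnan(data).any(axis=1)
    return data[complete], complete

# Columns: y, x1, x2, x3 (general pattern, as in panel C).
data = np.array([
    [1.0, 0.5,    2.0,    3.0],
    [2.0, np.nan, 1.0,    0.0],
    [0.0, 1.5,    np.nan, 1.0],
    [3.0, 2.5,    0.5,    2.0],
])
kept, mask = listwise_delete(data)
print(kept.shape[0], "of", data.shape[0], "rows retained")

# Pattern as in panel B: x1 and x2 are never observed together.
panel_b = np.array([
    [1.0, 0.5,    np.nan, 3.0],
    [2.0, np.nan, 1.0,    0.0],
])
kept_b, _ = listwise_delete(panel_b)
print(kept_b.shape[0], "rows retained under the panel B pattern")
```

The second case shows why listwise deletion leaves no usable observations when two regressors are never simultaneously observed.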

Broadly,  there  are  two  approaches  to  imputation,  one  that  is  model-based  and  one 
that  is  not.  The  modern  approach  prefers  model-based  approaches.  These  use  a  model 
to  impute  the  missing  observations  and  then  use  the  subsequent  full  data  set  to  obtain 
better  estimates  of  the  model  parameters.  The  process  is  iterative.  Single  and  multiple 
imputation  are  feasible.  A  key  feature  of  the  modern  approach  is  to  regard  missing 
data  as  random  variables  and  then  to  replace  them  with  multiple  draws  from  the  as¬ 
sumed  underlying  distribution;  the  process  is  called  multiple  imputation.  Simulation 
methods  may  be  used  to  approximate  such  a  distribution. 
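
As a rough illustration of multiple imputation, the sketch below treats the missing values of a single variable as random draws under an assumed univariate normal model with a noninformative prior, and pools the sample mean across the completed data sets. The pooling formula is the combining rule usually attributed to Rubin in the multiple imputation literature; it is not stated in this passage. The function name, the number of imputations `m`, and the simulated data are assumptions, and the sketch ignores covariates and the iterative refinements discussed later in the chapter.

```python
import numpy as np

def multiple_imputation_mean(y, m=20, seed=0):
    """Pool the sample mean over m completed data sets; np.nan marks missing."""
    rng = np.random.default_rng(seed)
    obs = y[~np.isnan(y)]
    n_obs = obs.size
    n_mis = int(np.isnan(y).sum())
    n = n_obs + n_mis
    estimates, variances = [], []
    for _ in range(m):
        # Draw the model parameters from their posterior given observed data.
        sigma2 = (n_obs - 1) * obs.var(ddof=1) / rng.chisquare(n_obs - 1)
        mu = rng.normal(obs.mean(), np.sqrt(sigma2 / n_obs))
        # Draw the missing values given the drawn parameters.
        imputed = rng.normal(mu, np.sqrt(sigma2), n_mis)
        completed = np.concatenate([obs, imputed])
        estimates.append(completed.mean())
        variances.append(completed.var(ddof=1) / n)
    estimates = np.array(estimates)
    within = float(np.mean(variances))        # average within-imputation variance
    between = float(estimates.var(ddof=1))    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return float(estimates.mean()), total, within

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=2000)
y[rng.uniform(size=y.size) < 0.3] = np.nan    # values missing completely at random
est, total_var, within_var = multiple_imputation_mean(y)
print("pooled mean %.3f, pooled standard error %.4f" % (est, np.sqrt(total_var)))
```

The between-imputation component is what allows inference to reflect the uncertainty introduced by imputation, which a single imputed data set would understate.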

This  topic  warrants  a  separate  short  introductory  chapter  as  imputation  is  an  impor¬ 
tant  aspect  of  microeconometric  work.  Survey  data  inevitably  include  missing  data, 
and  the  common  practice  of  listwise  deletion  is  an  imputation  method.  Better  im¬ 
putation  methods  are  available.  An  important  caveat,  however,  is  that  all  imputation 
methods  are  based  on  assumptions  that  in  some  applications  may  be  too  strong. 

Most  of  the  chapter  deals  with  model-based  approaches.  Section  27.2  provides  an 
introduction  to  the  terminology  and  assumptions  that  are  firmly  entrenched  in  the  im¬ 
putation  literature.  Section  27.3  gives  a  brief  treatment  of  missing  data  methods  that 
do  not  use  models.  Section  27.4  begins  with  the  first  of  the  model-based  methods, 
maximum  likelihood.  Section  27.5  considers  the  regression  framework  and  EM-type 
methods  of  imputation.  Sections  27.6  and  27.7  present  approaches  to  imputation  us¬ 
ing  the  Bayesian  concepts  of  data  augmentation  and  MCMC.  Section  27.8  provides 
an  illustrative  example.  Sections  27.6-27.8  provide  a  nice  application  of  the  Bayesian 
methods  of  Chapter  13. 


27.2.  Missing  Data  Assumptions 

Some  of  the  basic  terminology  and  formal  definitions  widely  used  in  the  impu¬ 
tation  literature  are  due  to  Rubin  (1976),  who  introduced  two  key  missing  data 
mechanisms, missing at random and missing completely at random, that serve as useful
benchmarks. 

Rubin’s  setup  involves  Y,  an  N  x  p  matrix  consisting  of  a  complete  data  set, 
which  may  not  be  fully  observed.  Denote  by  Yobs  the  observed  part  and  by  Ymis  the 
nonobserved  (missing)  part.  In  the  context  of  a  regression  model  Y  refers  to  both 
the  regressors  and  the  response  (dependent)  variables.  Therefore,  the  analysis  covers 
the  general  case  of  missing  data.  Let  R  denote  an  N  x  p  matrix  of  indicator  variables 
whose  elements  are  zero  or  one  depending  on  whether  corresponding  values  in  the  Y 
matrix  are  missing  or  observed. 

For regression with a single dependent variable, $\mathbf{Y}$ contains data on the response
variable $\mathbf{y}$ and the $(p - 1)$ regressors $\mathbf{X}$. The probability that $x_{ki}$, the $i$th observation on
variable $x_k$, is missing may be (i) independent of its realized value, (ii) dependent on
its realized value, (iii) dependent on $x_{kj}$, $j \neq i$, or (iv) dependent on $x_{lj}$, $j \neq i$, $l \neq k$.

Assumptions  about  the  structure  of  missingness  follow. 


27.2.1.  Missing  at  Random 

Suppose $x_i$ ($i = 1, \ldots, N$) is an observation on a variable in the data set under study. The missing at random (MAR) assumption is that the "missingness" in $x_i$ does not depend on its value but may depend on the values of $x_j$ ($j \neq i$). Formally,

$$
x_i \text{ is MAR} \;\Rightarrow\; \Pr[x_i \text{ is missing} \mid x_i,\, x_j\ \forall j \neq i] = \Pr[x_i \text{ is missing} \mid x_j\ \forall j \neq i]. \tag{27.1}
$$

After controlling for other observations on $x$, the probability of missingness of $x_i$ is unrelated to the value of $x_i$.

Rubin's (1976) even more formal definition states the following: The MAR assumption implies that the probability model for the indicator variable $R$ does not depend on $Y_{mis}$, that is,

$$
\Pr[R \mid Y_{obs}, Y_{mis}, \psi] = \Pr[R \mid Y_{obs}, \psi],
$$

where $\psi$ is the underlying (vector) parameter of the missingness mechanism.

Under MAR no nonresponse bias is induced in a likelihood-based inference that ignores the missing data mechanism, although the resulting estimates may be inefficient. If the MAR assumption fails, however, the probability of missingness depends on the unobserved missing values. The MAR restriction is not testable because the values of the missing data are unknown. Because MAR is a strong assumption, sensitivity analyses based on different assumptions about missingness are desirable.

A separate issue is whether the pattern of missing data is purely random. In practice, we might expect that observations missing inside clusters of data, in the sense of Chapter 24, may be correlated. However, this issue is not related to that of nonresponse bias resulting from the missingness being connected to data values.


27.2.2. Missing Completely at Random

Missing completely at random (MCAR) is a special case of MAR. It means that $Y_{obs}$ is a simple random sample of all potentially observable data values (Schafer, 1997).

Again suppose $x_i$ is an observation on a variable in the data set under study. Then the data on $x_i$ are said to be MCAR if the probability of missing data on $x_i$ depends neither on its own values nor on the values of other variables in the data set. Formally,

$$
x_i \text{ is MCAR} \;\Rightarrow\; \Pr[x_i \text{ is missing} \mid x_i,\, x_j\ \forall j \neq i] = \Pr[x_i \text{ is missing}]. \tag{27.2}
$$

For  example,  MCAR  is  violated  if  (a)  those  who  do  not  report  income  are  younger,  on 
average,  than  those  who  do  or  if  (b)  typically  small  (large)  values  are  missing. 

For  cases  (i)-(iv)  mentioned  at  the  outset  in  this  section,  case  (i)  satisfies  both 
MCAR  and  MAR,  cases  (iii)  and  (iv)  satisfy  MAR,  and  (ii)  does  not  satisfy  MAR. 

MCAR  implies  that  the  observed  data  are  a  random  subsample  of  the  potential  full 
sample.  If  the  assumptions  were  valid  no  biases  would  result  from  ignoring  incomplete 
observations,  that  is,  observations  with  missing  values. 

The  corollary  is  that  the  failure  of  MCAR  implies  a  sample  selection  type  of  bias. 
MAR  is  a  weaker  assumption  that  still  aids  imputation  as  it  assumes  that  the  missing 
data  mechanism  depends  only  on  observed  quantities. 
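The distinction between the mechanisms can be illustrated with a small simulation. The sketch below is not from the text: the variables `age` and `income`, the logistic missingness probabilities, and all numerical values are assumptions chosen for illustration. Income is made missing under three mechanisms, and the mean of the observed incomes is compared with the full-sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data: age is always observed; income depends on age.
age = rng.normal(40.0, 10.0, n)
income = 20.0 + age + rng.normal(0.0, 10.0, n)


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that income is missing under three assumed mechanisms.
p_missing = {
    "MCAR": np.full(n, 0.3),                           # constant probability
    "MAR": logistic(-(age - 40.0) / 5.0),              # depends on observed age only
    "not MAR": logistic(-(income - 60.0) / 10.0),      # depends on income itself
}
missing = {name: rng.uniform(size=n) < p for name, p in p_missing.items()}

full_mean = income.mean()
observed_mean = {name: income[~mask].mean() for name, mask in missing.items()}

# Under MAR, conditioning on the observed variable (age) removes the bias:
# fit income on age using complete cases, then average predictions over all ages.
obs = ~missing["MAR"]
slope, intercept = np.polyfit(age[obs], income[obs], 1)
adjusted_mar_mean = (intercept + slope * age).mean()

print("full-sample mean:", round(full_mean, 2))
for name, value in observed_mean.items():
    print(f"observed mean under {name}:", round(value, 2))
print("regression-adjusted mean under MAR:", round(adjusted_mar_mean, 2))
```

In this design the observed mean is close to the full-sample mean only under MCAR. Under MAR the naive observed mean is biased, but an adjustment that uses only observed quantities recovers the full-sample mean, which is the sense in which MAR aids imputation.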


27.2.3.  Ignorable  and  Nonignorable  Missingness 

A missing data mechanism is said to be ignorable if (a) the data set is MAR and (b) the parameters for the missing data-generating process, $\psi$, are unrelated to the parameters $\theta$ that we want to estimate.

This condition, which is similar to that of weak exogeneity discussed in Chapter 2, implies that the parameters $\theta$ of the model are distinct from parameters $\psi$ of the missingness mechanism. Thus, if the missing data are ignorable, then there is no need to model the dgp for missing data as an essential part of the modeling exercise. MAR and "ignorability" are often treated as equivalent under the assumption that condition (b) for ignorability is almost always satisfied (Allison, 2002).

A nonignorable missing data mechanism arises if the MAR assumption is violated for $(y, x)$, but the mechanism remains ignorable if MAR is violated only for $x$. In the nonignorable case the dgp for missing data must be modeled along with the overall model to obtain consistent estimates of the parameters $\theta$. To avoid the possibility of selection bias, estimators such as Heckman's two-stage procedure (see Chapter 16) must be used.

The  imputation  literature  focuses  on  ignorable  missingness.  If  additionally  the  data 
set  is  MCAR  then  missing  data  cause  no  problem,  aside  from  efficiency  loss  that  might 
be  reduced  by  imputation.  If  instead  the  data  set  is  only  MAR  then  imputation  methods 
may  be  needed  to  ensure  consistency,  as  well  as  to  increase  efficiency. 


27.3. Handling Missing Data without Models

If  no  models  are  to  be  used,  then  one  can  simply  analyze  the  available  data  or  one  can 
analyze  data  after  non-model-based  imputation. 

27.3.1.  Using  Available  Data  Only 

Listwise  deletion  or  complete  case  analysis  means  the  deletion  of  the  observations 
(cases)  that  have  missing  values  on  one  or  more  of  the  variables  in  the  data  set.  Under 
the  MCAR  assumption,  the  remaining  sample  after  listwise  deletion  remains  a  random 
sample  from  the  original  population;  therefore  the  estimates  based  on  it  are  consistent. 
However,  the  standard  errors  will  be  inflated  because  less  information  is  used.  If  the 
number  of  regressors  is  large,  then  the  total  effect  of  listwise  deletion  can  lead  to  very 
substantial  reduction  in  the  total  number  of  observations.  This  might  encourage  one  to 
leave  out  of  the  analysis  variables  with  a  high  proportion  of  missing  observations,  but 
the  results  generated  by  such  practice  are  potentially  misleading. 

If  MCAR  is  not  satisfied  and  the  missing  data  are  only  MAR,  then  the  estimates  will 
be  biased.  Thus  listwise  deletion  is  not  robust  to  the  violations  of  MCAR.  However, 
listwise  deletion  is  robust  to  the  violations  of  MAR  among  the  independent  variables 
(regressors)  in  regression  analysis,  that  is,  when  the  probability  of  missing  data  on  any 
regressor  does  not  depend  on  the  values  of  the  dependent  variable.  Briefly,  listwise 
deletion  is  acceptable  if  incomplete  cases  attributable  to  missing  data  comprise  a  small 
percentage,  say  5%  or  less,  of  the  number  of  total  cases  (Schafer,  1996).  It  is  important 
that  the  sample  after  listwise  deletion  is  representative  of  the  population  under  study. 

Pairwise deletion or available-case analysis is often considered a better method than listwise deletion. The idea here is to use all possible pairs of observations $(x_{1i}, x_{2i})$ in estimating joint sample moments of $(x_1, x_2)$ and to use all observations on an individual variable in estimating marginal moments. Thus, in a linear regression, under pairwise deletion we would estimate $(X'X)$ and $(X'y)$ using all possible pairs of regressors, whereas under listwise deletion we would estimate the same after deleting all cases with any missing observations. It is clear that we lose less information under pairwise deletion. The proposal here is to use maximum information to estimate individual summary statistics such as means and covariances and then to use these summary statistics to compute the regression estimates.

There are two important limitations of pairwise deletion: (1) Conventionally estimated standard errors and test statistics are biased and (2) the resulting regressor covariance matrix $(X'X)$ may not be positive definite.
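Limitation (2) can be seen in a deliberately extreme, hypothetical example. In the sketch below each pair of variables is observed on a different block of rows, so no row is complete; the helper `pairwise_cov` is an illustrative implementation of available-case covariances, not a library routine.

```python
import numpy as np

nan = np.nan
# Hypothetical data: each pair of variables is jointly observed on a different block of rows.
X = np.array(
    [[1, 1, nan], [2, 2, nan], [3, 3, nan], [4, 4, nan],   # x1, x2 move together
     [nan, 1, 1], [nan, 2, 2], [nan, 3, 3], [nan, 4, 4],   # x2, x3 move together
     [1, nan, 4], [2, nan, 3], [3, nan, 2], [4, nan, 1]],  # x1, x3 move in opposite directions
    dtype=float,
)


def pairwise_cov(X):
    """Covariance matrix in which entry (j, k) uses all rows where both j and k are observed."""
    p = X.shape[1]
    C = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            xj, xk = X[both, j], X[both, k]
            C[j, k] = ((xj - xj.mean()) * (xk - xk.mean())).sum() / (both.sum() - 1)
    return C


C_pair = pairwise_cov(X)
min_eig = np.linalg.eigvalsh(C_pair).min()
n_complete = int((~np.isnan(X).any(axis=1)).sum())

print(C_pair.round(3))
print("smallest eigenvalue:", round(min_eig, 3))
print("rows surviving listwise deletion:", n_complete)
```

The pairwise matrix has a negative eigenvalue, so it is not a valid covariance matrix, while listwise deletion leaves no usable rows at all in this example.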

27.3.2.  Imputation  without  Models 

There  are  a  number  of  ad  hoc  or  weakly  justified  procedures  often  implemented  in 
statistical  software. 

Mean imputation or mean substitution involves replacing missing observations by the average of the available values. It is mean-preserving but will affect the marginal distribution of the data: the probability mass in the center of the marginal distribution will increase. It will also affect the covariances and correlations with other variables.

Simple hot deck imputation involves replacement of the missing value by a randomly drawn value from the available observed values of that variable, somewhat like a bootstrap procedure. It preserves the marginal distribution of the variable, but it distorts the covariances and correlations between variables.

In a regression setting neither of these two well-known approaches is attractive despite their simplicity.
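The effects of the two procedures on marginal and joint moments can be checked numerically. The following sketch uses hypothetical bivariate normal data with correlation 0.6 and 30% of `x2` missing completely at random; all of these settings are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.6

# Hypothetical standard bivariate normal data with correlation rho.
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

# 30% of x2 is made missing completely at random.
missing = rng.uniform(size=n) < 0.3
x2_obs = x2[~missing]

# Mean imputation: fill every gap with the observed mean.
x2_mean = x2.copy()
x2_mean[missing] = x2_obs.mean()

# Simple hot deck: fill every gap with a random draw from the observed values.
x2_hot = x2.copy()
x2_hot[missing] = rng.choice(x2_obs, size=missing.sum(), replace=True)

var_true, var_mean, var_hot = x2.var(), x2_mean.var(), x2_hot.var()
corr_true = np.corrcoef(x1, x2)[0, 1]
corr_mean = np.corrcoef(x1, x2_mean)[0, 1]
corr_hot = np.corrcoef(x1, x2_hot)[0, 1]

print("variance of x2 (true, mean-imputed, hot deck):",
      round(var_true, 3), round(var_mean, 3), round(var_hot, 3))
print("corr(x1, x2) (true, mean-imputed, hot deck):",
      round(corr_true, 3), round(corr_mean, 3), round(corr_hot, 3))
```

Mean imputation shrinks the variance of `x2`, hot deck roughly preserves it, and both attenuate the correlation with `x1`, which is why neither is attractive when the object of interest is a regression coefficient.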


27.4.  Observed-Data  Likelihood 

The modern approach to missing data is to impute values for missing observations by making single or multiple draws from the estimated distribution based on the postulated observed data model and the model for the missing data mechanism. The Bayesian variants of this procedure make the draws from the posterior distribution, which uses both the likelihood and the prior distribution of the parameters.

The  first  important  issue  involves  the  role  played  by  the  missing  data  mechanism 
in  the  imputation  procedure  and  especially  whether  the  missing  data  mechanism  is 
ignorable. 

Let $\theta$ denote the parameters of the dgp for $Y = (Y_{obs}, Y_{mis})$ and let $\psi$ denote the parameters of the missing data mechanism. For convenience of notation it is assumed that $(Y_{obs}, Y_{mis})$ are continuous variables. Then the joint distribution of $(R, Y_{obs})$ is given by

$$
\begin{aligned}
\Pr[R, Y_{obs} \mid \theta, \psi] &= \int \Pr[R, Y_{obs}, Y_{mis} \mid \theta, \psi]\, dY_{mis} \\
&= \int \Pr[R \mid Y_{obs}, Y_{mis}, \psi]\, \Pr[Y_{obs}, Y_{mis} \mid \theta]\, dY_{mis} \\
&= \Pr[R \mid Y_{obs}, \psi] \int \Pr[Y_{obs}, Y_{mis} \mid \theta]\, dY_{mis} \\
&= \Pr[R \mid Y_{obs}, \psi]\, \Pr[Y_{obs} \mid \theta].
\end{aligned} \tag{27.3}
$$

The first equality derives the joint probability of $(R, Y_{obs})$ by integrating out (or averaging over) $Y_{mis}$ from the joint probability of all data and $R$. The second line factors the joint probability into conditional and marginal components, the conditioning being with respect to $Y_{obs}$ and $Y_{mis}$. The third line separates the missing data mechanism from the observed data mechanism; this step is justified by the MAR assumption. The last line means that $\theta$ and $\psi$ are distinct parameters and hence inference about $\theta$ can ignore the missing data mechanism and depends on $Y_{obs}$ alone.

The observed-data likelihood is proportional to the last factor in the fourth line:

$$
L[\theta \mid Y_{obs}] \propto \Pr[Y_{obs} \mid \theta]. \tag{27.4}
$$

It involves only the observed data $Y_{obs}$ even though the parameters $\theta$ appear in the dgp for all observations (observed and missing). As in Chapter 13, the constant of proportionality does not appear in (27.4).




Under the MAR assumption the joint posterior probability of $(\theta, \psi)$ is written as the product of $\Pr[R, Y_{obs} \mid \theta, \psi]$ and the joint prior distribution $\pi(\theta, \psi)$ as follows:

$$
\begin{aligned}
\Pr[\theta, \psi \mid Y_{obs}, R] &= k \Pr[R, Y_{obs} \mid \theta, \psi]\, \pi(\theta, \psi) \\
&\propto \Pr[R \mid Y_{obs}, \psi]\, \Pr[Y_{obs} \mid \theta]\, \pi(\theta, \psi) \\
&\propto \Pr[R \mid Y_{obs}, \psi]\, \Pr[Y_{obs} \mid \theta]\, \pi_\theta(\theta)\, \pi_\psi(\psi),
\end{aligned} \tag{27.5}
$$

where $k$ in the first line is a constant of proportionality free of $(\theta, \psi)$. The second line uses the factorization given in (27.3), and the third line uses the assumption of independent priors for $\theta$ and $\psi$.

As our main interest is in $\theta$, we derive the marginal posterior for $\theta$ by integrating out $\psi$ from the joint posterior. This yields the observed-data posterior

$$
\begin{aligned}
\Pr[\theta \mid Y_{obs}, R] &= \int \Pr[\theta, \psi \mid Y_{obs}, R]\, d\psi \\
&\propto \Pr[Y_{obs} \mid \theta]\, \pi_\theta(\theta) \int \Pr[R \mid Y_{obs}, \psi]\, \pi_\psi(\psi)\, d\psi \\
&\propto L[\theta \mid Y_{obs}]\, \pi_\theta(\theta),
\end{aligned} \tag{27.6}
$$

where the second line separates $\theta$ and $\psi$, and the last line absorbs the integral expression into the constant of proportionality. Therefore, the last line does not involve $\psi$ and is independent of the missing data mechanism $R$.


27.5.  Regression-Based  Imputation 

In this section we consider a least-squares based imputation. The key component is use of the EM algorithm, previously introduced and discussed in Section 10.3.7.

The EM algorithm consists of the expectation step and the maximization step. The structure of the EM algorithm is closely related to Bayesian MCMC and data augmentation methods. Therefore, rather than providing a fully operational method for handling missing data, we will introduce an example that brings out the motivation behind modern multiple imputation techniques and suggests the major features of such an approach.


27.5.1.  Linear  Regression  Example  with  Missing  Data 
on  a  Dependent  Variable 

In practice one can have missing observations on dependent (endogenous) variables and/or explanatory variables. We consider a regression example that has missing data on the dependent variable, with

$$
\begin{bmatrix} y_1 \\ y_{mis} \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \beta + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \tag{27.7}
$$

where $E[u \mid X] = 0$ and $E[uu' \mid X] = \sigma^2 I_N$. The complication is that a block of observations on the dependent variable $y$, denoted $y_{mis}$, is missing. We assume that the available complete observations are a random sample from the population, so that the missing data are assumed to be MAR though not MCAR.

Given the MAR assumption and $N_1 > K$, the first block of $N_1$ observations can be used to consistently estimate the $K$-dimensional parameter $\beta$ and $\sigma^2$. The maximum likelihood estimates of $(\beta, \sigma^2)$ under Gaussian errors are $\hat\beta = [X_1'X_1]^{-1}X_1'y_1$ and $s^2 = (y_1 - X_1\hat\beta)'(y_1 - X_1\hat\beta)/N_1$. By standard theory, and under the normality assumption, $\hat\beta \mid \text{data} \sim \mathcal{N}[\beta, \sigma^2[X_1'X_1]^{-1}]$ and $N_1 s^2/\sigma^2 \mid \hat\beta \sim \chi^2_{N_1-K}$.

First, consider a naive single-imputation procedure for generating the missing observations. Conditional on $X_2$, the predicted values of $y_{mis}$, denoted $\hat y_{mis}$, are given by $X_2\hat\beta$, where $\hat\beta$ is the preceding estimate obtained using only the first $N_1$ observations. Then

$$
\hat E[y_{mis} \mid X_2] = \hat y_{mis} = X_2\hat\beta, \tag{27.8}
$$

$$
\hat V[\hat y_{mis}] \equiv \hat V[\hat y \mid X_2] = s^2\left(I_{N_2} + X_2[X_1'X_1]^{-1}X_2'\right),
$$

where $s^2 I_{N_2}$ is an estimate of $V[u_2]$.

In the naive method one would generate the $N_2$ predicted values of $y_{mis}$, and then apply standard regression methods to the full sample of $N = N_1 + N_2$ observations.

The  two  steps  in  the  naive  method  correspond  to  the  two  steps  of  the  EM  algorithm. 
The  prediction  step  is  the  E-step,  and  the  second-step  application  of  least  squares  to 
the  augmented  sample  is  the  M-step. 

However, this solution has flaws. First, consider the data augmentation step. Because the generated values $\hat y_{mis}$ lie exactly on the least-squares fitted plane, the addition of $(\hat y_{mis}, X_2)$ to the sample to produce a new estimate, $\hat\beta_A$, does not change the previous estimate $\hat\beta$:

$$
\begin{aligned}
\hat\beta_A &= [X_1'X_1 + X_2'X_2]^{-1}[X_1'y_1 + X_2'\hat y_{mis}] \\
&= [X_1'X_1 + X_2'X_2]^{-1}[X_1'X_1\hat\beta + X_2'X_2\hat\beta] \\
&= \hat\beta.
\end{aligned}
$$

Second, the estimate of $\sigma^2$ obtained by applying the standard formula to the residuals from the augmented sample is too small because the additional $N_2$ residuals are zero by construction:

$$
\begin{aligned}
s_A^2 &= (y - X\hat\beta_A)'(y - X\hat\beta_A)/N \\
&= (y_1 - X_1\hat\beta)'(y_1 - X_1\hat\beta)/N < s^2,
\end{aligned} \tag{27.9}
$$

where $s^2$ correctly divides by $N_1$ rather than $N$.
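Both flaws are easy to confirm numerically. The sketch below uses hypothetical simulated data (sample sizes, coefficients, and the seed are assumptions) and follows the naive two-step procedure: predict the missing block from the complete-case fit, then refit on the augmented sample.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, K = 80, 40, 3  # hypothetical block sizes and number of regressors

# Regressors for the complete block (X1) and for the block with y missing (X2).
X1 = np.column_stack([np.ones(N1), rng.normal(size=(N1, K - 1))])
X2 = np.column_stack([np.ones(N2), rng.normal(size=(N2, K - 1))])
y1 = X1 @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=N1)

# Complete-case least squares and the variance estimate s^2 (divisor N1).
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y1)
resid1 = y1 - X1 @ beta_hat
s2 = resid1 @ resid1 / N1

# Naive imputation: fill y_mis with fitted values, then refit on all N rows.
y_mis_hat = X2 @ beta_hat
X = np.vstack([X1, X2])
y = np.concatenate([y1, y_mis_hat])
beta_A = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_A
s2_A = resid @ resid / (N1 + N2)

print("beta_hat:", beta_hat.round(4))
print("beta_A:  ", beta_A.round(4))
print("s2, s2_A, s2 * N1 / N:", round(s2, 4), round(s2_A, 4), round(s2 * N1 / (N1 + N2), 4))
```

The refitted coefficients coincide with the complete-case estimates, and the augmented-sample variance estimate equals $s^2 N_1/N$, so it understates $s^2$ by exactly the share of imputed observations.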

Finally, as can be seen from the expression for the sampling variance of $\hat y_{mis}$, the generated predictions are heteroskedastic, unlike the $y_1$, and hence the variance of $\hat\beta_A$ cannot be estimated using the least-squares formula in the usual way. The observations $\hat y_{mis}$ are draws from a distribution with a different variance. The naive method does not make allowance for the uncertainty attached to the estimates of $y_{mis}$.

To fix these problems modifications are needed. First, the estimation of $y_{mis}$ should take account of uncertainty regarding $\hat\beta$. This may be done by adjusting $\hat y_{mis}$ and adding some "noise" to the generated predictions such that the estimates of missing data more closely mimic a draw from the (estimated or conditional) distribution of $y_1$. A standardization step can use the fact that an estimate of $V[\hat y_{mis}]$, $\hat V$, is available from (27.8). Hence the components of the transformed variable $\hat V^{-1/2}\hat y_{mis}$ have unit variance. To mimic the distribution of $y_1$, we can make a Monte Carlo draw from the $\mathcal{N}[0, s^2]$ distribution and multiply it by $\hat V^{-1/2}\hat y_{mis}$.

The  revised  algorithm  is  as  follows. 

1. Estimate $\beta$ using the $N_1$ complete observations as before.

2. Generate $\hat y_{mis} = X_2\hat\beta$.

3. Generate adjusted values $\hat y^a_{mis} = (\hat V^{-1/2}\hat y_{mis}) \odot u_m$ of $\hat y_{mis}$, where $u_m$ is a Monte Carlo draw from the $\mathcal{N}[0, s^2]$ distribution and $\odot$ denotes element-by-element multiplication.

4. Using the augmented sample obtain a revised estimate of $\beta$.

5. Repeat steps 1-4 where in step 1 the revised estimate of $\beta$ is used.

The  revised  algorithm,  also  an  EM-type  algorithm,  continues  until  it  converges  in 
the  sense  that  the  changes  in  the  coefficients  or  the  changes  in  regression  residual  sum 
of  squares  become  arbitrarily  small. 

To make connection with later discussion we give the algorithm a different interpretation. Step 3 is a draw from the conditional distribution of $y$ given $\beta$, and step 4 is a draw from the conditional distribution of $\beta$ given $s^2$, $X$. The approach may be refined further by adding a step that involves a draw from the distribution of $s^2$. We do not go through all the steps of this approach because they will become clearer in our later discussion of imputation.

Alternative models for missing data on the dependent variable were presented in Chapter 16. These relaxed the MAR assumption and specified nonignorable missingness. Then the preceding EM algorithm leads to inconsistent estimation of $\beta$. The censored Tobit model specifies that data are missing for observations with $x'\beta + u < 0$ and a consistent estimator is the Tobit MLE (see Section 16.3). Amemiya (1985, pp. 376-378) details the EM algorithm for the Tobit model.


27.6.  Data  Augmentation  and  MCMC 

The  general  structure  of  the  Bayesian  approach  to  missing  data  is  to  use  the  following 
type  of  iterative  algorithm  that  uses  imputation  and  prediction  steps. 

The imputation step (I-step) makes a draw from the conditional predictive distribution of $Y_{mis}$. Given an $r$th round estimate $\theta^{(r)}$,

$$
Y_{mis}^{(r+1)} \sim \Pr[Y_{mis} \mid Y_{obs}, \theta^{(r)}]. \tag{27.10}
$$

This expression denotes a random draw of $Y_{mis}^{(r+1)}$ from the predictive conditional distribution of $Y_{mis}$ given the current estimate $\theta^{(r)}$ and the observed data $Y_{obs}$. Notice that $Y_{mis}$ is in general a matrix so that this notation refers to (in principle) a series of draws.

The prediction step (P-step) is executed by making a draw from the complete data posterior

$$
\theta^{(r+1)} \sim \Pr[\theta \mid Y_{obs}, Y_{mis}^{(r+1)}]. \tag{27.11}
$$

That is, $Y_{obs}$ is augmented by an imputed value $Y_{mis}^{(r+1)}$ drawn from the predictive distribution of $Y_{mis}$, and a draw is made from the posterior distribution of $\theta$. The steps (27.10) and (27.11) can then be repeated.

Sequential sampling from the two distributions generates a Markov chain. This process, which strongly resembles the EM algorithm, is essentially the Gibbs sampler of Section 13.5.2, but in the missing data literature it is referred to as data augmentation. Under appropriate conditions, and by a theorem cited in Section 13.5.1, the sequential draws will converge to a stationary distribution for a sufficiently large value of $r$, which is the length of the chain. When the chain is terminated we have one imputation of $Y_{mis}$. Then we can regard $\theta^{(r)}$ as an approximate draw from $\Pr[\theta \mid Y_{obs}]$ and $Y_{mis}^{(r+1)}$ as an approximate draw from $\Pr[Y_{mis} \mid Y_{obs}]$. As with any MCMC application the chain has to run sufficiently long to ensure that successive imputations are free of statistical dependence. These issues have been discussed in Chapter 13.

After convergence we would have accomplished the joint objectives of imputing the missing values based on the model specified for the data and estimating the model using both observed and imputed values. Postconvergence we would have the data necessary to compute the posterior moments of $\theta$ and any interesting functions of $\theta$ and $Y$ using the ideas discussed in Chapter 13.

As a specific illustration of this procedure we reconsider the missing data regression example of the previous section. The steps in the MCMC algorithm are as follows:

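1. Using observed data calculate $\hat\beta = [X_1'X_1]^{-1}X_1'y_1$ and $\hat u = (y_1 - X_1\hat\beta)$.

2. Generate $\sigma^2$ as $\hat u'\hat u$ divided by a draw from the $\chi^2_{N_1-K}$ distribution.

3. Draw $\beta \mid \sigma^2 \sim \mathcal{N}[\hat\beta, \sigma^2[X_1'X_1]^{-1}]$.

4. Draw $y_{mis} \sim \mathcal{N}[X_2\beta, \sigma^2]$.

5. Using $y$ instead of $y_1$, and $X$ instead of $X_1$, repeat steps 1-4 after appropriate adjustments.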

The justification for step 2 is that, under an uninformative prior for $(\beta, \sigma^2)$, the conditional posterior distribution of $\hat u'\hat u/\sigma^2$ is $\chi^2_{N_1-K}$ if only the observed data are used. After data augmentation this changes to $\chi^2_{N-K}$. The justification for step 3 is that, under an uninformative prior, the conditional posterior distribution is $\mathcal{N}[\hat\beta, \sigma^2[X_1'X_1]^{-1}]$. After data augmentation this changes to $\mathcal{N}[\hat\beta, \sigma^2[X'X]^{-1}]$. Step 4 is the imputation step using the conditional predictive density $\mathcal{N}[X_2\beta, \sigma^2]$. These steps can be appropriately modified if we use, for example, an informative normal-gamma prior for $(\beta, \sigma^2)$. The conditional posterior distributions for this case are given in Section 13.3.
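The five steps can be sketched in code as follows. This is an illustrative implementation under the uninformative prior described above, not a reproduction of any particular software routine; the function name `data_augmentation`, the simulated data, the chain length, and the seeds are all assumptions.

```python
import numpy as np


def data_augmentation(y1, X1, X2, n_iter=2000, seed=0):
    """Data augmentation for y = X beta + u when y is missing for the X2 block."""
    rng = np.random.default_rng(seed)
    K = X1.shape[1]
    N2 = X2.shape[0]
    X = np.vstack([X1, X2])
    y_cur, X_cur = y1, X1  # the first pass uses the observed block only
    beta_draws = np.empty((n_iter, K))
    sigma2_draws = np.empty(n_iter)
    for r in range(n_iter):
        # Step 1: least squares on the current (observed or augmented) sample.
        XtX_inv = np.linalg.inv(X_cur.T @ X_cur)
        beta_hat = XtX_inv @ X_cur.T @ y_cur
        u_hat = y_cur - X_cur @ beta_hat
        # Step 2: sigma^2 = u'u / chi-square draw (N1 - K df first, N - K afterwards).
        sigma2 = (u_hat @ u_hat) / rng.chisquare(X_cur.shape[0] - K)
        # Step 3: beta | sigma^2 ~ N[beta_hat, sigma^2 (X'X)^{-1}] via a Cholesky factor.
        L = np.linalg.cholesky(XtX_inv)
        beta = beta_hat + np.sqrt(sigma2) * (L @ rng.normal(size=K))
        # Step 4: impute y_mis ~ N[X2 beta, sigma^2 I].
        y_mis = X2 @ beta + np.sqrt(sigma2) * rng.normal(size=N2)
        # Step 5: augment the sample and repeat.
        y_cur, X_cur = np.concatenate([y1, y_mis]), X
        beta_draws[r], sigma2_draws[r] = beta, sigma2
    return beta_draws, sigma2_draws, y_mis


# Hypothetical simulated data: 200 complete cases and 100 cases with y missing.
rng = np.random.default_rng(42)
N1, N2 = 200, 100
X1 = np.column_stack([np.ones(N1), rng.normal(size=(N1, 2))])
X2 = np.column_stack([np.ones(N2), rng.normal(size=(N2, 2))])
y1 = X1 @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=N1)

beta_draws, sigma2_draws, y_mis_last = data_augmentation(y1, X1, X2)
beta_obs = np.linalg.solve(X1.T @ X1, X1.T @ y1)
print("complete-case OLS:     ", beta_obs.round(3))
print("posterior mean of beta:", beta_draws.mean(axis=0).round(3))
```

With data missing only on the dependent variable, the imputed block carries no extra information about $\beta$, so the posterior mean of the draws should stay close to the complete-case least-squares estimate. The value returned as `y_mis_last` is the single imputation available when the chain is stopped.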


27.7. Multiple Imputation

The analysis of the preceding section explains how a full MCMC run will generate a single imputation. However, a single imputation does not adequately handle the missing-data uncertainty. This is the essential rationale for using a multiple imputation procedure. The predictive distribution of $Y_{mis}$ given $Y_{obs}$ is obtained by averaging the conditional predictive distribution of $Y_{mis} \mid Y_{obs}, \theta$ over the observed-data posterior of $\theta$:

$$
\Pr[Y_{mis} \mid Y_{obs}] = \int \Pr[Y_{mis} \mid Y_{obs}, \theta]\, \Pr[\theta \mid Y_{obs}]\, d\theta.
$$

Proper multiple imputations from a Bayesian viewpoint reflect uncertainty about $Y_{mis}$, given the uncertainty about parameters of the model.

After multiple imputation the missing data $Y_{mis}$ are replaced by simulated/imputed values $Y_{mis}^{(1)}, Y_{mis}^{(2)}, Y_{mis}^{(3)}, \ldots, Y_{mis}^{(m)}$. Each of the completed data sets is then analyzed as if it were complete. The results from the $m$ analyses will show variation that reflects the uncertainty resulting from the missing data. With $m$ different data sets questions arise about how one should determine an appropriate value for $m$ and how the $m$ sets of parameter estimates and covariance matrices should be combined. We address both of these questions using results from the literature but without providing detailed justification.

In considering how to combine the results based on multiply imputed data the key result, stated for an arbitrary statistic $Q$, is

$$
\Pr[Q \mid Y_{obs}] = \int \Pr[Q \mid Y_{mis}, Y_{obs}]\, \Pr[Y_{mis} \mid Y_{obs}]\, dY_{mis}, \tag{27.12}
$$

which states that the actual posterior distribution of $Q$ is obtained by averaging over the complete-data posterior distribution of $Q$. This means averaging over the results of multiple imputations of missing observations (Rubin, 1996).

Equation (27.12) implies that the final estimate of $Q$ is given by the law of iterated expectations,

$$
E[Q \mid Y_{obs}] = E\big[E[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}\big]. \tag{27.13}
$$

The posterior mean of $Q$ is the average of $\hat Q_r$ using complete data after repeated imputation of missing data.

The final variance of $Q$ is given by the formula

$$
V[Q \mid Y_{obs}] = E\big[V[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}\big] + V\big[E[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}\big], \tag{27.14}
$$

using the variance decomposition formula given in Section A.8.

Rubin (1996) also gives the following rules for combining moment information, stated in terms of a scalar parameter. For an arbitrary scalar parameter, suppose $\hat Q_r$ is a point estimate at the $r$th imputation and $U_r$ is a variance estimate. Then define the averages of the point and variance estimate, respectively, as

$$
\bar Q = m^{-1}\sum_{r=1}^{m} \hat Q_r, \tag{27.15}
$$

$$
\bar U = m^{-1}\sum_{r=1}^{m} U_r, \tag{27.16}
$$

and the between-imputation variance as

$$
B = (m-1)^{-1}\sum_{r=1}^{m} (\hat Q_r - \bar Q)^2, \tag{27.17}
$$

and the total variance as

$$
T = \bar U + (1 + m^{-1})B. \tag{27.18}
$$

Table 27.1. Relative Efficiency of Multiple Imputation

Number of            Observations Missing ($\lambda$)
Imputations (m)      10%       30%       50%
3                    0.967     0.909     0.857
10                   0.990     0.970     0.952
20                   0.995     0.985     0.975

The results (27.15, 27.16) follow from (27.13); Equation (27.18) follows from (27.14). Schafer (1997) gives results for combining $p$-values and likelihood ratio statistics and provides additional references.
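A minimal sketch of the combining rules (27.15)-(27.18) for a scalar parameter follows. The function name `combine_imputations` and the five point and variance estimates are hypothetical values used only to show the arithmetic.

```python
import numpy as np


def combine_imputations(Q, U):
    """Combine m point estimates Q_r and variance estimates U_r for a scalar parameter."""
    Q = np.asarray(Q, dtype=float)
    U = np.asarray(U, dtype=float)
    m = Q.size
    Q_bar = Q.mean()               # (27.15): average point estimate
    U_bar = U.mean()               # (27.16): average within-imputation variance
    B = Q.var(ddof=1)              # (27.17): between-imputation variance, divisor m - 1
    T = U_bar + (1 + 1 / m) * B    # (27.18): total variance
    return Q_bar, U_bar, B, T


# Hypothetical results from m = 5 imputed data sets.
Q_hat = [1.04, 1.05, 1.04, 1.06, 1.01]
U_hat = [0.020, 0.021, 0.020, 0.022, 0.019]

Q_bar, U_bar, B, T = combine_imputations(Q_hat, U_hat)
print("Q_bar, U_bar, B, T:", Q_bar, U_bar, B, T)
print("combined standard error:", np.sqrt(T))
```

For these inputs the between-imputation component adds only a small amount to the average within-imputation variance, which is typical when the imputed estimates agree closely.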

Postimputation  inference  regarding  individual  coefficients  or  subsets  of  coefficients 
can  be  carried  out  using  the  final  estimates,  since  the  standard  central  limit  theorem 
and  the  associated  large-sample  results  can  be  extended  to  cover  this  case. 

The following is a measure of the relative efficiency of $m$ multiple imputations:

$$
\text{reff} = (1 + \lambda/m)^{-1}, \tag{27.19}
$$

where $\lambda$ is the fraction of missing observations. Efficiency is measured relative to no missing data. The arithmetical calculations in Table 27.1 show that with as few as three imputations the efficiency can be as high as 97% with 10% missing data, and 86% with 50% missing data. With 10 or more imputations the relative efficiency exceeds 95% with 50% missing data. Thus, as emphasized by Schafer (1997), the number of imputations need not be very high.
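The entries of Table 27.1 can be reproduced directly from (27.19). In the sketch below the function name `relative_efficiency` is an illustrative choice, and values are truncated (not rounded) to three decimals, which appears to be how the tabulated figures were obtained.

```python
import math


def relative_efficiency(lam, m):
    """Relative efficiency (27.19) for missing fraction lam and m imputations."""
    return 1.0 / (1.0 + lam / m)


fractions = (0.10, 0.30, 0.50)
table = {
    m: [math.floor(relative_efficiency(lam, m) * 1000) / 1000 for lam in fractions]
    for m in (3, 10, 20)
}

for m, row in table.items():
    print(m, row)
```

Because the efficiency depends on $\lambda/m$ only, doubling the missing fraction can be offset by doubling the number of imputations.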


27.8.  Missing  Data  MCMC  Imputation  Example 

This section gives two illustrative applications of missing data imputation: the model-free methods of listwise deletion and mean imputation (see Section 27.3), and the model-based method of data augmentation using the MCMC algorithm (see Section 27.6). Only data on regressors are missing and the missing mechanism is MAR.

The first application involves simple multiple regression, and the second involves a logit regression. For clarity and simplicity we use artificially generated data with a known dgp.

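27.8.1. Linear Regression with Missing Data on Regressors

For the linear regression example the dgp is

$$
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i, \quad i = 1, 2, \ldots, N, \tag{27.20}
$$

with $u_i \mid x_{1i}, x_{2i} \sim \mathcal{N}[0, \sigma^2]$ and $(x_{1i}, x_{2i})$ bivariate normally distributed with

$$
\begin{bmatrix} x_{1i} \\ x_{2i} \end{bmatrix} \sim \mathcal{N}\left[ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right],
$$

so that $x_{2i} \mid x_{1i} \sim \mathcal{N}[\rho x_{1i}, 1 - \rho^2]$. Also, we set $\beta' = [1\ 1\ 1]$, $N = 1{,}000$, and the proportion of randomly missing data on $x_1$ and $x_2$ to either 10% or 25%. For any $i$, either $x_1$ or $x_2$, or both, may be missing. We also use two different values of $\rho$, 0.36 and 0.64.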
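A minimal sketch of this design, covering only the model-free comparisons, is given below. The error standard deviation `sigma`, the seed, and the helper `ols` are assumptions, since the text does not report them; the MCMC imputation step is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(123)
N, rho, miss_rate = 1000, 0.64, 0.25
sigma = 3.0  # assumed value; the error variance is not stated in the text
beta = np.array([1.0, 1.0, 1.0])

# Generate (x1, x2) bivariate normal with correlation rho, then y from (27.20).
x1 = rng.normal(size=N)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=N)
y = beta[0] + beta[1] * x1 + beta[2] * x2 + sigma * rng.normal(size=N)

# Randomly delete each regressor independently with probability miss_rate.
m1 = rng.uniform(size=N) < miss_rate
m2 = rng.uniform(size=N) < miss_rate
x1_obs = np.where(m1, np.nan, x1)
x2_obs = np.where(m2, np.nan, x2)


def ols(y, X):
    """Return OLS coefficients and conventional standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(s2 * np.diag(XtX_inv))


# (a) Full sample with no data missing.
b_full, se_full = ols(y, np.column_stack([np.ones(N), x1, x2]))

# (b) Listwise deletion: keep rows where both regressors are observed.
complete = ~(m1 | m2)
n_complete = int(complete.sum())
b_lw, se_lw = ols(y[complete], np.column_stack([np.ones(n_complete), x1[complete], x2[complete]]))

# (c) Mean imputation: replace each missing value by the observed mean of that regressor.
x1_imp = np.where(m1, np.nanmean(x1_obs), x1_obs)
x2_imp = np.where(m2, np.nanmean(x2_obs), x2_obs)
b_mi, se_mi = ols(y, np.column_stack([np.ones(N), x1_imp, x2_imp]))

for name, b, se in [("full", b_full, se_full), ("listwise", b_lw, se_lw), ("mean impute", b_mi, se_mi)]:
    print(f"{name:12s}", b.round(3), se.round(3))
```

Listwise deletion discards every row with at least one missing regressor, so its standard errors exceed those from the full sample even though its point estimates remain centered on the true coefficients.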

For the Markov chain we use 500 iterations for the burn-in phase. The Markov chain calculations are implemented using the SAS Proc MI algorithm, which uses an uninformative prior. For demonstration purposes only, the number of imputations is fixed at 10 but the length of the chain after the burn-in phase varies from 10 to 10,000. Proc MI combines the results from multiple imputations using Equations (27.15)-(27.18).

Tables 27.2 and 27.3 present results for high $\rho$ and low and high rates of missing data. There are no dramatic differences among methods. Because the MAR assumption applies, point estimates from listwise deletion and the full sample remain close, but as expected the standard errors are larger under listwise deletion. Under mean imputation the point estimate of $\beta_2$ diverges relatively more, but the observed variation is well within the bounds of sampling error. It appears that in both cases the Markov chain attains stationarity rather rapidly, there being very little difference between the results with 10 and 10,000 iterations. This is probably due to having set the number of burn-in iterations at 500, which may be higher than needed for this relatively simple case.


Table 27.2. Missing Data Imputation: Linear Regression Estimates with 10% Missing Data and High Correlation Using MCMC Algorithm

              No Data    Listwise   Mean       Length of the Markov Chain
              Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$     0.919      0.913      0.899      0.910      0.911      0.909      0.903
              (0.104)    (0.113)    (0.105)    (0.102)    (0.101)    (0.103)    (0.101)
$\beta_1$     1.097      1.067      1.053      1.196      1.205      1.199      1.199
              (0.138)    (0.151)    (0.141)    (0.148)    (0.155)    (0.144)    (0.147)
$\beta_2$     1.000      1.072      1.112      1.042      1.051      1.041      1.055
              (0.132)    (0.145)    (0.135)    (0.140)    (0.146)    (0.143)    (0.146)
$R^2$         0.240      0.254      0.226


Table 27.3. Missing Data Imputation: Linear Regression Estimates with 25% Missing Data and High Correlation Using MCMC Algorithm

              No Data    Listwise   Mean       Length of the Markov Chain
              Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$     0.919      0.863      0.984      0.899      0.898      0.925      0.900
              (0.104)    (0.167)    (0.108)    (0.108)    (0.105)    (0.111)    (0.110)
$\beta_1$     1.097      1.048      1.062      1.028      1.047      1.082      0.987
              (0.138)    (0.167)    (0.150)    (0.152)    (0.166)    (0.161)    (0.155)
$\beta_2$     1.000      1.129      1.156      1.071      1.085      1.024      1.124
              (0.132)    (0.161)    (0.148)    (0.152)    (0.144)    (0.172)    (0.152)
$R^2$         0.240      0.268      0.203

In Table 27.4 the simulation exercise is repeated for the “worst-case” scenario of
low $\rho$ and 25% missing data. The divergence between the point estimates from the
full sample and those from the listwise deletion and mean imputation cases is overall
relatively greater than that for the MCMC cases. However, even in this case there are
no really dramatic differences from the full-sample estimates. Once again
we see that the benefits of running a long Markov chain are not apparent in this
example.


27.8.2.  Logit  Regression  with  Missing  Data  on  Regressors 

We  next  consider  an  example  of  a  nonlinear  model  with  missing  data  on  regressors 
using  simulated  data.  In  this  simulation  example  we  retain  the  dgp  given  before  but 
change  the  dependent  variable  into  a  discrete  dichotomous  variable.  First,  reinterpret 


Table 27.4. Missing Data Imputation: Linear Regression Estimates with 10%
Missing Data and Low Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    1.121      1.162      1.142      1.149      1.155      1.154      1.141
             (0.099)    (0.130)    (0.103)    (0.104)    (0.103)    (0.104)    (0.101)
$\beta_1$    1.099      0.930      1.052      1.026      1.020      1.004      1.044
             (0.107)    (0.134)    (0.121)    (0.127)    (0.128)    (0.124)    (0.124)
$\beta_2$    1.102      1.122      1.215      1.130      1.157      1.137      1.151
             (0.107)    (0.134)    (0.124)    (0.128)    (0.129)    (0.129)    (0.119)
$R^2$        0.243      0.235      0.186

Table 27.5. Missing Data Imputation: Logistic Regression Estimates with 10%
Missing Data and High Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    -0.447     -0.498     -0.439     -0.527     -0.534     -0.531     -0.539
             (0.070)    (0.078)    (0.070)    (0.073)    (0.073)    (0.072)    (0.073)
$\beta_1$    -0.597     -0.658     -0.602     -0.620     -0.673     -0.681     -0.675
             (0.096)    (0.108)    (0.098)    (0.106)    (0.102)    (0.101)    (0.103)
$\beta_2$    -0.444     -0.474     -0.523     -0.597     -0.540     -0.536     -0.553
             (0.092)    (0.103)    (0.094)    (0.107)    (0.103)    (0.099)    (0.101)

the simulation design given for the linear regression example, so $y = y^*$, a latent
variable. Let the dgp be

$$y_i^* = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i, \quad i = 1, 2, \ldots, N. \tag{27.22}$$

Then a dichotomous $y_i$ is generated according to the following rule:

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0, \\ 0 & \text{if } y_i^* \le 0. \end{cases} \tag{27.23}$$


We will model the probability that $y_i = 0$ using the logit model, even though the dgp
is that for the probit model. As discussed in Section 14.4.1, the logit model identifies
the parameter vector $\beta/\sigma$, where the variance $\sigma^2 = \pi^2/3$. With all elements of $\beta$ set
equal to one, the logit model will provide estimates of the true parameter value of
approximately $-0.551$. The MCMC estimation is set up as before with a noninformative
prior.
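
The benchmark value quoted above can be reproduced with a few lines of arithmetic. This is only a sketch of the scaling calculation: the variable names are hypothetical, and the negative sign in the text arises because the probability of a zero outcome, rather than of a one, is being modeled.

```python
import math

sigma_logistic = math.pi / math.sqrt(3)    # standard deviation implied by variance pi^2 / 3
beta = 1.0                                 # each coefficient in the dgp is set to one
logit_scale_value = beta / sigma_logistic  # magnitude of beta / sigma
print(round(logit_scale_value, 3))
```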

Table 27.5 covers the favorable case with 10% missing data and high correlation
between $x_1$ and $x_2$, and Table 27.6 covers the less favorable case with 25% missing
data and low correlation between $x_1$ and $x_2$.

In the first case, even with no missing data the estimate $\hat\beta_2$ is substantially off its
expected value. The MCMC point estimates change somewhat when the length of
the Markov chain is increased from 10 to 1,000. However, when more simulations
are implemented, there is only slight change in point estimates, a result that we can
interpret as an indication of convergence of the chain to its stationary distribution.

For the second example involving a less favorable simulation design, the results
are as shown in Table 27.6. The main difference is that the divergence between the
expected point estimates and the estimated values is somewhat larger than for the previous
case. However, broadly speaking the performance of the multiple imputation method
in the logistic regression is similar to that in the linear regression.
Table 27.6. Missing Data Imputation: Logistic Regression Estimates with 25% Missing
Data and Low Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    -0.447     -0.658     -0.582     -0.605     -0.609     -0.609     -0.599
             (0.070)    (0.097)    (0.070)    (0.074)    (0.074)    (0.073)    (0.076)
$\beta_1$    -0.597     -0.434     -0.470     -0.447     -0.470     -0.471     -0.481
             (0.096)    (0.100)    (0.085)    (0.090)    (0.094)    (0.094)    (0.082)
$\beta_2$    -0.444     -0.593     -0.648     -0.634     -0.615     -0.576     -0.596
             (0.092)    (0.108)    (0.089)    (0.084)    (0.086)    (0.086)    (0.094)

27.9.  Practical  Considerations 

A major implication of the analysis of this chapter for practice is that analysis of
multiply, rather than singly, imputed data has theoretical advantages. Moreover,
model-based approaches are less ad hoc than mechanical approaches such as mean imputation
or hot deck. In many realistic applications devising an MCMC-type imputation
procedure may pose a significant challenge, however, compared to the relative simplicity of
the examples given in the last section.

A distinction may be drawn between multiple imputation in which the end product
is the data and multiple imputation in which the end product consists of estimated
coefficients for inference. Although both procedures may be model based, the second
may involve more complex econometric models. Examples are provided by Brownstone
and Valletta (1996), Stinebrickner (1999), Kennickell (1998), and Davey, Shanahan,
and Schafer (2001).

Even when the primary object is imputation, without extensive modeling the
problem may be far from simple. For example, in his study of the 1995 Survey of Consumer
Finances, Kennickell (1998, p. 5) remarks:

[When] the survey contains a very large number of variables, there is substantial
missing or partially missing (range) information, the patterns of missing
information are highly heterogeneous, the distributions of many of the variables are highly
skewed, and the data have a complex structure, [then], analysis of the survey in the
absence of imputation would be a formidable task. Moreover, anyone using the
public version of the data set would lack key frame data that turn out to be important
for understanding the distributions of the missing data. Thus, even on pure efficiency
grounds, there is a good case for imputing the missing data.

Despite the complexity of the problem, Kennickell was able to use imputation
procedures similar to those discussed in this chapter.

Stinebrickner (1999), also facing a missing data situation in which listwise deletion
“would leave the econometrician with too little data to estimate the model of interest,”
develops a two-stage simulated likelihood-based procedure for estimating the joint
distribution of the missing data and estimating a duration model for the first teaching
spell.

For relatively simple cases software such as the SAS package Proc MI may be
used. S-Plus and SOLAS also provide software support. A helpful guide and survey
of computer software packages is given in Horton and Lipsitz (2001). For additional
information see the relevant Web sites.

Most of the analysis of the chapter is based on assuming an ignorable missing data
mechanism. From an econometric viewpoint this might be a major simplification. For
example, see Lillard, Smith, and Welch (1986), who critique the Census hot deck
method for imputing missing wages. How should one proceed if the mechanism is
nonignorable? In the notation of Section 27.4, a nonignorable missing data
mechanism would imply that parameters $\theta$ and $\psi$ are not distinct. Then one must specify
the missing data mechanism explicitly, as in the case of selection models and models
of attrition bias (see Chapter 16 and Section 23.5.2). Schafer (1997, p. 28) provides
some relevant references to the literature.

27.10.  Bibliographic  Notes 

Important early references include Little and Rubin (1987) and Rubin (1987). Allison (2002)
provides a relatively nontechnical but lucid introduction to the missing data problem and
literature. Rubin (1996) provides a survey with historical perspective. Schafer (1997) provides a
more complete analysis that covers categorical data, mixed discrete-continuous data, and data
from complex surveys.

27.2  Meng  (2000)  provides  a  historical  perspective  on  the  missing  data  mechanism. 

27.5  Little (1988, 1992) provides a good review of the literature on linear regression with
missing regressors, covering both non-model-based and model-based approaches.


- Exercises - 

27-1  Consider any regression model, linear or nonlinear, with dependent variable $y$
and exogenous variables $x$, and iid errors $\varepsilon$. Show that if the probability of
missing data on $x$ is independent of $y$, then the regression based on listwise deletion
will provide a consistent estimate of the conditional mean function. [Hint: Show
that the conditional distribution of $y$ given $x$ is not affected by missing
observations.]

27-2  (Adapted from Gourieroux and Monfort, 1981). Consider the regression model
$y = \beta_1 x + Z\beta_2 + u$, where $y$ is an $N \times 1$ vector, $Z$ is an $N \times K$ matrix, and $x$
is an $N \times 1$ vector of a scalar regressor, some of whose elements are missing.
Assume that observations are missing at random and $E[u|x, Z] = 0$ and
$E[uu'|x, Z] = \sigma^2 I_N$. Both $y$ and $Z$ are fully observed. The following approach
is proposed to deal with the missing data. Assume a linear regression model
relating $x$ to $Z$, $x = Z\gamma + \varepsilon$, where $E[\varepsilon|Z] = 0$ and $E[\varepsilon\varepsilon'|Z] = \sigma_\varepsilon^2 I_N$. Then let
$\hat\gamma = [Z_c'Z_c]^{-1}Z_c'x_c$, where the subscript $c$ refers to “complete data.” Impute
values of $\hat{x}_m = Z_m[Z_c'Z_c]^{-1}Z_c'x_c$, where $x_m$ refers to the missing observations and
$Z_m$ to the corresponding values of $Z$. The original regression is then reestimated
using the full set of $N$ observations after replacing the missing values of $x$ by
imputed values.

(a)  Explain  why  the  OLS  regression  estimator  based  on  complete  and  imputed 
observations  might  be  biased  in  finite  samples. 

(b)  What  additional  assumptions  are  required  to  prove  that  the  OLS  estimator 
based  on  complete  plus  imputed  values  is  consistent? 

(c)  Is  the  OLS  estimator  efficient? 

27-3  Consider the point that when estimation of a model is undertaken after data
imputation the precision of the estimates is likely to be overstated if no adjustment
is made for the imputation step. In other words, imputed data may be regarded
as generated variables and hence subject to the problem of the sequential
two-step estimator discussed in Section 6.6. Explain whether an adjustment related
to imputation of missing data is necessary asymptotically.
Asymptotic Theory


A.1. Introduction

In this appendix we consider the behavior of a sequence of random variables as
$N \to \infty$.

In applications the index $N$ is the sample size and the sequence is an estimator,
such as $\hat\beta$ or $\hat\theta$, or a component of an estimator, such as $N^{-1}\sum_i x_i^2$ or $N^{-1}\sum_i x_i u_i$ in
the case of OLS with one regressor and no intercept, or a test statistic.

For estimation theory it is sufficient to focus on two aspects of the behavior of the
sequence $b_N$ as $N \to \infty$. First, we consider convergence in probability of $b_N$ to a
limit $b$, a constant or random variable that is very close to $b_N$ in a probabilistic sense
defined in the following. Second, if the limit $b$ is a random variable, which may require
a rescaling of the original sequence, we consider the limit distribution.

Estimators are usually functions of averages or sums. Then it is easiest to derive
limiting results by invoking results on the behavior of averages, notably laws of large
numbers and central limit theorems. The notation used is to consider an average
$\bar{X}_N = N^{-1}\sum_i X_i$, where $X_i$ here is generic notation for a random variable being
averaged and should not be confused with the use of $\mathbf{x}_i$ to denote the regressor vector. For
example, for OLS with one regressor and no intercept we will apply a law of large
numbers to the average of $X_i = x_i^2$ and a central limit theorem to the average of $X_i = x_i u_i$.
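
To see where these two averages come from, consider the model $y_i = \beta x_i + u_i$ (the symbol $\beta$ and this explicit model statement are assumptions introduced here for illustration). The OLS estimator can then be decomposed as follows:

```latex
\hat\beta = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}
          = \beta + \frac{N^{-1}\sum_{i=1}^N x_i u_i}{N^{-1}\sum_{i=1}^N x_i^2},
\qquad
\sqrt{N}\,(\hat\beta - \beta) = \frac{N^{-1/2}\sum_{i=1}^N x_i u_i}{N^{-1}\sum_{i=1}^N x_i^2}.
```

A law of large numbers handles the denominator, and a central limit theorem handles the rescaled numerator.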

Table A.1 summarizes the definitions and theorems presented in the remainder of
this appendix. These are stated without proof but with some discussion. The focus is
on results used to obtain asymptotically normal estimators, the usual case when
cross-section data are used. Additional results are needed for application to nonparametric
estimation, to parametric estimation when the support of the data depends on
parameters, and to time series estimation when data have unit roots.

The first key concept, convergence in probability, is presented in Section A.2. This is
established using laws of large numbers given in Section A.3. The other key concept,
convergence in distribution, is presented in Section A.4. Convergence to the normal
distribution is established using central limit theorems given in Section A.5. Further
results and common terminology for limit multivariate normal distributions are given
Table A.1. Asymptotic Theory: Definitions and Theorems

Definition or Theorem   Name                              Equation
A.1                     Convergence in Probability        (A.1)
A.2                     Consistency                       (A.2)
A.3                     Slutsky                           (A.3)
A.4                     Mean-Square Convergence           (A.4)
A.5                     Chebychev’s Inequality            (A.5)
A.6                     Almost Sure Convergence           (A.6)
A.7                     Law of Large Numbers              (A.7)
A.8                     Kolmogorov LLN
A.9                     Markov LLN
A.10                    Convergence in Distribution       (A.9)
A.11                    Continuous Mapping                (A.10)
A.12                    Transformation                    (A.11)
A.13                    Central Limit Theorem             (A.13)
A.14                    Lindeberg-Levy CLT
A.15                    Liapounov CLT
A.16                    Cramer-Wold Device
A.17                    Limit Normal Product Rule         (A.15)
A.18                    Asymptotic Distribution           (A.17)
A.19                    Asymptotic Variance               (A.18)
A.20                    Estimated Asymptotic Variance     (A.19)
A.21                    Asymptotic Efficiency
A.22                    Stochastic Order of Magnitude

in Section A.6. Stochastic order of magnitude, a convenient notation commonly used
in asymptotic analysis, is presented in Section A.7. Section A.8 presents some useful
properties of expectations.


A.2.  Convergence  in  Probability 

Because of the intrinsic randomness of a sample we can never be certain that a
sequence $b_N$, such as an estimator $\hat\theta$ (often denoted $\hat\theta_N$ to make clear that it is a
sequence), will be within a given small distance of its limit, even if the sample is
infinitely large. However, we can be almost certain. Different ways of expressing this
near certainty correspond to different types of convergence of a sequence of random
variables to a limit. The one most used in econometrics is convergence in probability.

A.2.1.  Convergence  in  Probability 

Recall that a sequence of nonstochastic real numbers $\{a_N\}$ converges to $a$ if, for any
$\varepsilon > 0$, there exists $N^* = N^*(\varepsilon)$ such that, for all $N > N^*$,

$$|a_N - a| < \varepsilon.$$

For example, if $a_N = 2 + 3/N$, then the limit is $a = 2$ since $|a_N - a| = |2 + 3/N - 2| =
|3/N| < \varepsilon$ for all $N > N^* = 3/\varepsilon$.
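
The threshold in this example is easy to check numerically. The sketch below uses a hypothetical helper `n_star` and an arbitrary tolerance of 0.07, and confirms that every term beyond the threshold lies within the tolerance of the limit.

```python
import math

def n_star(eps):
    """Threshold N* = 3 / eps for the sequence a_N = 2 + 3 / N."""
    return 3 / eps

eps = 0.07                                   # arbitrary illustrative tolerance
start = math.floor(n_star(eps)) + 1          # first integer N strictly greater than N*
# |a_N - 2| = 3 / N should be below eps for every N > N*
all_within = all(abs((2 + 3 / n) - 2) < eps for n in range(start, start + 10_000))
print(start, all_within)
```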

When more generally we have a sequence of random variables we cannot be certain
of being within $\varepsilon$ of the limit, even for large $N$, because of intrinsic randomness.
Instead, we require that the probability of being within $\varepsilon$ is arbitrarily close to one.
Thus we require

$$\lim_{N \to \infty} \Pr[|b_N - b| < \varepsilon] = 1,$$

for any $\varepsilon > 0$. A formal definition is the following:

Definition A.1 (Convergence in Probability): A sequence of random variables
$\{b_N\}$ converges in probability to $b$ if, for any $\varepsilon > 0$ and $\delta > 0$, there exists
$N^* = N^*(\varepsilon, \delta)$ such that, for all $N > N^*$,

$$\Pr[|b_N - b| < \varepsilon] > 1 - \delta. \tag{A.1}$$

We write $\operatorname{plim} b_N = b$, where plim is shorthand for probability limit, or $b_N \xrightarrow{p} b$.

Note  that  b  may  be  a  constant  or  a  random  variable.  Convergence  in  probability 
includes  as  a  special  case  the  usual  definition  of  convergence  for  a  sequence  of  real 
variables. 
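
A small Monte Carlo experiment can make the definition concrete. In the sketch below the sequence is the mean of n Uniform(0,1) draws with limit 0.5; this choice, the tolerance of 0.05, and the function name `prob_within` are illustrative assumptions. The estimated probability of being within the tolerance rises toward one as n grows.

```python
import random

def prob_within(n, eps=0.05, reps=2000, seed=0):
    """Monte Carlo estimate of Pr[|b_N - b| < eps] for b_N = mean of n Uniform(0,1) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        b_n = sum(rng.random() for _ in range(n)) / n
        hits += abs(b_n - 0.5) < eps          # True counts as 1
    return hits / reps

for n in (10, 100, 1000):
    print(n, prob_within(n))
```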

Definition A.1 is for a sequence of scalar random variables. The extension to vector
random variables, such as a parameter vector estimator, is straightforward. We can
either apply the theory for each element of $\mathbf{b}_N$, or replace $|b_N - b|$ by the scalar
$(\mathbf{b}_N - \mathbf{b})'(\mathbf{b}_N - \mathbf{b}) = (b_{1N} - b_1)^2 + \cdots + (b_{KN} - b_K)^2$ or its square root $\|\mathbf{b}_N - \mathbf{b}\|$.

When the sequence $\{b_N\}$ is a sequence of parameter estimates $\hat\theta$, we have the
following large sample analogue of unbiasedness.

Definition A.2 (Consistency): An estimator $\hat\theta$ is consistent for $\theta_0$ if

$$\operatorname{plim} \hat\theta = \theta_0. \tag{A.2}$$

The subscript 0 on $\theta$ is explained in Section 5.2.3. Note that unbiasedness need
not imply consistency. Unbiasedness states only that the expected value of $\hat\theta$ is $\theta_0$,
and it permits variability around $\theta_0$ that need not disappear as the sample size goes to
infinity. Also, a consistent estimator need not be unbiased. For example, adding $1/N$
to an unbiased and consistent estimator produces a new estimator that is biased but
still consistent.
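
The last example can be written out explicitly. The notation $\tilde\theta$ for the new estimator is hypothetical and is introduced here only for illustration; $\hat\theta$ is taken to be unbiased and consistent for $\theta_0$.

```latex
\tilde\theta = \hat\theta + \frac{1}{N}, \qquad
E[\tilde\theta] = \theta_0 + \frac{1}{N} \neq \theta_0, \qquad
\operatorname{plim}\tilde\theta = \operatorname{plim}\hat\theta + \lim_{N\to\infty}\frac{1}{N} = \theta_0 .
```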

Although the sequence of vector random variables $\{\mathbf{b}_N\}$ may converge to a random
variable $\mathbf{b}$, in many econometric applications $\{\mathbf{b}_N\}$ converges to a constant. For
example, we hope that an estimator of a parameter will converge in probability to the
parameter itself. One should be aware that some of the results that follow apply only
if the limit value $\mathbf{b}$ is a constant.

Theorem A.3 (Slutsky’s Theorem): Let $\mathbf{b}_N$ be a finite-dimensional vector of
random variables, and $g(\cdot)$ be a real-valued function continuous at a constant
vector point $\mathbf{b}$. Then

$$\mathbf{b}_N \xrightarrow{p} \mathbf{b} \;\Rightarrow\; g(\mathbf{b}_N) \xrightarrow{p} g(\mathbf{b}). \tag{A.3}$$

Proof is given in Amemiya (1985, p. 79). Ruud (2000) presents a related result (see
also Rao, 1973, p. 124) that lets the limit $\mathbf{b}$ be a random variable, at the expense of
restricting $g(\cdot)$ to be continuous everywhere. Note that some authors instead refer to
Theorem A.12 below as Slutsky’s Theorem.

Theorem A.3 is one of the major reasons for the prevalence of asymptotic
results versus finite-sample results in econometrics. It states a very convenient property
that does not hold for expectations. For example, $\operatorname{plim}(b_{1N}, b_{2N}) = (b_1, b_2)$ implies
$\operatorname{plim}(b_{1N} b_{2N}) = b_1 b_2$, whereas $E[b_{1N} b_{2N}]$ generally differs from $E[b_{1N}]E[b_{2N}]$.
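
A simple case makes the contrast concrete. Suppose, purely for illustration, that both sequences equal $\bar{X}_N$, the mean of $N$ iid draws with mean $\mu$ and variance $\sigma^2$ (this setup is an assumption introduced here). Then the probability limit of the product is the product of the probability limits, but the same is not true of expectations in any finite sample:

```latex
\operatorname{plim}(\bar{X}_N \cdot \bar{X}_N) = \mu^2,
\qquad
E[\bar{X}_N \cdot \bar{X}_N] = V[\bar{X}_N] + (E[\bar{X}_N])^2
  = \frac{\sigma^2}{N} + \mu^2 \neq E[\bar{X}_N]\,E[\bar{X}_N] = \mu^2 .
```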


A.2.2.  Alternative  Modes  of  Convergence 

It  is  often  easier  to  establish  alternative  modes  of  convergence,  which  in  turn  imply 
convergence  in  probability. 

These  alternative  modes  are  given  for  completeness.  Laws  of  large  numbers,  given 
in  the  next  section,  are  used  much  more  often. 

Definition A.4 (Mean-Square Convergence): A sequence of random variables
$\{b_N\}$ is said to converge in mean square to a random variable $b$ if

$$\lim_{N \to \infty} E[(b_N - b)^2] = 0. \tag{A.4}$$

We write $b_N \xrightarrow{m} b$. Convergence in mean square is useful because $b_N \xrightarrow{m} b$ implies
$b_N \xrightarrow{p} b$ (see Rao, 1973, p. 110) and is often easy to prove. This does require existence
of the variance of $b_N$, however. If $E[b_N] = b$, then we need to show that the variance
of $b_N$ goes to zero as $N \to \infty$. If $b_N$ is instead biased for $b$ then we require that the
sum of the variance and bias squared goes to zero.

Another result that can be used to show convergence in probability is Chebychev’s
inequality.

Theorem A.5 (Chebychev’s Inequality): For any random variable $Z$ with mean
$\mu$ and variance $\sigma^2$,

$$\Pr[(Z - \mu)^2 > k] \le \sigma^2/k, \quad \text{for any } k > 0. \tag{A.5}$$

For a proof see Rao (1973, p. 95). The generalized Chebychev’s inequality replaces
$(Z - \mu)^2$ in Theorem A.5 by any nonnegative function $g(Z)$ and shows that $\Pr[g(Z) >
k] \le E[g(Z)]/k$, for any $k > 0$. See Amemiya (1985, p. 87).

Theorem  A.5  can  be  used  to  verify  convergence  in  probability  by  replacing  Z  with 
bN.  The  theorem  requires  the  mean  and  variance  of  bN,  which  are  easily  obtained  for 
estimators  that  involve  an  average  of  independent  random  variables.  However,  in  such 
cases  we  can  often  take  an  even  easier  route  and  directly  apply  a  law  of  large  numbers 
to  the  average  to  obtain  the  probability  limit. 
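
As a rough numerical check of how Theorem A.5 delivers convergence in probability for a sample mean, the sketch below sets Z equal to the mean of n Uniform(0,1) draws. This is an illustrative choice, not taken from the text. With k equal to the squared tolerance, the bound becomes the variance of one draw, 1/12, divided by n times the squared tolerance. The function names are hypothetical.

```python
import random

def tail_frequency(n, eps, reps=2000, seed=1):
    """Simulated frequency of (xbar - mu)^2 > eps^2 for means of n Uniform(0,1) draws."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        xbar = sum(rng.random() for _ in range(n)) / n
        exceed += (xbar - 0.5) ** 2 > eps ** 2
    return exceed / reps

def chebychev_bound(n, eps, var=1 / 12):
    """Upper bound from Theorem A.5 with Z = xbar, V[xbar] = var / n, and k = eps^2."""
    return var / (n * eps ** 2)

for n in (50, 200, 800):
    print(n, tail_frequency(n, 0.05), round(chebychev_bound(n, 0.05), 4))
```

The bound shrinks to zero as n grows, which forces the tail probability to zero, although the simulated frequencies are typically far below the bound.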
A conceptually more difficult type of convergence is almost sure convergence.

Definition A.6 (Almost Sure Convergence): A sequence of random variables
$\{b_N\}$ is said to converge almost surely to $b$ if

$$\Pr[\lim_{N \to \infty} b_N = b] = 1. \tag{A.6}$$

This is denoted $b_N \xrightarrow{as} b$. Almost sure convergence implies convergence in
probability (see Rao, 1973, p. 111). Convergence in probability allows more erratic behavior
in $b_N$ than does almost sure convergence.

Almost sure convergence is also called strong consistency for $b$, to distinguish
it from convergence in probability, which is called weak consistency for $b$.
Convergence in probability is easier to understand and is sufficient for most econometric
applications.


A.3. Laws of Large Numbers

Laws of large numbers are theorems for convergence in probability (or almost surely)
in the special case where the sequence $\{b_N\}$ is a sample average, that is, $b_N = \bar{X}_N$,
where

$$\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i. \tag{A.7}$$

Note that $X_i$ here is general notation for a random variable, and in the regression
context it does not necessarily denote the regressor variables.

A law of large numbers provides a much easier way to establish the probability
limit of a sequence $\{b_N\}$ than the alternatives of brute-force use of the $(\delta, \varepsilon)$ definition
given in (A.1) or use of alternative modes of convergence that imply convergence in
probability.

Definition A.7 (Law of Large Numbers): A weak law of large numbers
(LLN) specifies conditions on the individual terms $X_i$ in $\bar{X}_N$ under which

$$(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{p} 0. \tag{A.8}$$

For a strong law of large numbers the convergence is instead almost surely.

It can be helpful to think of a LLN as establishing that $\bar{X}_N$ goes to its expected
value, even though strictly speaking it implies the weaker condition that $\bar{X}_N$ goes to
the limit of its expected value, since (A.8) implies that

$$\operatorname{plim} \bar{X}_N = \lim E[\bar{X}_N].$$

If the $X_i$ have common mean $\mu$, then this simplifies to $\operatorname{plim} \bar{X}_N = \mu$.

Two  leading  examples  of  laws  of  large  numbers  are  the  following: 

Theorem A.8 (Kolmogorov LLN): Let $\{X_i\}$ be iid (independent and
identically distributed). If and only if $E[X_i] = \mu$ exists and $E[|X_i|] < \infty$, then
$(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{as} 0$.

Theorem A.9 (Markov LLN): Let $\{X_i\}$ be inid (independent but not
identically distributed) with $E[X_i] = \mu_i$ and $V[X_i] = \sigma_i^2$. If
$\sum_{i=1}^{\infty} \left( E[|X_i - \mu_i|^{1+\delta}] / i^{1+\delta} \right) < \infty$, for some $\delta > 0$, then $(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{as} 0$.

See White (2001a, p. 32 and p. 35) for statements of these theorems and Rao (1973,
pp. 114-116) for proofs. Both laws give the stronger result of almost sure convergence,
which implies the desired convergence in probability. Rao (1973) calls Theorem A.8
Kolmogorov LLN2 and presents Theorem A.9 for the special case $\delta = 1$, which he
calls Kolmogorov LLN1.

The Kolmogorov LLN allows the variance of $X_i$ to not even exist, at the expense
of requiring an identical distribution. It simplifies to $\bar{X}_N \xrightarrow{as} \mu$, where $\mu = E[X]$. A
weak version of this law, sufficient for most econometrics applications, is Khinchine’s
Theorem, which states that for $\{X_i\}$ iid the existence of $E[X]$ implies convergence in
probability.

The Markov LLN no longer requires an identical distribution, but it does require
existence of an absolute moment beyond the first. An obvious choice of $\delta$ is $\delta = 1$.
Then the variance is needed and the side condition is that $\sum_{i=1}^{\infty} (\sigma_i^2/i^2) < \infty$. The
variance can vary and even grow with $i$, provided it does not grow so fast that $(\sigma_i^2/i^2)$
has infinite sum. The side condition is satisfied if $\sigma_i^2 = \sigma^2$, since $\sum_{i=1}^{\infty} 1/i^2$ converges,
but is not satisfied if $\sigma_i^2 = i\sigma^2$, since $\sum_{i=1}^{\infty} 1/i$ diverges.
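
The two cases can be compared numerically by computing partial sums of the side condition with the choice of one for the moment parameter. This is a sketch only: the helper `partial_sum` is hypothetical and the common variance is set to one for convenience.

```python
def partial_sum(term, n):
    """Sum term(i) for i = 1, ..., n."""
    return sum(term(i) for i in range(1, n + 1))

sigma2 = 1.0
for n in (10**2, 10**4, 10**6):
    constant_var = partial_sum(lambda i: sigma2 / i**2, n)      # variance constant in i
    growing_var = partial_sum(lambda i: i * sigma2 / i**2, n)   # variance proportional to i
    print(n, round(constant_var, 4), round(growing_var, 4))
```

The first column of sums settles down quickly, whereas the second keeps growing roughly like the logarithm of the number of terms.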

In most microeconometrics applications, including regression with stratified
sampling or with fixed regressors, the more complicated Markov LLN is needed.

Laws of large numbers are appealing because they require assumptions on the
individual components $X_i$, rather than the sequence of averages $\bar{X}_N$. They are the most
common way econometricians prove convergence in probability, since most
estimators and test statistics are functions of averages of the data and unobserved random
variables.


A.4.  Convergence  in  Distribution 

Given consistency, the estimator $\hat\theta$ has a degenerate distribution that collapses on $\theta_0$ as
$N \to \infty$. We need to magnify or rescale $\hat\theta$ to obtain a random variable that has a
nondegenerate distribution as $N \to \infty$. Often the appropriate scale factor is $\sqrt{N}$, in which
case we consider the behavior of the sequence of random variables $b_N = \sqrt{N}(\hat\theta - \theta_0)$.

In general, the $N$th random variable in the sequence $b_N$ has an extremely
complicated cumulative distribution function (cdf) $F_N$. Like any other function, $F_N$ may
have a limit function, where convergence is in the usual mathematical sense.

Definition A.10 (Convergence in Distribution): A sequence of random
variables $\{b_N\}$ is said to converge in distribution to a random variable $b$ if

$$\lim_{N \to \infty} F_N = F, \tag{A.9}$$

at every continuity point of $F$, where $F_N$ is the distribution of $b_N$, $F$ is the
distribution of $b$, and convergence is in the usual mathematical sense.

We write $b_N \xrightarrow{d} b$, and we call $F$ the limit distribution of $\{b_N\}$.
Convergence in probability implies convergence in distribution; that is, $b_N \xrightarrow{p} b$
implies $b_N \xrightarrow{d} b$ (see Rao, 1973, p. 122).

In general, the converse is not true. For example, let $b_N = X_N$, the $N$th realization
of $X \sim \mathcal{N}[\mu, \sigma^2]$. Then $b_N \xrightarrow{d} b \sim \mathcal{N}[\mu, \sigma^2]$, but clearly $(b_N - b)$ has variance that
does not disappear as $N \to \infty$, so $b_N$ does not converge in probability to $b$.

In the special case where $b$ is a constant, however, $b_N \xrightarrow{d} b$ implies $b_N \xrightarrow{p} b$ (see
Rao, 1973, p. 120). In this case the limit distribution is degenerate, with all its mass
at $b$.

To extend limit distribution to vector random variables simply define $F_N$ and $F$
to be the respective cdfs of vectors $\mathbf{b}_N$ and $\mathbf{b}$.

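Theorem A.11 (Continuous Mapping Theorem): Let $\mathbf{b}_N$ be a finite-dimensional
vector of random variables, and let $g(\cdot)$ be a continuous real-valued
function. Then

$$\mathbf{b}_N \xrightarrow{d} \mathbf{b} \;\Rightarrow\; g(\mathbf{b}_N) \xrightarrow{d} g(\mathbf{b}). \tag{A.10}$$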

For proof see Rao (1973, p. 124). Theorem A.11 is the convergence in distribution
analogue of Theorem A.3 for convergence in probability.

The following theorem considers the effect of transforming a sequence with limit
distribution by addition of, or multiplication by, or division by a sequence that
converges in probability to a constant.

Theorem A.12 (Transformation Theorem): If $a_N \xrightarrow{d} a$ and $b_N \xrightarrow{p} b$, where $a$
is a random variable and $b$ is a constant, then

(i) $a_N + b_N \xrightarrow{d} a + b$,

(ii) $a_N b_N \xrightarrow{d} ab$, and (A.11)

(iii) $a_N / b_N \xrightarrow{d} a/b$, provided $\Pr[b = 0] = 0$.

For proof see Rao (1973, p. 122). Theorem A.12 is also referred to as Cramer’s
Theorem. It is also called Slutsky’s Theorem, the name we have applied to Theorem A.3.

Theorem A.12 is exceptionally useful because it permits one to separately find the
limit distribution of $a_N$ and the probability limit of $b_N$, rather than having to consider
the joint behavior of $a_N$ and $b_N$. Result (ii) is especially useful and is sometimes called
the Product Rule.


A.5. Central Limit Theorems

Central limit theorems are theorems on convergence in distribution when the sequence
$\{b_N\}$ is a sample average. A central limit theorem provides a simpler way to obtain
the limit distribution of a sequence $\{b_N\}$ than the alternatives such as brute-force use
of (A.9).

From a law of large numbers, the sample average has a degenerate distribution as it
converges to a constant, $\lim \mathrm{E}[\bar{X}_N]$. So we scale $(\bar{X}_N - \mathrm{E}[\bar{X}_N])$ by its standard deviation
to construct a random variable with unit variance that may converge to a nondegenerate
distribution.


Definition A.13 (Central Limit Theorem): Let

$Z_N = \dfrac{\bar{X}_N - \mathrm{E}[\bar{X}_N]}{\sqrt{\mathrm{V}[\bar{X}_N]}}$,  (A.12)

where $\bar{X}_N$ is a sample average. A central limit theorem (CLT) specifies the
conditions on the individual terms $X_i$ in $\bar{X}_N$ under which

$Z_N \xrightarrow{d} \mathcal{N}[0, 1]$,  (A.13)

that is, under which $Z_N$ converges in distribution to a standard normal random
variable.


By construction $Z_N$ has mean 0 and variance 1, so what needs to be proved is the
normality. Formal proofs of a CLT do this by obtaining the characteristic function, a
generalization of the moment-generating function, of $Z_N$ and showing that it converges
as $N \to \infty$ to the characteristic function of the standard normal distribution.

Note that if $\bar{X}_N$ satisfies a central limit theorem, then so too does $h(N)\bar{X}_N$ for
functions $h(\cdot)$ such as $h(N) = \sqrt{N}$, since

$Z_N = \dfrac{h(N)\bar{X}_N - \mathrm{E}[h(N)\bar{X}_N]}{\sqrt{\mathrm{V}[h(N)\bar{X}_N]}}$.

In many applications it is convenient to apply the central limit theorem to the normalization
$\sqrt{N}\bar{X}_N = N^{-1/2} \sum_{i=1}^{N} X_i$, since $\mathrm{V}[\sqrt{N}\bar{X}_N]$ is finite.

Examples  of  central  limit  theorems  include  the  following: 

Theorem A.14 (Lindeberg-Levy CLT): Let $\{X_i\}$ be iid with $\mathrm{E}[X_i] = \mu$ and
$\mathrm{V}[X_i] = \sigma^2$. Then $Z_N \xrightarrow{d} \mathcal{N}[0, 1]$.


For  a  proof,  see  Rao  (1973,  p.  127). 

This is the CLT that usually appears in introductory statistics texts and is useful in
the iid case. Since $X_i$ is iid $[\mu, \sigma^2]$, $Z_N$ simplifies to the more familiar

$Z_N = \sqrt{N}\,\dfrac{\bar{X}_N - \mu}{\sigma}$.

Note that in the iid case only the existence of $\mu$ is required to ensure that $\bar{X}_N \xrightarrow{p} \mu$,
whereas a limiting normal distribution requires the additional assumption
that $\sigma^2$ exists.
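
The Lindeberg-Levy result can be checked informally by simulation. The sketch below is only an illustration under assumed settings (exponential data, N = 200, 5,000 replications, an arbitrary seed); none of these choices come from the text. It standardizes the sample mean of skewed iid draws and reports the mean, variance, and two-sided 95% coverage of the resulting $Z_N$ values, which should be close to 0, 1, and 0.95 if the normal approximation is working.

```python
import math
import random
import statistics

def standardized_mean(n, rng, rate=1.0):
    """Z_N = sqrt(N) * (sample mean - mu) / sigma for N iid exponential draws."""
    mu = 1.0 / rate      # mean of the exponential distribution
    sigma = 1.0 / rate   # standard deviation of the exponential distribution
    xbar = sum(rng.expovariate(rate) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

rng = random.Random(12345)  # fixed seed (arbitrary) so the run is reproducible
draws = [standardized_mean(200, rng) for _ in range(5000)]

z_mean = statistics.mean(draws)
z_var = statistics.pvariance(draws)
coverage = sum(abs(z) <= 1.96 for z in draws) / len(draws)
print(round(z_mean, 3), round(z_var, 3), round(coverage, 3))
```

Exponential data are used here only because they are visibly non-normal; any iid distribution with finite variance could be substituted.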

In applications such as OLS with fixed regressors the iid assumption is inappropriate.
One can apply a CLT for $\{X_i\}$ inid, though additional assumptions need to be
made.


Theorem A.15 (Liapounov CLT): Let $\{X_i\}$ be independent with $\mathrm{E}[X_i] = \mu_i$
and $\mathrm{V}[X_i] = \sigma_i^2$. If $\lim \left( \sum_{i=1}^{N} \mathrm{E}[|X_i - \mu_i|^{2+\delta}] \right) / \left( \sum_{i=1}^{N} \sigma_i^2 \right)^{(2+\delta)/2} = 0$, for some
choice of $\delta > 0$, then $Z_N \xrightarrow{d} \mathcal{N}[0, 1]$.

This variant of the Liapounov CLT is proved in White (2001a, p. 119). Rao (1973,
p. 128) presents the special case $\delta = 1$.

The main additional assumption in the Liapounov CLT is the existence of an absolute
moment of order higher than two. Note also the additional assumptions compared
to the corresponding LLN for iid data. For $X_i$ inid

$Z_N = \dfrac{\sum_{i=1}^{N} X_i - \sum_{i=1}^{N} \mu_i}{\sqrt{\sum_{i=1}^{N} \sigma_i^2}}$.

Theorems A.14 and A.15 are special cases of the more general Lindeberg-Feller
CLT (see Rao, 1973, p. 128). The Lindeberg-Feller CLT has a side condition that can
be difficult to verify.

In most microeconometrics applications, including regression with stratified sampling
or with fixed regressors, the more complicated Liapounov CLT is used.


A.6.  Multivariate  Normal  Limit  Distributions 

In  this  section  we  focus  on  the  typical  microeconometrics  case  of  estimators  with 
multivariate  normal  limit  distributions. 


A.6.1.  Multivariate  Normal  Limit  Distributions 

The  central  limit  theorems  presented  were  for  sequences  of  scalar  random  variables. 
They  can  be  extended  to  sequences  of  vector  random  variables  using  the  following 
result. 

Theorem A.16 (Cramer-Wold Device): Let $\{\mathbf{b}_N\}$ be a sequence of random
$k \times 1$ vectors. If $\boldsymbol{\lambda}'\mathbf{b}_N$ converges to a normal random variable for every $k \times 1$
constant nonzero vector $\boldsymbol{\lambda}$, then $\mathbf{b}_N$ converges to a multivariate normal random
variable.

Rao (1973, p. 128) gives a more general result that is not restricted to normal distributions.

The advantage of this result is that, if $\mathbf{b}_N$ is a vector of averages, then
$\boldsymbol{\lambda}'\mathbf{b}_N = \lambda_1 b_{1N} + \cdots + \lambda_k b_{kN}$ will be a scalar average and we can apply a scalar central limit
theorem given in the previous section. This will yield

$\dfrac{\boldsymbol{\lambda}'\mathbf{b}_N - \boldsymbol{\lambda}'\boldsymbol{\mu}_N}{\sqrt{\boldsymbol{\lambda}'\mathbf{V}_N\boldsymbol{\lambda}}} \xrightarrow{d} \mathcal{N}[0, 1]$,

where $\boldsymbol{\mu}_N = \mathrm{E}[\mathbf{b}_N]$ and $\mathbf{V}_N = \mathrm{V}[\mathbf{b}_N]$, in which case we conclude that

$\mathbf{V}_N^{-1/2}(\mathbf{b}_N - \boldsymbol{\mu}_N) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{I}]$.  (A.14)

This result is explained further in Subsection A.6.3.

A.6.2. Linear Transformation

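Microeconometric estimators can often be expressed as $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) = \mathbf{H}_N \mathbf{a}_N$, where
plim $\mathbf{H}_N$ exists and $\mathbf{a}_N$ has a limit normal distribution. The distribution of this product,
or linear transformation of $\mathbf{a}_N$, can be obtained directly from part (ii) of Theorem A.12
(Transformation Theorem). We restate it in a form that arises for many estimators.

Theorem A.17 (Product Limit Normal Rule): If a vector $\mathbf{a}_N \xrightarrow{d} \mathcal{N}[\boldsymbol{\mu}, \mathbf{A}]$ and
a matrix $\mathbf{H}_N \xrightarrow{p} \mathbf{H}$, where $\mathbf{H}$ is positive definite, then

$\mathbf{H}_N \mathbf{a}_N \xrightarrow{d} \mathcal{N}[\mathbf{H}\boldsymbol{\mu}, \mathbf{H}\mathbf{A}\mathbf{H}']$.  (A.15)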


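Theorem A.17 can be directly applied to an estimator. For example, the OLS estimator

$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) = \left( \dfrac{1}{N}\mathbf{X}'\mathbf{X} \right)^{-1} \dfrac{1}{\sqrt{N}}\mathbf{X}'\mathbf{u}$

is treated as the product of $\mathbf{H}_N = (N^{-1}\mathbf{X}'\mathbf{X})^{-1}$ and $\mathbf{a}_N = N^{-1/2}\mathbf{X}'\mathbf{u}$ and we find the
plim of $\mathbf{H}_N$ and the limit distribution of $\mathbf{a}_N$.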

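Theorem A.17 can also be used to justify replacement of a limit distribution variance
matrix by a consistent estimate without changing the limit distribution. If we have
shown that

$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{B}]$,

then it follows by Theorem A.17 that

$\hat{\mathbf{B}}_N^{-1/2} \times \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{I}]$

for any $\hat{\mathbf{B}}_N$ that is a consistent estimate for $\mathbf{B}$ and is positive definite.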
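
A scalar version of this replacement argument is the familiar studentized mean, where the unknown standard deviation is replaced by the sample standard deviation. The following sketch is an illustration only: the U[0,1] data, sample size, replication count, and seed are assumptions made here, not choices from the text. It checks that the studentized statistic still behaves approximately like a standard normal, in the sense that it exceeds 1.96 in absolute value about 5% of the time.

```python
import math
import random

def t_ratio(n, rng):
    """sqrt(N) * (xbar - mu) / s for N iid U[0,1] draws, with sigma replaced by s."""
    x = [rng.random() for _ in range(n)]
    xbar = sum(x) / n
    s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))  # consistent for sigma
    return math.sqrt(n) * (xbar - 0.5) / s                      # mu = 0.5 for U[0,1]

rng = random.Random(2024)   # arbitrary fixed seed
ratios = [t_ratio(100, rng) for _ in range(4000)]

mean_ratio = sum(ratios) / len(ratios)
var_ratio = sum((t - mean_ratio) ** 2 for t in ratios) / len(ratios)
rejection_rate = sum(abs(t) > 1.96 for t in ratios) / len(ratios)
print(round(var_ratio, 3), round(rejection_rate, 3))
```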


A.6.3.  Limit  Variance  Matrix 

A formal multivariate CLT yields a notationally cumbersome result such as (A.14).
Premultiplying by $\mathbf{V}_N^{1/2}$ and applying Theorem A.17, we can reexpress this in the simpler
form

$\mathbf{b}_N - \boldsymbol{\mu}_N \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}]$,

where $\mathbf{V} = \mathrm{plim}\, \mathbf{V}_N$ and we assume $\mathbf{b}_N$ and $\mathbf{V}_N$ are appropriately scaled so that $\mathbf{V}$
exists and is positive definite.

Different authors express the limit variance matrix $\mathbf{V}$ in different ways.

A general definition is simply

$\mathbf{V} = \mathrm{plim}\, \mathbf{V}_N$.

This is the most common way that results are presented and is the form used in this
text. In the fixed regressors case it simplifies to $\mathbf{V} = \lim \mathbf{V}_N$.

In microeconometrics estimation examples the matrix $\mathbf{V}_N$ is often a matrix average,
say

$\mathbf{V}_N = \dfrac{1}{N} \sum_{i=1}^{N} \mathbf{S}_i$,

where $\mathbf{S}_i$ is a square matrix that is a function of parameters and data for the $i$th observation.
Given independence over $i$ a law of large numbers can usually be applied so
that $\mathbf{V}_N - \mathrm{E}[\mathbf{V}_N] \xrightarrow{p} \mathbf{0}$. Then

$\mathbf{V} = \lim \mathrm{E}[\mathbf{V}_N] = \lim \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{E}[\mathbf{S}_i]$.

This is the type of expression used by Amemiya (1985).

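If the $\mathbf{S}_i$ are iid then $\mathrm{E}[\mathbf{S}_i] = \mathrm{E}[\mathbf{S}]$ is the same for all observations. So simple random
sampling leads to the simpler expression

$\mathbf{V} = \mathrm{E}[\mathbf{S}]$,

a form used for example by Newey and McFadden (1994) and Wooldridge (2002).

As an example, consider the OLS estimator with homoskedastic error, so that
$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \sigma^2 \mathbf{M}_{xx}^{-1}]$. Then $\mathbf{M}_{xx} = \mathrm{plim}\, N^{-1} \sum_i \mathbf{x}_i \mathbf{x}_i'$ can be re-expressed
as $\mathbf{M}_{xx} = \lim N^{-1} \sum_i \mathrm{E}[\mathbf{x}_i \mathbf{x}_i']$ if a law of large numbers applies, and as $\mathbf{M}_{xx} = \mathrm{E}[\mathbf{x}\mathbf{x}']$
under simple random sampling.

More complicated forms of $\mathbf{V}$ arise, such as the sandwich form $\mathbf{A}\mathbf{B}\mathbf{A}'$. The preceding
discussion is then applied to each component. For example, $\mathbf{B} = \mathrm{plim}\, \mathbf{B}_N$ may be
expressed as $\mathbf{B} = \lim \mathrm{E}[\mathbf{B}_N]$ or as $\mathbf{B} = \mathrm{E}[\mathbf{S}]$ under random sampling if $\mathbf{B}_N = N^{-1} \sum_i \mathbf{S}_i$.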


A.6.4.  Asymptotic  Distribution  and  Variance 

To obtain the limit distribution of an estimator we work with the sequence
$\mathbf{b}_N = \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ for theoretical reasons to ensure a nonzero variance of $\mathbf{b}_N$ as $N \to \infty$.
Then the limit distribution of $\mathbf{b}_N$ is a normal distribution, and many authors say that $\mathbf{b}_N$
is asymptotically normal and call the limit variance matrix the asymptotic variance
of $\mathbf{b}_N$.

It can be convenient to reexpress results in terms of the distribution and variance
matrix of $\hat{\boldsymbol{\theta}}$ itself.

Definition A.18 (Asymptotic Distribution of $\hat{\boldsymbol{\theta}}$): If

$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{B}]$,  (A.16)

then we say that in large samples $\hat{\boldsymbol{\theta}}$ is asymptotically normally distributed
with

$\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}[\boldsymbol{\theta}_0, N^{-1}\mathbf{B}]$,  (A.17)

where the term “in large samples” means that $N$ is large enough for (A.16) to be
a good approximation but not so large that the variance in (A.17) goes to zero.

The result (A.17) follows from (A.16) since dividing a random variable by $\sqrt{N}$
leads to division of its variance by $N$.

A  shorthand  notation  is  to  implicitly  presume  asymptotic  normality  and  use  the 
following  terminology. 

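Definition A.19 (Asymptotic Variance of $\hat{\boldsymbol{\theta}}$): If (A.16) holds then we say that
the asymptotic variance matrix of $\hat{\boldsymbol{\theta}}$ is

$\mathrm{V}[\hat{\boldsymbol{\theta}}] = N^{-1}\mathbf{B}$.  (A.18)

Definition A.20 (Estimated Asymptotic Variance of $\hat{\boldsymbol{\theta}}$): If (A.16) holds then
we say that the estimated asymptotic variance matrix of $\hat{\boldsymbol{\theta}}$ is

$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}] = N^{-1}\hat{\mathbf{B}}$,  (A.19)

where $\hat{\mathbf{B}}$ is a consistent estimate of $\mathbf{B}$.

Some authors use $\mathrm{Avar}[\hat{\boldsymbol{\theta}}]$ and $\widehat{\mathrm{Avar}}[\hat{\boldsymbol{\theta}}]$ in Definitions A.19 and A.20 to avoid potential
confusion with the variance operator $\mathrm{V}[\cdot]$. It should be clear that here $\mathrm{V}[\hat{\boldsymbol{\theta}}]$ means
asymptotic variance of an estimator since few estimators in this book have closed-form
expressions for the finite-sample variance.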

As an example of Definitions A.18-A.20, if $\{X_i\}$ are iid $[\mu, \sigma^2]$ then the Lindeberg-Levy
central limit theorem leads to $\sqrt{N}(\bar{X}_N - \mu)/\sigma \xrightarrow{d} \mathcal{N}[0, 1]$, or equivalently that
$\sqrt{N}(\bar{X}_N - \mu) \xrightarrow{d} \mathcal{N}[0, \sigma^2]$. We say that asymptotically $\bar{X}_N \overset{a}{\sim} \mathcal{N}[\mu, \sigma^2/N]$; the asymptotic
variance of $\bar{X}_N$ is $\sigma^2/N$; and the estimated asymptotic variance of $\bar{X}_N$ is $s^2/N$, where
$s^2$ is a consistent estimator of $\sigma^2$ such as $s^2 = \sum_i (X_i - \bar{X}_N)^2/(N - 1)$.
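
The sample-mean example translates directly into a few lines of code. The sketch below is illustrative only; the function name and the five data points are made up for this example. It returns the sample mean together with its estimated asymptotic variance $s^2/N$ and then forms the usual large-sample 95% interval.

```python
import math

def mean_with_asymptotic_variance(x):
    """Return (xbar, s^2/N): the sample mean and its estimated asymptotic variance."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # consistent estimator of sigma^2
    return xbar, s2 / n

sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data, not from the text
xbar, avar_hat = mean_with_asymptotic_variance(sample)
se = math.sqrt(avar_hat)                      # asymptotic standard error
ci = (xbar - 1.96 * se, xbar + 1.96 * se)     # large-sample 95% interval
print(xbar, avar_hat)
```

With only five observations the normal approximation is of course not to be trusted; the tiny sample is used so that the arithmetic ($s^2 = 2.5$, $s^2/N = 0.5$) can be verified by hand.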

A.6.5.  Asymptotic  Efficiency 

In finite samples the Cramer-Rao lower bound for the variance-covariance matrix of
unbiased estimators is $-\left( \mathrm{E}\left[ \partial^2 \ln L_N / \partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}' \,\big|_{\boldsymbol{\theta}_0} \right] \right)^{-1}$. This result extends to consistent
estimators that are asymptotically normal.

Definition A.21 (Asymptotic Efficiency): A consistent asymptotically normal
estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ is said to be asymptotically efficient if it has an asymptotic
variance-covariance matrix equal to the Cramer-Rao lower bound.


A.7.  Stochastic  Order  of  Magnitude 

A  useful  notation  for  rates  of  convergence  of  sequences  of  variables  is  the  order  of 
magnitude  of  a  sequence  using  (O,  o)  notation,  or  big-O,  little-o  notation. 

A sequence of nonstochastic real numbers $a_N$ is $O(g(N))$, if $\lim(a_N/g(N))$ is finite
nonzero, and is $o(g(N))$, if $\lim(a_N/g(N))$ is zero. Thus $a_N$ is $O(g(N))$ if it is of the
same order of magnitude as the function $g(N)$ and is $o(g(N))$ if it is of smaller order
of magnitude than $g(N)$. For example, $(3/N) + (5/N)^2$ is $O(1/N)$ or $O(N^{-1})$, as it
behaves for large $N$ like a constant times $N^{-1}$ and is $o(N^{-1/2})$ but larger than

This  notation  has  been  extended  to  stochastic  orders  of  magnitude  of  sequences 
of  random  variables.  The  notation  becomes  (Op,  op)  notation. 


Definition A.22 (Stochastic Order of Magnitude): A sequence of random variables
$b_N$ is $O_p(g(N))$ if

$0 < \mathrm{plim}\, \dfrac{b_N}{g(N)} < \infty$

and is $o_p(g(N))$ if

$\mathrm{plim}\, \dfrac{b_N}{g(N)} = 0$.


Most often $g(N) = N^{-c}$ for some constant $c > 0$. An estimator $\hat{\theta}$ consistent for $\theta_0$
can be written as $\hat{\theta} = \theta_0 + o_p(1)$, since it equals $\theta_0$ plus a term that goes to zero in
probability. An estimator $\hat{\theta}$ that is additionally root-$N$ consistent for $\theta_0$ can be written
as $\hat{\theta} = \theta_0 + O_p(N^{-1/2})$, since then $N^{1/2}(\hat{\theta} - \theta_0) = O_p(1)$.
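
The nonstochastic example given earlier in this section, $a_N = (3/N) + (5/N)^2$, can be checked numerically. The short sketch below (the function name `a` is just a label chosen here) prints the two ratios that appear in the definitions: $a_N / N^{-1}$, which should settle at the finite nonzero constant 3, and $a_N / N^{-1/2}$, which should shrink toward zero.

```python
def a(n):
    """The sequence a_N = 3/N + (5/N)^2."""
    return 3.0 / n + (5.0 / n) ** 2

for n in (10, 1_000, 100_000, 10_000_000):
    ratio_big_o = a(n) / (1.0 / n)       # a_N / N^{-1}: approaches 3, so a_N is O(1/N)
    ratio_little_o = a(n) / n ** -0.5    # a_N / N^{-1/2}: approaches 0, so a_N is o(N^{-1/2})
    print(n, ratio_big_o, ratio_little_o)
```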


A.8.  Other  Results 

This  section  contains  some  key  finite  sample  results  on  conditional  expectation  and  on 
the  interchange  of  expectations  and  transformation. 

Theorem (Law of Iterated Expectations): For random variables $Y$ and $X$

$\mathrm{E}[Y] = \mathrm{E}_X[\mathrm{E}_{Y|X}[Y|X]]$,

where $\mathrm{E}[Y]$ denotes the unconditional or marginal mean of $Y$, $\mathrm{E}_X[\cdot]$ denotes unconditional
expectation with respect to the marginal cdf of $X$, and $\mathrm{E}_{Y|X}[\cdot|X]$
denotes conditional expectation with respect to the conditional distribution of $Y$
given $X$.

This result means that if we first obtain the conditional mean of $Y$ given $X$, and
then take the expected value over $X$, we will obtain the unconditional mean of $Y$. See
Rao (1973, p. 97) for a proof. For example, if $\mathrm{E}[u|\mathbf{x}] = 0$ then $\mathrm{E}[u] = \mathrm{E}_{\mathbf{x}}[\mathrm{E}[u|\mathbf{x}]] = \mathrm{E}_{\mathbf{x}}[0] = 0$.
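
For a discrete distribution the law of iterated expectations can be verified exactly. The following sketch uses a small joint pmf invented purely for illustration and exact rational arithmetic, so the two ways of computing $\mathrm{E}[Y]$ can be compared without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint pmf Pr[X = x, Y = y], chosen only for illustration.
joint = {
    (0, 1): F(1, 8), (0, 3): F(3, 8),
    (1, 1): F(1, 4), (1, 5): F(1, 4),
}

def marginal_x(joint):
    """Pr[X = x], obtained by summing the joint pmf over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px

def cond_mean_y(joint, x, px):
    """E[Y | X = x]."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / px[x]

px = marginal_x(joint)
e_y_direct = sum(y * p for (_, y), p in joint.items())               # E[Y]
e_y_iterated = sum(cond_mean_y(joint, x, px) * px[x] for x in px)    # E_X[E[Y|X]]
print(e_y_direct, e_y_iterated)
```

Both routes give 11/4 for this pmf: the conditional means are 5/2 and 3, each weighted by probability 1/2.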

Theorem (Decomposition of Variance): For random variables $Y$ and $X$

$\mathrm{V}[Y] = \mathrm{E}_X[\mathrm{V}_{Y|X}[Y|X]] + \mathrm{V}_X[\mathrm{E}_{Y|X}[Y|X]]$,

where $\mathrm{V}[Y]$ denotes the unconditional variance of $Y$, $\mathrm{E}_X[\cdot]$ denotes unconditional
expectation with respect to the marginal cdf of $X$, $\mathrm{V}_{Y|X}[Y|X]$ denotes the
conditional variance of $Y$ given $X$, $\mathrm{V}_X[\cdot]$ denotes variance with respect to the
unconditional distribution of $X$, and $\mathrm{E}_{Y|X}[\cdot|X]$ denotes conditional expectation
with respect to the conditional distribution of $Y$ given $X$.




In  words,  the  unconditional  variance  of  Y  equals  the  sum  of  (1)  the  expected  value 
(over  X)  of  the  conditional  variance  and  (2)  the  variance  (over  X)  of  the  conditional 
mean.  A  simple  way  to  remember  this  is  to  recognize  that  the  unconditional  variance 
equals  EV  plus  VE.  See  Rao  (1973,  p.  97)  for  a  proof. 
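
The "EV plus VE" rule can likewise be confirmed exactly on a small discrete example. The joint pmf below is hypothetical and chosen only so that the numbers are easy to follow; the helper name `moments_given_x` is a label introduced here.

```python
from fractions import Fraction as F

# Hypothetical joint pmf Pr[X = x, Y = y], for illustration only.
joint = {
    (0, 1): F(1, 8), (0, 3): F(3, 8),
    (1, 1): F(1, 4), (1, 5): F(1, 4),
}

def moments_given_x(joint, x):
    """Return (Pr[X = x], E[Y | X = x], V[Y | X = x])."""
    px = sum(p for (xx, _), p in joint.items() if xx == x)
    m1 = sum(y * p for (xx, y), p in joint.items() if xx == x) / px
    m2 = sum(y * y * p for (xx, y), p in joint.items() if xx == x) / px
    return px, m1, m2 - m1 ** 2

xs = sorted({x for x, _ in joint})
parts = [moments_given_x(joint, x) for x in xs]

mean_y = sum(px * m for px, m, _ in parts)
ev = sum(px * v for px, _, v in parts)                    # E_X[V[Y|X]]
ve = sum(px * (m - mean_y) ** 2 for px, m, _ in parts)    # V_X[E[Y|X]]
var_y = sum(y * y * p for (_, y), p in joint.items()) - mean_y ** 2
print(var_y, ev, ve)
```

Here the unconditional variance is 39/16, which splits into an expected conditional variance of 19/8 and a variance of the conditional mean of 1/16.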

Theorem (Jensen’s Inequality): If $Z$ is a random variable such that $\mathrm{E}[Z]$ exists,
and $g(\cdot)$ is a convex function, then

$g(\mathrm{E}[Z]) \le \mathrm{E}[g(Z)]$.

If instead $g(\cdot)$ is a concave function then

$g(\mathrm{E}[Z]) \ge \mathrm{E}[g(Z)]$.

This result, proved in Rao (1973, p. 58), is very important for nonlinear models.
It emphasizes the difference between behavior of the average individual and
average behavior. For example, suppose an exponential model is appropriate, with
$\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$. Then since the exponential function is convex, Jensen’s Inequality
implies that $\exp(\mathrm{E}[\mathbf{x}'\boldsymbol{\beta}]) \le \mathrm{E}[\exp(\mathbf{x}'\boldsymbol{\beta})]$. The conditional mean evaluated at the individual
with average characteristics $\mathbf{x} = \mathrm{E}[\mathbf{x}]$ is no greater than the unconditional mean
$\mathrm{E}[y] = \mathrm{E}[\mathrm{E}[y|\mathbf{x}]] = \mathrm{E}[\exp(\mathbf{x}'\boldsymbol{\beta})]$.
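
A quick numerical check makes the direction of Jensen's inequality concrete for the exponential function, which is convex. The index values below are hypothetical and treated as equally likely; they stand in for $\mathbf{x}'\boldsymbol{\beta}$ across four individuals.

```python
import math

# Hypothetical values of the index x'beta, equally likely; illustration only.
index_values = [-1.0, 0.0, 0.5, 2.0]

mean_index = sum(index_values) / len(index_values)
at_average = math.exp(mean_index)   # exp(E[x'beta]): outcome for the "average individual"
average_outcome = sum(math.exp(v) for v in index_values) / len(index_values)  # E[exp(x'beta)]
print(at_average <= average_outcome)
```

For these values the outcome at the average index is about 1.45 while the average outcome is about 2.60, so evaluating a convex conditional mean at average characteristics understates the population mean.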


A.9. Bibliographic Notes

A classic source with proofs is Rao (1973, pp. 108-130), whom we cite wherever possible. The
results summarized also draw heavily on the books by Amemiya (1985, Chapter 3) and White
(2001a).

Graduate-level textbooks such as Greene (2003) provide summaries of key results. More
advanced texts by Davidson and MacKinnon (1993), Hendry (1995), Ruud (2000), and
Wooldridge (2002) provide treatments at least as detailed as that here. Davidson (1994) provides
a book-length treatment of stochastic theory for the econometrician. As already noted, terminology
can differ across references, especially in the use of Slutsky’s Theorem and Cramer’s
Theorem.


Making Pseudo-Random Draws


In this appendix we state the density or probability mass functions and first two moments
of leading univariate distributions and present methods to generate random draws
from these distributions.

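Table B.1. Continuous Random Variable Densities and Moments^a

Random Variable | pdf f(x) | Mean; Variance
Uniform $\mathcal{U}[a, b]$ | $1/(b - a)$ | $(a + b)/2$; $(a - b)^2/12$
Normal $\mathcal{N}[\mu, \sigma^2]$ | $\dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$; $\sigma^2$
Exponential $\mathcal{E}[\lambda]$ | $\lambda e^{-\lambda x}$, $\lambda > 0$ | $1/\lambda$; $1/\lambda^2$
Gamma $\mathcal{G}[a, b]$ | $\dfrac{x^{a-1} e^{-x/b}}{\Gamma(a)\, b^{a}}$ | $ab$; $ab^2$
Beta $\mathcal{B}[a, b]$ | $\dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1 - x)^{b-1}$ | $\dfrac{a}{a+b}$; $\dfrac{ab}{(a+b)^2(a+b+1)}$
Logistic $\mathcal{L}[a, b]$ | $e^{-(x-a)/b} / [b(1 + e^{-(x-a)/b})^2]$, $-\infty < a < \infty$ | $a$; $(b\pi)^2/3$
Chi-Square $\chi^2(n)$ | $\dfrac{x^{n/2-1} e^{-x/2}}{\Gamma(n/2)\, 2^{n/2}}$ | $n$; $2n$
t $t(v)$ | $\dfrac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{v\pi}} \left(1 + \dfrac{x^2}{v}\right)^{-\frac{v+1}{2}}$ | $0$; $\dfrac{v}{v-2}$, for $v > 2$
F $F(w, v)$ | $\dfrac{\Gamma\left(\frac{w+v}{2}\right)(v/w)^{v/2}}{\Gamma\left(\frac{w}{2}\right)\Gamma\left(\frac{v}{2}\right)}\, x^{\frac{w}{2}-1} \left(x + \dfrac{v}{w}\right)^{-\frac{w+v}{2}}$ | $\dfrac{v}{v-2}$, for $v > 2$; $\dfrac{2v^2(v+w-2)}{w(v-4)(v-2)^2}$, for $v > 4$

a All parameters are restricted as follows: $b > a$ for the Uniform; $\mu$ unrestricted, $\sigma^2 > 0$ for the Normal; $\lambda > 0$
for the Exponential; $a, b > 0$ for the Gamma; $a, b > 0$ for the Beta; $a$ unrestricted and $b > 0$ for the Logistic;
$v$ is an integer for the t-distribution; for the F-distribution $v$ and $w$ must be integers.

Table B.2. Continuous Random Variable Generators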


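Random Variable | Range of Values | Random Variable Generator

Uniform $\mathcal{U}[a, b]$ | $a < x < b$ | $x = a + (b - a)r$, $r \sim \mathcal{U}[0, 1]$

Normal $\mathcal{N}[\mu, \sigma^2]$ | $-\infty < x_1, x_2 < \infty$ | $x_1 = \mu + \sigma\sqrt{-2\ln(r_1)}\cos(2\pi r_2)$
    $x_2 = \mu + \sigma\sqrt{-2\ln(r_1)}\sin(2\pi r_2)$
    [$r_1, r_2 \sim \mathcal{U}[0, 1]$; the resulting pair $x_1$ and $x_2$ are independent random variables.]

Exponential $\mathcal{E}[\lambda]$ | $0 < x < \infty$ | $x = -\frac{1}{\lambda}\ln(r)$

Gamma $\mathcal{G}[a, b]$ | $0 < x < \infty$ | (i) $x = -b\ln\left(\prod_{i=1}^{a} r_i\right)$ or $x = \sum_{i=1}^{a} E_i$
    (ii) $x = -b\left[\ln\left(\prod_{i=1}^{m} r_i\right) - y_1 y_2\right]$
    (i) $r_i \sim \mathcal{U}[0, 1]$; $a$ is integer. $E_i$s are iid exponential random variates.
    When $a = 1$, we have an exponential random variable.
    (ii) $a$ is non-integer, $a = m + q$, $0 < q < 1$, $m$ integer;
    $y_1$, $y_2$ are independent $\mathcal{B}(q, 1 - q)$ and $\mathcal{E}(1)$.

Beta $\mathcal{B}[a, b]$ | $0 < x < 1$ | (i) $x = y_1/(y_1 + y_2)$
    (ii) $x = r_1^{1/a}/(r_1^{1/a} + r_2^{1/b})$, $(r_1^{1/a} + r_2^{1/b}) \le 1$
    (i) $a$, $b$ are integers, $y_1$ is $\mathcal{G}(k, a)$, $y_2$ is $\mathcal{G}(k, b)$.
    $k$ can be chosen arbitrarily.
    (ii) $a$, $b$ are non-integer; $r_i \sim \mathcal{U}[0, 1]$; successive pairs of $r_1$ and $r_2$ are
    generated until $(r_1^{1/a} + r_2^{1/b}) \le 1$.

Logistic $\mathcal{L}[a, b]$ | $-\infty < x < \infty$ | $x = a + b\ln\left(\frac{r}{1-r}\right)$
    [$r \sim \mathcal{U}[0, 1]$]

Chi-Square $\chi^2(n)$ | $0 < x < \infty$ | $x = \sum_{i=1}^{n} y_i^2$
    [$n$ is an integer; $y_i$s are independent $\mathcal{N}(0, 1)$.]

t $t(v)$ | $-\infty < x < \infty$ | $x = y_1/\sqrt{y_2/v}$
    [$y_1$ is $\mathcal{N}(0, 1)$; $y_2$, independent of $y_1$, is $\chi^2(v)$.]

F $F(w, v)$ | $0 < x < \infty$ | $x = (y_1/w)/(y_2/v)$
    [$y_2$ and $y_1$ are independent $\chi^2(v)$, $\chi^2(w)$ respectively.]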
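
Several of the continuous generators in Table B.2 are short enough to write out directly. The sketch below is one possible rendering using only Python's standard library as the source of U[0,1] draws; the function names, the seed, and the replication count are choices made for this illustration and are not part of the table. It covers the uniform, normal (Box-Muller), exponential, logistic, chi-square, t, and F cases and ends with a crude check of two sample means against their population values.

```python
import math
import random

rng = random.Random(7)   # source of U[0,1] draws; the seed is arbitrary

def unif():
    """U(0,1) draw that is never exactly 0, so that log(r) is always defined."""
    r = rng.random()
    while r == 0.0:
        r = rng.random()
    return r

def uniform(a, b):
    return a + (b - a) * unif()

def normal_pair(mu, sigma):
    """Box-Muller: two independent N[mu, sigma^2] draws from two uniforms."""
    r1, r2 = unif(), unif()
    radius = sigma * math.sqrt(-2.0 * math.log(r1))
    return (mu + radius * math.cos(2.0 * math.pi * r2),
            mu + radius * math.sin(2.0 * math.pi * r2))

def exponential(lam):
    """Inverse-transform draw with mean 1/lam."""
    return -math.log(unif()) / lam

def logistic(a, b):
    """Inverse-transform draw with location a and scale b."""
    r = unif()
    return a + b * math.log(r / (1.0 - r))

def chi_square(n):
    """Sum of n squared independent standard normals."""
    return sum(normal_pair(0.0, 1.0)[0] ** 2 for _ in range(n))

def student_t(v):
    return normal_pair(0.0, 1.0)[0] / math.sqrt(chi_square(v) / v)

def f_dist(w, v):
    return (chi_square(w) / w) / (chi_square(v) / v)

def sample_mean(draw, reps=20000):
    return sum(draw() for _ in range(reps)) / reps

print(round(sample_mean(lambda: exponential(2.0)), 2))   # population mean is 1/2
print(round(sample_mean(lambda: chi_square(3)), 2))      # population mean is 3
```

In practice one would usually call a library routine (for example the `random` module's own `gauss` or `expovariate`) rather than hand-code these transformations; the point here is only to show how each row of the table maps uniform draws into the target distribution.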
Table B.3. Discrete Random Variable Probability Mass Functions and Moments


Random Variable^a | pmf f(x) | Mean; Variance
Binomial $\mathcal{B}i[n, p]$ | $\binom{n}{x} p^x (1 - p)^{n-x}$ | $np$; $np(1 - p)$
Poisson $\mathcal{P}[\lambda]$ | $e^{-\lambda}\lambda^x / x!$ | $\lambda$; $\lambda$
Negative binomial $\mathcal{NB}[n, p]$ | $\binom{n+x-1}{x} p^n (1 - p)^x$ | $\dfrac{n(1-p)}{p}$; $\dfrac{n(1-p)}{p^2}$

a For the binomial $0 < p < 1$ and $n$ is a positive integer; for the Poisson $\lambda > 0$; and for the negative binomial
$0 < p < 1$, $n > 1$.

Table B.4. Discrete Random Variable Generators


Random Variable | Range of Values | Random Variable Generator

Binomial $\mathcal{B}i(n, p)$ | $x = 0, 1, \ldots, n$ |
    set x = 0
    do the loop n times:
        generate r uniform on [0, 1]
        if r < p, then x = x + 1
    output x

Poisson $\mathcal{P}(\lambda)$ | $x = 0, 1, \ldots$ |
    set x = 0; t = 0
    do the loop:
        generate exponential random variable y with mean 1
        set t = t + y
        if t > λ, exit the loop
        x = x + 1
    output x

Negative binomial $\mathcal{NB}(n, p)$ | $x = 0, 1, \ldots$ |
    generate λ from $\mathcal{G}(n, \frac{1-p}{p})$
    generate x from $\mathcal{P}(\lambda)$
    output x
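
The three discrete generators in Table B.4 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function names and seed are invented here, the Poisson draw counts unit-mean exponential interarrival times falling in the interval [0, λ], and the gamma draw uses the standard library's `random.gammavariate(shape, scale)` in place of a hand-coded gamma generator.

```python
import random

rng = random.Random(11)   # arbitrary fixed seed

def binomial(n, p):
    """Count successes in n Bernoulli(p) trials, one uniform draw per trial."""
    return sum(1 for _ in range(n) if rng.random() < p)

def poisson(lam):
    """Count unit-mean exponential interarrival times that fit into [0, lam]."""
    x, t = 0, rng.expovariate(1.0)
    while t <= lam:
        x += 1
        t += rng.expovariate(1.0)
    return x

def negative_binomial(n, p):
    """Poisson draw whose mean is a gamma draw with shape n and scale (1-p)/p."""
    lam = rng.gammavariate(n, (1.0 - p) / p)
    return poisson(lam)

def mean_and_variance(draws):
    m = sum(draws) / len(draws)
    return m, sum((d - m) ** 2 for d in draws) / (len(draws) - 1)

nb = [negative_binomial(3, 0.4) for _ in range(20000)]
print(mean_and_variance(nb))   # population values are 4.5 and 11.25
```

The printed sample moments should be close to the negative binomial mean $n(1-p)/p = 4.5$ and variance $n(1-p)/p^2 = 11.25$ from Table B.3, which is a convenient check that the gamma mixing step uses the intended parameterization.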
ACD. See average completed duration
acronyms,  17 

AD  estimator.  See  average  derivative 
adaptive  estimator,  323,  328,  684 
adding-up  constraints,  210 
additive  model,  323,  327,  523 
additive  random  utility  model  (ARUM) 
binary  outcome  models,  476-8 
generalized  random  utility  models,  515-6 
identification,  504 

multinomial  outcome  models,  504-7 
nested  logit  model,  509,  526-7 
RPL  model,  513 
welfare  analysis  in,  506-7 
admissible  estimator,  435 
AFT.  See  accelerated  failure  time 
aggregated  data 

binary  outcomes,  480-2 
cohort-level,  772 
nonlinear  models,  482,  487 
multinomial  outcomes,  513 
time-aggregated  durations,  578,  600-3 
see  also  discrete-time  duration  data 

AIC.  See  Akaike  information  criterion 

AID.  See  average  interrupted  duration 

Akaike  information  criterion  (AIC),  278-9,  284,  624 
almost  sure  convergence,  947-8 
analog  estimator,  135 
analogy  principle,  135 

and  method  of  moments  estimators,  167 
analysis  of  covariance,  733 
analysis  of  variance,  733 
Anscombe  residual,  289 


antithetic  sampling,  408-9,  445 
applications  with  data 

competing  risks  models,  658-62 
duration  models,  603-8,  632-6 
IV  estimation,  110-2 
kernel  regression,  295-7,  300 
logit  and  probit  models,  464-6,  486 
multinomial  and  nested  logit  models,  491-5,  511 
Poisson and negative binomial models, 671-4, 690
panel  fixed  and  random  effects  estimation,  708-15 
panel  GMM  linear  estimation,  754-6 
panel  nonlinear  estimation,  792-5 
quantile  regression,  88-90 
selection  and  two-part  models,  553-6,  565 
survival  function,  574-5,  582 
treatment  evaluation  estimation,  889-96 
see  also  data  sets  used  in  applications 
Archimedean  family,  654 
Arellano-Bond  estimator,  765-6,  777 
application, 754-6
nonlinear  models,  791 
unit  roots,  768 

ARMA. See autoregressive moving average
artificial  nesting,  283 

ARUM.  See  additive  random  utility  model 
asymptotic  distribution,  953-4 
asymptotic  efficiency,  954 
asymptotic  normal  distribution,  953 
definition,  74,  120,  953 
estimated  asymptotic  variance,  954 
of  extremum  estimators,  127-31 
of  FGLS  estimator,  82-3 
of  FGNLS  estimator,  156-7 
of  first-differences  estimator,  730-1 
of  fixed  effects  estimator,  727-9 
of GMM estimator, 173-4, 182-3, 185-6, 194-5,
745-6
of Hausman test statistic, 271-4
of kernel density estimator, 301-2, 330-1
of  kernel  regression  estimator,  313,  331-3 
of  LM  test  statistic,  235,  237-8 
of  LR  test  statistic,  235,  237 
of m-estimators, 119-21
of  MD  estimator,  292 
of  ML  estimator,  142-3 
of  MM  estimator,  134,  174 
of  MSL  estimator,  394-5 
of  MSM  estimator,  400-2 
of  m-test  statistics,  260,  263 
of  NLS  estimator,  152^1 
of  NL2SLS  estimator,  195-6 
of  OIR  test  statistic,  181,  183 
of  OLS  estimator,  73-4,  80-1 
of  panel  GMM  estimator,  745-6 
of  quasi-ML  estimator,  146 
of  random  effects  estimator,  735 
of  Wald  test  statistic,  226-8 
see  also  asymptotic  theory 
asymptotic  efficiency,  954 
of  optimal  GMM,  177 
asymptotic  refinement,  359,  371-2 
by  bootstrap,  256,  363-7,  371-2,  378-9 
definition,  359 

by  Edgeworth  expansion,  371-2 
by  nested  bootstrap,  374,  379 
asymptotic  theory  definitions,  943-55 
asymptotic  distribution,  953 
asymptotic  variance,  954 
central  limit  theorems,  949-52 
consistency,  945 

convergence  in  distribution,  948-9 
convergence  in  probability,  944-7 
laws  of  large  numbers,  947-8 
limit  distribution,  948 
limit  variance,  952-3 
stochastic  order  of  magnitude,  954 
summary  of  definitions  and  theorems,  944 
asymptotic  variance,  74,  120,  954 

estimated  asymptotic  variance,  74,  954 
see  also  asymptotic  distribution 
asymptotically pivotal statistic, 359-60, 363-4, 366,
372,  374,  379-80 
ATE.  See  average  treatment  effect 
ATET.  See  average  treatment  effect  on  the  treated 
attenuation  bias,  903-5,  911,  915,  919-20 
attrition  bias,  739,  800-1,  940 
augmented  regression  model,  429 
autocorrelation 

in  panel  model  errors,  705-8,  714-5,  722-5,  745-6 
dynamic  panel  models,  763-8,  791-2,  797-9, 
806-7 

see  also  panel-robust  inference 
autoregressive  moving  average  (ARMA)  errors 
definition,  159 
NLS  estimator,  159 
panel  data,  722-5,  729 


auxiliary  model,  404 
auxiliary  regression 
bootstrapping,  379,  382 
example,  241-3,  269-71 
Hausman  test,  276,  718-9 
LM  test,  240-1,  274 
m-test,  261-4,  544 

available  case  analysis.  See  pairwise  deletion 
average  completed  duration  (ACD),  626 
average  derivative  (AD)  estimator 
definition,  326 
uses,  317,  483 

average  interrupted  duration  (AID),  626 
average  selection  bias,  868 
average  squared  error,  315 
average  treatment  effect  (ATE),  33-4,  866-71 
definition,  866 
difficulties  estimating,  866 
local  ATE,  883-6 
matching  estimators,  871-8 
potential outcome model, 33-4
selection  on  observables  only,  868-9 
selection  on  unobservables,  868-71 
see  also  ATET;  LATE;  MTE 
average  treatment  effect  on  the  treated  (ATET), 
866-78 

application,  889-6 
definition,  866 
difficulties  estimating,  866 
matching  estimators,  871-8,  894-6 
selection  on  observables  only,  868-9 
selection  on  unobservables,  868-71 
see  also  ATE;  LATE;  MTE 
averaged  data.  See  aggregated  data 

backward  recurrence  time,  626 
balanced  bootstrap,  374 
balanced  repeated  replication,  855 
balancing condition, 864, 893-4
bandwidth, 299, 307, 312
bandwidth choice for kernel density estimator, 302-4
cross  validation,  304 
example,  296-7 
optimal,  303,  306 
Silverman’s  plug-in  estimate,  304 
bandwidth  choice  for  kernel  regression  estimator, 
314-6 

cross  validation,  314-6 
example,  297,  316 
optimal,  314,  318 
plug-in  estimate,  314 
baseline  hazard,  591 
in  AFT  model,  592 

identification  in  mixture  models,  618-20 
in  multiple  spells  models,  655-6 
in  PH  model,  591,  596-7,  601-2 
Bayes  factors,  456-8 
Bayes rule. See Bayes theorem
Bayes theorem, 421
example,  422-4,  435-9 
Bayesian  central  limit  theorem,  433 
Bayesian  information  criterion  (BIC),  278,  284 
see  also  AIC 

Bayesian  methods,  419-59 
Bayes  1764  example,  458-9 
Bayesian  approach,  420-35 
binary  outcome  models,  475 
compared  to  non-Bayesian,  164,  424-5,  432-41, 
439—41 

count  models,  687 

data  augmentation,  454-5,  932-3,  935-9 
decision  analysis,  434-5 
examples,  452-4 
hierarchical  linear  model,  847 
importance  sampling,  443-5 
linear  regression,  435-43,  449-50,  452-4 
Markov  chain  Monte  Carlo  simulation,  445-54, 
935-9 

measurement  error  model,  915 
mixed  linear  model,  775 
model  selection,  456-8 
multinomial  outcome  models,  514,  519 
panel  data,  775,  809 
posterior  distribution,  421,  430^1 
prior  distribution,  425-30 
Tobit  model,  563 

BCA  method.  See  bias-corrected  and  accelerated 
before-after  comparison 
application,  890-1 
Berkson  error  model,  920 
Berkson’s  minimum  chi-square  estimator,  480-1 
Berndt, Hall, Hall, and Hausman (BHHH) estimate,
138, 241, 395
Berndt, Hall, Hall, and Hausman (BHHH) iterative
method, 343-4
Bernoulli distribution, 140, 148, 468, 475, 483
Bernstein-von Mises Theorem, 433, 459
best  linear  unbiased  predictor,  738,  776 
between  estimator,  702,  736,  841 
application,  710-3 
between-group  variation,  709,  733 
between  model,  702 

BFGS algorithm. See Broyden, Fletcher, Goldfarb, and
Shanno
BHHH estimate. See Berndt, Hall, Hall, and Hausman
BHHH method. See Berndt, Hall, Hall, and Hausman
bias-corrected  and  accelerated  (BCA)  bootstrap 
method,  360 

biased  sampling,  42-5,  626-7 

see  also  sample  selection;  endogenous  stratification 
BIC.  See  Bayesian  information  criterion 
binary  endogenous  variable,  562 
binary  outcome  models,  463-89 

additive  random  utility  model,  476-8 
aggregated  data,  480-2 
alternative-invariant  regressors,  478 


alternative-varying  regressors,  478 
choice-based  samples,  478-9 
corrected  score  estimator,  916-8 
definition,  466 
example,  464-5 
identification,  476,  483 
index  function  model,  475-6 
marginal  effects,  467,  470-1 
measurement  error  in  dependent  variable,  914 
measurement  error  in  regressors,  919 
ML  estimator,  468-9 
model  misspecification,  472 
multiple  imputation  example,  937-8 
OLS  estimator,  471 
panel  data,  795-9 
semiparametric  estimation,  482-6 
see  also  logit  models;  probit  models 
binding  function,  404-5 
bivariate  counts,  215,  685-7 
bivariate  negative  binomial  distribution,  686-7 
bivariate  ordered  probit  model,  523 
bivariate  Poisson  distribution,  686 
bivariate  Poisson-lognormal  mixture,  686 
bivariate  probit  model,  522-3 
bivariate  sample  selection  model,  547-53 
application,  553-5 
bounds,  566 

conditional  mean,  548-50 
conditional  variance,  549-50 
definition,  547 

Heckman  two-step  estimator,  550-1 
identification,  551,  565-6 
marginal  effects,  552 
ML  estimator,  548 
outcome  equation,  547 
participation  equation,  547 
semiparametric  estimator,  565-6 
versus  two-part  model,  546,  552-3 
Bonferroni  test,  230 
bootstrap  hypothesis  tests 

asymptotic refinement, 363-4, 366-7, 371-2,
378-9 

bootstrap  critical  value,  256,  363 
bootstrap  p-value,  256,  363 
example,  366-8 
nonsymmetrical  test,  363,  380 
power,  372-3 
symmetrical  test,  363 

without  asymptotic  refinement,  363,  367-8, 
378 

bootstrap  methods,  357-83 
asymptotic  refinement,  359,  366-7 
bias  estimate,  365 
bias-corrected  estimator,  365,  368 
clustered  data,  363,  377-8,  845 
confidence  intervals,  364-5,  368 
consistency,  369-70 
critical value, 363
examples, 254-6, 366-8
for  functions  of  parameters,  363 
general  algorithm,  360 
for  GMM,  379-80 
heteroskedastic  data,  363,  376-7 
introduction,  254-6 
for  nonsmooth  estimators,  373,  380-1 
number  of  bootstrap  samples,  361-2 
panel  data,  363,  377-8,  708,  746,  751 
p- value,  363 
recentering,  374,  379 
rescaling,  374 
sampling  methods  for,  360 
smoothness  requirements,  370 
standard  error  estimate,  362,  366 
time  series  data,  381 
variance  estimate,  362 
without  asymptotic  refinement,  358,  367-8 
see  also  bootstrap  hypothesis  tests 
bounds  identification,  29 

in  measurement  error  models,  906-8 
bounds  in  selection  model,  566 
Broyden,  Fletcher,  Goldfarb,  and  Shanno  (BFGS)
algorithm,  344 

CAIC.  See  consistent  Akaike  information  criterion 
calibrated  bootstrap,  374 
caliper  matching,  874,  895 
canonical  link  function,  149,  469,  783 
case-control  analysis,  479,  823 
causality,  18-38 
examples,  69-70,  98 
Granger  causality,  22 
identification  frameworks  and  strategies,  35-7

in  linear  regression  model,  68-9 
in  potential  outcome  models,  32-4,  862-5 
in  simultaneous  equations  model,  26-7 
in  single-equation  model,  31 
and  weighting,  820-1 
see  also  endogeneity 
cdf.  See  cumulative  distribution  function 
censored  least  absolute  deviations  (CLAD)  estimator, 
564-5,  808 

censored  models,  530-44,  579-80 
conditional  mean,  535 
count  models,  680 
definitions,  532,  579-80 
examples,  530-1,  535 
ML  estimator,  533-4
semiparametric  estimation,  563-5 
see  also  duration  model;  selection  models;  Tobit 
models;  truncated  models 

censored  normal  regression  model.  See  Tobit  model 
censoring  mechanisms,  532,  579-80 
censoring  from  above,  532,  579 
censoring  from  below,  532,  579 
left  censoring,  532,  579,  588 


independent  censoring,  580 
interval  censoring,  579,  588 
noninformative  censoring,  580 
random  censoring,  579 
right  censoring,  532,  579,  581,  589 
sample  selection,  44-5,  547 
type  1  censoring,  579 
type  2  censoring,  580 
census  coefficient,  819 
central  limit  theorem  (CLT),  949-52
Cramer  linear  transformation,  952 
Cramer-Wold  device,  951
definition,  950 
examples  of  use,  80,  130 
Liapounov  CLT,  950 
Lindeberg-Levy  CLT,  950 
multivariate,  951-2
sample  average,  949 
sampling  scheme,  131,  950 
CGF  tests.  See  chi-square  goodness-of-fit 
characteristic  function,  370,  913,  950 
chatter,  394,  410 
Chebychev’s  inequality,  946 
chi-square  goodness-of-fit  (CGF)  tests,  266-7,  270-1, 
474 

choice-based  samples,  823 
binary  outcome  models,  478-9 
see  also  endogenous  stratification 
Choleski  decomposition,  416,  448 
CL  model.  See  conditional  logit 
CLAD  estimator.  See  censored  least  absolute 
deviations 
Clayton  copula,  654 
CLT.  See  central  limit  theorem 
clustered  data,  829-53 
application,  848-53 
cluster  bootstrap,  363,  377-8,  845 
cluster-robust  inference,  707,  834,  842,  845

cluster  sampling,  41-2 
cluster-specific  effects,  830-2,  837-45 
comparison  to  panel  data,  831-2 
diagnostic  tests,  841 
dummy  variables  model,  840 
fixed  effects  estimator,  840-1,  843-5 
hierarchical  models,  845-8 
large  clusters,  832 
nonlinear  models,  841-5 
OLS  estimator,  75,  833-7 
quasi-ML  estimator,  150 
random  effects  estimator,  837-9,  843 
small  clusters,  832 
see  also  panel  data 
cluster-robust  standard  errors 
bootstrap,  363,  377-8,  845 
clustered  data,  834,  842 
panel  data,  706-7,  745-6,  789 
see  also  robust  standard  errors
cluster-specific  fixed  effects  (CSFE)  estimator,
839-41,  843-4
application,  848-53 
between  estimator,  840-1 
nonlinear  models,  843-4 
within  estimator,  140-1 

cluster-specific  fixed  effects  (CSFE)  model,  831,  843
cluster-specific  random  effects  (CSRE)  estimator,
837-9,  843-4
application,  848-53

cluster-specific  random  effects  (CSRE)  model,  831,
843-4

cluster  variable,  707 

CM  tests.  See  conditional  moment 

coefficient  interpretation 

in  binary  outcome  models,  467,  473 
in  competing  risks  model,  646 
in  count  model,  669 
in  duration  models,  606-7 
in  misspecified  linear  model,  91-2 
in  multinomial  outcome  models,  493-4,  501-3 
in  nonlinear  models,  122-4,  162-3 
in  Tobit  model,  541-2 
see  also  marginal  effects 
coherency  condition,  562 
cohort-level  data.  See  pseudo  panels 
cointegration,  382,  767 
common  parameters,  801 
compensating  variation,  500-7,  512 
competing  risks  model  (CRM),  642-8,  658-62 
application,  658-62 
censoring,  642 

coefficient  interpretation,  646 
definitions,  642-4
dependent  risks,  647-8 
exit  route,  643 
identification,  646 
independent  risks,  644-6 
ML  estimator,  644-5 
proportional  hazards,  645-6 
spell  duration,  643 

with  unobserved  heterogeneity,  647,  659 
complementary  log-log  model,  466-7,  603 
complete  case  analysis.  See  listwise  deletion 
complex  surveys,  41-2,  814-6,  853-6 
composition  methods,  415 
computational  difficulties,  350-2 
concentration  parameter,  109 
conditional  analysis,  717 
conditional  expectations,  955-6 
conditional  independence  assumption,  23,  863,  865 
definition,  863 
for  participation,  863 
given  propensity  score,  865 
selection  on  observables  only,  868 
unconfoundedness,  863 
conditional  likelihood,  139-40,  824
panel  models,  731-2,  782-3,  796-9,  805 


conditional  logit  (CL)  model,  500-3,  524-5 
application,  491-4
definition,  500 

fixed  effects  binary  logit,  797,  844 
marginal  effects,  493,  501-3,  525 
ML  estimator,  501 
from  ARUM,  505 

see  also  multinomial  outcome  models 
conditional  ML  estimator,  731-2,  782-3,  796-9,  805, 
824 

conditional  moment  (CM)  tests,  264-5,  267-9,  319 
consistent  CM  test,  268 
in  duration  models,  632 
example,  269-71
in  Tobit  model,  544 
see  also  m-tests 
conditional  mean 

squared  error  loss,  67-9 
conditional  mode 
step  loss,  68 
condition  number,  350 
conditional  quantile 

asymmetric  absolute  loss,  68 
confidence  intervals,  231-2,  316,  364-5,  368 
consistent  Akaike  information  criterion  (CAIC),  278 
consistent  test  statistic,  248 
consistency 
definition,  945 

of  extremum  estimators,  125-7,  132-3 
of  GMM  estimator,  173-4,  182
of  m-estimator,  132-3 
of  ML  estimator,  142,  146-50 
of  NLS  estimator,  155 
of  OLS  estimator,  73,  80 
strong  consistency,  947 
weak  consistency,  947 

see  also  asymptotic  distribution;  identification; 
pseudo-true  value 

constant  coefficients  model.  See  pooled  model 
contagion,  612 
contamination  bias,  903-4

contemporaneous  exogeneity  assumption,  748-9,  752, 
781 

continuous  mapping  theorem,  949 
control  function  approach,  37 
control  function  estimator,  869-70,  890 
control  group,  49 
conventions,  16-17 
convergence  criteria,  339-40,  458
convergence  in  distribution,  948-9 
continuous  mapping  theorem,  949 
definition,  948 
limit  distribution,  948 
transformation  theorem,  949 
vector  random  variables,  949 
see  also  central  limit  theorem 
convergence  in  probability,  944-7 
alternative  modes  of  convergence,  945
consistency,  945
definition,  945 
probability  limit,  945 
Slutsky’s  theorem,  945 
uniform  convergence,  126,  301 
vector  random  variables,  945 
see  also  law  of  large  numbers 
copulas,  216,  651-5 
count  example,  687 
definition,  651-2 
dependence  parameter,  653-4
leading  examples,  654 
ML  estimator,  655 
survival  copulas,  652 

correlated  random  effects  model,  719,  786 
counterfactual,  32,  555,  861,  871 
see  also  potential  outcome  model 
count  data,  665 
examples,  665 
heteroskedasticity,  665 
right-skewness,  665 
see  also  count  models 
count  models,  665-93 
censored,  680 
application,  671-4,  690 
endogenous  regressors,  683,  687-9 
endogenous  sampling,  823 
finite  mixture  models,  678-9 
hurdle  models,  680-1
measurement  error  in  dependent  variable,  915
measurement  error  in  regressors,  915-8
mixture  models,  675-7
multivariate,  685-7
OLS  estimator,  684
negative  binomial  model,  675-7
NLS  estimator,  684
panel  data,  792-5,  802-8
Poisson  model,  666-74
sample  selection,  680
semiparametric  regression,  684-5
truncated,  679-80
zero-inflated,  681

covariance  matrix.  See  variance  matrix 
covariance  structures,  177,  379,  753,  766-7 
covariates.  See  regressors 
Cox  CRM  model.  See  competing  risks 
Cox  PH  model.  See  proportional  hazards 
Cox-Snell  residual,  289,  631,  633-6 
CPS.  See  Current  Population  Survey 
Cramer  linear  transformation,  952 
Cramer-Rao  lower  bound,  143,  954 

see  also  semiparametric  efficiency  bound 
Cramer’s  theorem,  949 
Cramer-Wold  device,  130,  951
CRM.  See  competing  risks  model 
cross-equation  parameter  restrictions,  210 
cross-section  data,  47 
cross-validation,  304,  314-6,  318,  321 


CSFE  estimator.  See  cluster-specific  fixed  effects 
CSRE.  See  cluster-specific  random  effects 
cumulant,  370 

cumulative  distribution  function  (cdf),  576
cumulative  hazard  function 
definition,  577-8 
in  competing  risks  model,  644-5 
as  diagnostic  tool,  631-2 
in  likelihood  function,  588 
Nelson-Aalen  estimator,  582-4,  605-6,  662
in  proportional  hazards  model,  590 
Current  Population  Survey  (CPS),  58,  814-5 
curse  of  dimensionality 

in  Bayesian  methods,  419-20 
multivariate  kernel  density  estimator,  306 
multivariate  kernel  regression  estimator,  319 
high-dimensional  integrals,  393 

data  augmentation,  454-5,  932 
imputation  step,  455,  932 
for  missing  data,  932-8 
prediction  step,  455,  933 
regression  example,  933 
data-generating  process  (dgp),  72-3,  124 
misspecified,  90,  132 
data  mining,  285-6 
data  sets.  See  microdata 
data  sets  used  in  applications 

Current  Population  Survey  Displaced  Workers 
Supplement  (McCall),  603-8,  632-6,  658-62 
fishing-mode  choice  data  (Kling  and  Herriges), 
463-6,  486,  491-5

National  Longitudinal  Survey  (Kling),  110-2
National  Supported  Work  demonstration  project 
(Dehejia  and  Wahba),  889-95 
Panel  Survey  of  Income  Dynamics  cross-section 
sample,  295-7,  300 

Panel  Survey  of  Income  Dynamics  panel  sample 
(Ziliak),  708-15,  754-6 
patents-R&D  panel  data  (Hausman,  Hall,  and 
Griliches),  792-5 

Rand  Health  Insurance  Experiment  expenditures, 
553-6,  565 

Rand  Health  Insurance  Experiment  medical  doctor 
contacts,  671-4,  692 

strike  duration  data  (Kennan),  574-5,  582 
Vietnam  World  Bank  Living  Standards  Survey,
88-90,  848-53 

see  also  applications  with  data 
data  structures,  39-62 
data  sources,  58-9 
handling  microdata,  59-61 
natural  experiments,  54-8 
observational  data,  40-8 
social  experiments,  48-54 
data  summary  approach  to  regression,  820 
Davidon,  Fletcher,  and  Powell  (DFP)  algorithm,  344, 
350-1
decomposition  of  variance,  955-6
degenerate  distribution,  948 
degrees-of-freedom  adjustment,  75,  102,  138,  185-6, 
278,  841 

delta  method,  231-2 

bootstrap  alternative,  363 
density  kernel,  421 

density-weighted  average  derivative  (DWAD) 
estimator,  326 
dependent  variable,  71
descriptive  approach  to  regression,  820 
deviance,  149,  244 
deviance  residual,  289,  291 
DFP  algorithm.  See  Davidon,  Fletcher,  and  Powell 
algorithm 

dgp.  See  data-generating  process 
diagnostic  tests.  See  specification  tests 
DID  estimator.  See  differences-in-differences 
differences-in-differences  (DID)  estimator,  55-7, 
768-70,  878-9 
application,  890-1 
consistency,  770 
definition,  768 
introduction,  55-7 
natural  experiments,  878 
with  controls,  878-9 
without  controls,  878 
direct  regression,  906 
disaggregated  data 

contrasted  with  aggregated  data,  5-10 
discrete  factor  models,  678 
see  also  finite  mixture  models 
discrete  outcomes.  See  binary  outcomes;  counts;
multinomial  outcomes
discrete-time  duration  data,  577-8,  600-3 
cumulative  hazard  function,  578 
discrete-time  proportional  hazards,  600-3 
gamma  heterogeneity,  620 
hazard  function,  578 
logit  model,  602 
ML  estimator,  601 
nonparametric  estimation,  581-4
probit  model,  602 
survivor  function,  578 
dissimilarity  parameter,  509 
disturbance  term.  See  error  term 
double  bootstrap,  374 
dummy  endogenous  variable  model,  557 
dummy  variable  estimator,  784-5,  800,  805,  840 
see  also  LSDV  estimator 
duration  data,  573-664 
different  types,  626,  641 
duration  models,  573-664 

accelerated  failure  time,  591-2 
applications,  574-5,  583,  589,  603-8,  632-6, 
658-62 

censoring,  579-82,  587-9,  595,  642 
competing  risks,  642-8,  658-62 


cumulative  hazard  function,  577-8 
discrete  time,  577-8,  600-3 
generalized  residual,  631
hazard  function,  576,  578 
key  concepts,  576-8 
mixture  models,  613-25 
ML  estimator,  587-9 
multiple  spells,  655-8 
multivariate,  648-55 
nonparametric  estimators,  580-4
OLS  estimator,  590-1 
panel  data,  801-2 
parametric  models,  584-91 
proportional  hazards,  592-7 
risk  set,  581,  594 

semiparametric  estimation,  594-600,  610-2 
specification  tests,  628-32 
survivor  function,  576,  578 
time-varying  regressors,  597-600 
see  also  proportional  hazards  model 
DWAD  estimator.  See  density-weighted  average 
derivative 

dynamic  panel  models,  763-8,  791-2,  797-9,  806-7
Arellano-Bond  estimator,  765-6
binary  outcome  models,  806-7
count  models,  806-7
covariance  structures,  766-7
inconsistency  of  standard  estimators,  764-5
initial  conditions,  764-5
IV  estimators,  764-5
linear  models,  763-8
MD  estimator,  767
nonlinear  models,  791-2,  797-9,  806-7 
nonstationary  data,  767-8 
transformed  ML  estimator,  766 
true  state  dependence,  763-4 
unobserved  heterogeneity,  764 
weak  exogeneity,  749 

EDF  bootstrap.  See  empirical  distribution  function 
bootstrap 

Edgeworth  expansions,  370-1 
efficient  score,  141 

Eicker-White  robust  standard  errors,  74-5,  80-1,  112,
137,  164,  175

see  also  heteroskedasticity-robust  standard  errors
EM  algorithm.  See  expectation  maximization
empirical  Bayes  method,  442 
empirical  distribution  function  (EDF)  bootstrap,  360 
see  also  paired  bootstrap 
empirical  likelihood,  203-6 
empirical  likelihood  bootstrap,  379-80 
encompassing  principle,  283 
endogeneity 
definition,  92 

due  to  endogenous  stratification,  78,  824-5 
Hausman  test  for,  271-2,  275-6
identification  frameworks  and  strategies,  35-7
see  also  endogenous  regressors;  exogeneity 
endogenous  regressors,  78 
binary,  557,  562 
in  count  models,  683-4,  687-9 
in  discrete  outcome  models,  473 
in  duration  models,  598 
dummy,  557,  562 
inconsistency  of  OLS,  95-6 
in  linear  panel  models,  744-63 
in  linear  simultaneous  equations  model,  23-30 
in  nonlinear  panel  models,  792 
in  potential  outcome  model,  30-3 
returns-to-schooling  example,  69-70
in  selection  models,  559-62 
in  single-equation  models,  30 
see  also  GMM  estimator;  IV  estimator 
endogenous  sampling,  42-5,  78,  822-9,  856 
consistent  estimation,  827-9 
leading  examples,  823 
see  also  censored  models;  endogenous 
stratification;  sample  selection  models 
endogenous  stratification,  820,  826-7,  856 
equation-by-equation  OLS,  210 
equicorrelated  errors,  701,  722-4,  804 
equidispersion,  668,  670 
error  components  model.  See  RE  model 
error  components  SEM,  762 
error  components  SUR  model,  762 
error  components  2SLS  estimator,  760 
error  components  3SLS  estimator,  762 
error  term,  71,  168 
additive,  168 
nonadditive,  168 

errors-in-variables.  See  measurement  error
estimated  asymptotic  variance,  954 
see  also  asymptotic  distribution 
estimated  prediction  error.  See  cross-validation 
estimating  equations  estimator,  13-5 
asymptotic  distribution,  134-5,  174 
clustered  data,  842 
computation,  339 
definition,  134 

generalized,  134,  790,  794,  804 
variance  matrix  estimation,  137-9 
weighted,  829 
see  also  MM  estimator 
Euler  conditions,  171,  749
exact  identification.  See  just  identification 
exchangeable  errors,  701,  804 
exhaustive  sampling,  815-6 
exogeneity,  22-3 

conditional  independence,  23 
Granger  causality,  22 
of  instrument,  106 

overidentifying  restrictions  test  for,  277 
panel  data  assumptions,  700,  748-52,  754,  781


strong  exogeneity,  22 
weak  exogeneity,  22 
exogenous  sampling,  42-3 
exogenous  stratified  sampling,  42,  78,  814-5,  820, 
825,  856

exogenous  regressor.  See  exogeneity 
expectation  maximization  (EM)  algorithm,  345-7 
for  data  imputation,  930-2 
E  (Expectation)  step,  346 
for  finite  mixture  model,  623-5 
M  (Maximization)  step,  346 
compared  to  NR  algorithm,  625 
expected  elapsed  duration,  626 
experimental  data,  48-58 
control  group,  49 
natural  experiments,  54-8 
social  experiments,  48-54 
treatment  group,  49 
explanatory  variables.  See  regressors 
exponential  conditional  mean,  124,  155,  669 
coefficient  interpretation,  124,  162-3,  669 
exponential  distribution,  140,  584-6 
for  generalized  (Cox-Snell)  residual,  631
exponential  family  density,  427 
conjugate  prior  for,  427-8 
see  also  linear  exponential  family 
exponential-gamma  regression  model,  616,  633-4

exponential-IG  regression  model,  634 
exponential  regression  model 

application  with  censored  data,  606-8,  633 
example  with  uncensored  data,  159-63 
extreme  value  distribution.  See  type  1  extreme  value 
extremum  estimator,  124-39 
asymptotic  distribution,  127-31 
consistency,  125-7 
definition,  125 
formal  proofs,  130-2 
informal  approach,  132-3 
statistical  inference,  135-9 
variance  matrix  estimation,  137-9 

factor  analysis,  650 
factor  loadings,  517,  650-1,  689 
factor  model,  517,  648,  686 
Farlie-Gumbel-Morgenstern  copula,  654
fast  simulated  annealing  (FSA)  method,  347-8 
FD  estimator.  See  first-differences 
FE  estimator.  See  fixed  effects 
feasible  generalized  least  squares  (FGLS)  estimator, 
81-3 

asymptotic  distribution,  82 
definition,  82 
example,  84-5 
in  fixed  effects  model,  729 
in  mixed  linear  model,  775 
nonlinear,  155-8 
in  pooled  model,  720-1
feasible  generalized  least  squares  (cont.)

in  random  effects  model,  705,  734-6,  738,  837-9, 
849-51 

as  sequential  two-step  m-estimator,  201 
systems  FGLS,  208-9 

feasible  generalized  nonlinear  least  squares  (FGNLS) 
estimator,  155-8 
asymptotic  distribution,  156 
definition,  156 
example,  159-63 

as  optimal  GMM  estimator,  180-1 
systems  FGNLS,  217 

FGLS  estimator.  See  feasible  generalized  least  squares 
FGNLS  estimator.  See  feasible  generalized  nonlinear 
least  squares 

FIML  estimator.  See  full  information  maximum 
likelihood 

finite  mixture  models,  621-5 
counts,  678-9 
definition,  622 
EM  algorithm,  623-5 
latent  class  interpretation,  623 
number  of  components,  624-5 
panel  data,  786 
see  also  mixture  models 
finite-sample  bias 

of  GMM  estimator,  177 
of  IV  estimator,  108-12 
of  tests,  250-4,  262 
finite-sample  correction  term 

for  sampling  without  replacement,  817 
first-differences  (FD)  estimator,  704-5,  729-31 
application,  710-11,  714 
asymptotic  distribution,  730-1 
compared  to  FE  estimator,  731 
consistency,  730,  764 
definition,  704-5,  730 
IV  estimator,  758 

first-differences  (FD)  model,  704,  729-31,  758 
first-differences  (FD)  transformation,  783-4
fixed  effects  (FE)  estimator,  704,  726-9,  756-9, 
781-5,  791-2 
application,  710-3,  792-5 
asymptotic  distribution,  727-9 
binary  outcome  models,  796-9 
clustered  data,  839-41 
compared  to  DID  estimator,  768 
compared  to  FD  estimator,  729 
as  conditional  ML  estimator,  732 
consistency,  727,  764,  781-2,  784-5 
count  models,  802-8 
definition,  704,  726,  781-4
duration  models,  802 

dynamic  models,  764-6,  791-2,  797-9,  806-7 

as  FGLS  estimator,  729 

Hausman  test  for,  717-9 

identification,  702 

incidental  parameters,  704,  726 


inconsistency,  764,  781-2,  784-5 
IV  estimators,  758 
as  LSDV  estimator,  733 
multinomial  outcome  models,  798 
selection  models,  801 
Tobit  model,  800 

versus  random  effects,  701-2,  715-9,  788 
fixed  effects  (FE)  model,  704,  726-33,  756-9,  781-5, 
791-2 

cohort-level,  772 
clustered  data,  831,  843 
definition,  700,  726 

dynamic  models,  764-6,  791-2,  797-9,  806-7 
endogenous  regressors,  756-9 
identification,  702 
incidental  parameters,  704,  726 
marginal  effects,  702 
nonlinear  models,  781-5,  796-808,  791 
time-varying  regressors,  702
versus  random  effects,  701-2,  715-9,  788 
see  also  fixed  effects  estimators 
fixed  coefficient,  846 

fixed  design.  See  fixed  in  repeated  samples 
fixed  in  repeated  samples,  76-7 
bootstrap  sampling  method,  360 
in  kernel  regression,  312 
Liapounov  CLT,  951
Markov  LLN,  948
Monte  Carlo  sampling  method,  251
fixed  regressors.  See  fixed  in  repeated  samples 
flexible  parametric  models 
count  models,  674-5 
hazard  models,  592 
selection  models,  563 
flow  sampling,  44,  626 

forward  orthogonal  deviations  IV  estimator,  759 
forward  orthogonal  deviations  model,  759 
forward  recurrence  time,  626 
Fourier  flexible  functional  form,  321 
frailty,  612,  662 

see  also  unobserved  heterogeneity 
Frank  copula,  654 
Frechet  bounds,  653-4
frequentist  approach,  421-2,  424,  439-40
FSA  method.  See  fast  simulated  annealing
full  conditional  distributions,  431
see  also  Gibbs  sampler 
full  information  maximum  likelihood  (FIML) 
estimator,  214 
nested  logit  model,  510-2 
nonlinear  models,  219 
functional  approach 

to  measurement  error,  901 
functional  form  misspecification,  91-2 
diagnostics  for,  272-3,  277-8 

gamma  distribution,  585-6,  614 
gamma  function,  586
Gaussian  quadrature,  389-90,  393,  809
Gauss-Hermite  quadrature,  389-90 
Gauss-Laguerre  quadrature,  389-90 
Gauss-Legendre  quadrature,  389-90 
Gauss-Newton  (GN)  algorithm,  345 
example,  348 

GEE  estimator.  See  generalized  estimating  equations 
general  to  specific  tests,  285 
generalized  additive  model,  323,  327 
generalized  cross-validation,  315 
generalized  estimating  equations  (GEE)  estimator, 
790,  794,  804,  809 

generalized  extreme  value  (GEV)  distribution,  508 
see  also  nested  logit  model 

generalized  information  matrix  equality,  142,  145,  264 
generalized  inverse,  261 
generalized  IV  estimator,  187 
generalized  least  squares  (GLS)  estimator,  81-5 
asymptotic  distribution,  82 
definition,  82 
as  efficient  GMM,  179 
example,  84-5 
nonlinear,  155-8 

generalized  linear  models  (GLMs),  149-50,  155 
count  data,  683 
conditional  ML  estimator,  783 
GEE  estimator,  791 
quasi-ML  estimator,  149-50 
see  also  LEF  models 

generalized  method  of  moments  (GMM)  estimator, 
166-222 

asymptotic  distribution,  173-4,  182-3 
based  on  additional  moment  restrictions,  169,  178-9

based  on  moment  conditions  from  economic  theory, 
171 

based  on  optimal  conditional  moment,  179-80 
bootstrap  for,  379-80 
computation,  339 
definition,  173 

endogenous  counts,  683-4,  687-9
with  endogenous  stratification,  827
with  exogenous  stratification,  823-4
examples,  167-71,  178-9 
finite-sample  bias,  177 
identification,  173,  182 
linear  IV,  183-92 
linear  systems,  211-2 
nonlinear  IV,  192-9 

one-step  GMM  estimator,  187,  196,  746,  755 
optimal  GMM,  176 

optimal  moment  condition,  179-81,  188 
optimal  weighting  matrix,  175-6 
panel  data,  744-66,  789-90,  792 
practical  considerations,  219-20 
test  based  on,  245 
two-step,  176,  187,  746,  755 
variance  matrix  estimation,  174-5 


weak  instruments,  177-8 
see  also  panel  GMM  estimator 
generalized  nonlinear  least  squares  (GNLS)  estimator.
See  feasible  generalized  nonlinear  least  squares
generalized  partially  linear  model,  323 
generalized  random  utility  models,  515-6 
generalized  residual,  289-90 
in  duration  models,  631
in  LM  test,  239-40 
plots  of,  633-6 
generalized  Tobit  model,  548 
generalized  Weibull  distribution,  584-6 
genetic  algorithms,  341 

GEV  distribution.  See  generalized  extreme  value 
Geweke,  Hajivassiliou,  Keane  (GHK)  simulator, 
407-8 

for  MNP  model,  518 

GHK  simulator.  See  Geweke,  Hajivassiliou,  Keane 
simulator 

Gibbs  sampler,  448-50 
data  augmentation,  454-5,  933 
example,  452-4

in  latent  variable  models,  514,  519,  563 
see  also  Markov  chain  Monte  Carlo 
GLMs.  See  generalized  linear  models 
GLS  estimator.  See  generalized  least  squares 
GMM  estimator.  See  generalized  method  of  moments 
GN  algorithm.  See  Gauss-Newton 
GNLS  estimator.  See  feasible  generalized  nonlinear 
least  squares 

Gompertz  distribution,  585-6 
Gompertz  regression  model,  606-8 
gradient  methods,  337-48
see  also  iterative  methods 
Granger  causality,  22 
grid  search  methods,  337,  351 
grouped  data.  See  aggregated  data 

Halton  sequences,  409-10 
Hausman  test,  271-4
applications,  719,  850-1 
asymptotic  distribution,  272 
auxiliary  regressions,  273 
bootstrap,  378 

computation,  272-3,  378,  717-9 
definition,  271-2 
for  endogeneity,  271-2,  275-6 
for  fixed  effects,  717-9,  737,  788,  839 
for  multinomial  logit  model,  503 
power,  273-4 

robust  versions,  273,  378,  718-9 
Hausman-Taylor  IV  estimator,  761 
Hausman-Taylor  model,  760-2 
Hawthorne  effect,  53 
hazard  function 

baseline  in  PH  model,  591 
cumulative  hazard,  577-8,  582-4
definition,  576,  578
hazard  function  (cont.)
in  mixture  models,  616-8 
multivariate,  649 

nonparametric  estimator,  581,  583 
parametric  examples,  585 
piecewise  constant,  591 
see  also  duration  models 
Health  and  Retirement  Study  (HRS),  58 
Heckit  estimator.  See  Heckman  two-step  estimator 
Heckman  two-step  estimator 
application,  554 
in  Roy  model,  556 
in  selection  model,  550-1 
semiparametric  estimator,  565-6 
in  Tobit  model,  543,  567-8 
Hessian  matrix 
estimate,  137 

Newton-Raphson  algorithm,  341-2 
singular,  350-1 

heterogeneous  treatment  effects,  882,  885-7 
IV  estimator,  886-7 
LATE  estimator,  885 
RD  design,  882 
heterogeneity 
within-cell,  480 

see  also  unobserved  heterogeneity 
heteroskedastic  errors 

adaptive  estimation,  323,  328 
conditional  heteroskedasticity,  78 
definition,  78 
in  GLMs,  149-50 
in  linear  model,  84-5,  94-5 
multiplicative,  84-5,  86-7 
in  nonlinear  model,  157-63 
residuals,  289-90 
tests  for,  241,  267,  275 
Tobit  MLE  inconsistency,  538 
working  matrix  for,  82-3,  156-8 
heteroskedasticity-robust  standard  errors 
bootstrap,  379-80 
clustered  data,  834 
example,  84-5 

for  extremum  estimator,  137,  164 
intuition,  81 

for  NLS  estimator,  155,  164
for  OLS  estimator,  74-5,  80-1,  112 
panel  data,  705 
for  WLS  estimator,  83 
see  also  robust  standard  errors 
hierarchical  linear  models  (HLMs),  845-8 
Bayesian  analysis,  847 
clustered  data,  845 
coefficient  types,  846-7 
individual-specific  effects,  848 
mixed  linear  models,  774-6,  847 
panel  data,  847-8 
random  coefficients  model,  847 
two-level  model,  846 


hierarchical  models,  429 

Bayesian  analysis,  441-2,  447,  450,  514 
see  also  hierarchical  linear  models 
histogram,  298 

see  also  kernel  density  estimator 
HLM.  See  hierarchical  linear  model 
hot  deck  imputation,  929,  940 
HRS.  See  Health  and  Retirement  Study 
Huber-White  robust  standard  errors,  137,  144,  146
see  also  robust  standard  errors 
hurdle  model,  680-1,  690 
see  also  two-part  model 
hyperparameters,  428,  847 
hypothesis  tests,  223-58 

based  on  extremum  estimator,  224-33 
based  on  ML  estimator,  233-43 
based  on  GMM  estimator,  245 
based  on  m-estimator,  244 
bootstrap,  254-6,  363-8,  372-3,  378-9 
for  common  misspecifications,  274-7,  670-1 
examples,  236,  241-3,  252-4,  254-6,  372-3 
induced  test,  230 

joint  versus  separate,  230-1,  285,  629-30 
power,  247-50,  253-4
size,  246-7,  251-3 

see  also  LM  tests;  LR  test;  Wald  tests;  m-tests
identification 

in  additive  random  utility  models,  504 
in  binary  outcome  models,  476,  483 
bounds  identification,  29 
definitions,  29-31
in  fixed  effects  model,  702 
of  GMM  estimator,  173,  182 
just  identification,  31,  214 
in  linear  regression  model,  71-2 
in  measurement  error  models,  905-14 
in  mixture  models,  618-20 
in  multinomial  probit  model,  517 
in  natural  experiments,  57-8 
observational  equivalence,  29 
order  condition,  31,  213 
overidentification,  31,  214
rank  condition,  31

in  sample  selection  model,  551,  565,  566 
set  identification,  29 

in  simultaneous  equations  model,  29-31,  213-4 
in  single-index  models,  325 
and  singular  Hessian,  351
weak  identification,  100 
see  also  identification  strategies 
identification  strategies,  36-7 
control  function  approach,  37 
exogenization,  36 

incidental  parameter  elimination,  36-7 
instrumental  variables,  37 
matching,  37 
reweighting,  37
identified  reduced  form,  36
IG  distribution.  See  inverse-Gaussian 
ignorable  missingness,  927 

estimator  consistency  if  MCAR,  927 
estimator  inconsistency  if  MAR  only,  927 
problems  if  nonignorable,  940 
weak  exogeneity,  927 
ignorability  assumption,  863 

see  also  conditional  independence  assumption 
importance  sampling,  407-8,  443-5,  518 
accelerated,  409 
GHK  simulator,  407-8 
importance  sampling  density,  444 
importance  sampling  estimator,  444 
importance  weight,  445 
target  density,  444 
imputation  methods,  928-39 
data  augmentation,  454-5,  932-4 
example,  936-8 
hot  deck  imputation,  929 
listwise  deletion,  928 
mean  imputation,  928-9 
multiple  imputation,  934-5 
pairwise  deletion,  928 
regression-based  imputation,  930-2 
imputation  (I)  step,  455,  932 
IM  test.  See  information  matrix  test 
IMSE.  See  integrated  mean  squared  error 
incidental  parameters,  36 

clustered  data  FE  model,  832,  840,  844 
panel  data  FE  model,  704,  726,  781-2,  805 
inclusive  value,  510-1 
incomplete  gamma  function,  586 
incomplete  panels.  See  unbalanced  panels 
independence  of  irrelevant  alternatives,  503,  505,  527 
independent  variables.  See  regressors 
independently-weighted  IV  estimator,  192
independently-weighted  optimal  GMM  estimator,  177
index  function  model 

binary  outcome  model,  475-6,  482-3 
bivariate  probit  model,  522-3 
ordered  multinomial  model,  519-20 
Tobit  model,  536 
see  also  single-index  model 
indicator  function,  298 
indirect  inference,  404-5 
individual-specific  effects  model 
additive,  780 

binary  outcome  models,  795-6 
cluster-specific  effects,  830
count  models,  802-3 
definitions,  700,  780 
duration  models,  802 
multiplicative,  780,  793 
one-way,  700 
parametric,  780 
selection  models,  801 
single-index,  780 


Tobit  models,  800-1 
two-way,  738 

see  also  FE  models;  RE  models 
induced  test,  230 

information  criteria,  278-9,  283-4
Akaike,  278-9,  284,  624 
Bayesian,  278,  284 
consistent  Akaike,  278 
Kullback-Leibler,  147,  169,  278,  280
Schwarz,  278,  284 
information  matrix,  142 

block-diagonal,  144,  240,  329 
information  matrix  equality,  141-2,  145 
generalized,  142,  145 
see  also  BHHH  estimate;  OPG  version 
information  matrix  (IM)  test,  265-6 
bootstrap,  378 
computation,  261-2,  378 
definition,  265 
example,  270 
power,  267 

instrumental  variables  (IV)  estimator 
alternative  estimators,  190-2 
application,  110-2 
definition,  100-1 
example,  102-3 

finite-sample  bias,  108-12,  191-2,  196 

identification,  100,  105-7 

independently-weighted  IV  estimator,  192 

jackknife  IV  estimator,  192 

LIML  estimator,  191,  214 

in  linear  model,  98-112,  183-92,  211-2

linear  IV  as  GMM  estimator,  170,  186 

local  average  treatment  effects  estimator,  883-9 

in  measurement  error  models,  908-10,  912-3 

in  natural  experiments,  54-5 

in  nonlinear  models,  192-9 

in  panel  models,  764-5,  757-61 

quantile  regression,  190 

in  selection  models,  559 

split-sample  estimator,  191-2 

systems  IV  estimator,  211-2,  218-9 

in  treatment  effects  models,  883-9 

two-stage  IV  estimator,  102,  187 

two-stage  least  squares  estimator,  101-2,  187-91 

Wald  estimator,  98-9 

see  also  GMM  estimator;  panel  GMM  estimator 
instruments 

definition,  96-7,  100 
examples,  97-8 
by  exclusion  restriction,  106 
by  functional  form  restriction,  106 
invalid,  100,  105-7 
optimal,  180 

for  panel  data,  750-1,  754-6 
relevance,  108 

weak,  100,  104-12,  177-8,  191-2,  196,  751-2,  756 
see  also  instrumental  variables  estimator 
integrated  hazard  function.  See  cumulative  hazard
function 

integrated  mean  squared  error  (IMSE),  303 
integrated  squared  error  (ISE),  302,  314 
interval  data  models 
definition,  532-3,  579 
ML  estimator,  534-5 
interruption  bias,  626 
intraclass  correlation,  816,  831,  835-8 
inverse-Gaussian  (IG)  distribution,  614-5,  677 
inverse  law  of  probability,  421 
inverse-Mills  ratio,  540-1,  553-4 
inverse  transformation  method,  409,  412-3 
inverse-Wishart  distribution,  443,  453,  514
irrelevant  regressors,  93 
ISE.  See  integrated  squared  error 
iterated  bootstrap,  374 
iterative  methods,  337-48
BFGS,  344
BHHH,  343-4
convergence  criteria,  339-40
DFP,  344,  350-1 

expectation  maximization,  345-7,  623-5,  930-2 
fast  simulated  annealing,  347-8 
Gauss-Newton,  345,  348 
line  search,  338 

Newton-Raphson,  338-9,  341-3,  348 
numerical  derivatives,  340 
simulated  annealing,  347 
starting  values,  340,  351
step  size  adjustment,  338 
IV  estimator.  See  instrumental  variables 

jackknife,  374-6 
bias  estimate,  375 
bias-corrected  estimator,  375 
example,  376 
IV  estimator,  192 
standard  error  estimate,  375,  855 
Jensen’s  inequality,  956 
jittered  data,  290 

joint  duration  distributions,  648-55 
copulas,  651-5
mixtures,  650-1 

multivariate  hazard  function,  649 
multivariate  survivor  function,  649-50 
joint  limits,  767 

joint  versus  separate  tests,  230-1,  285,  629-30 
just  identification,  31,  100,  173 

Kaplan-Meier  (KM)  estimator,  581-3 
application,  575,  583,  604-5 
for  baseline  hazard,  596-7 
confidence  bands  for,  583 
definition,  581 
tied  data,  582 

kernel  density  estimator,  298-306 
alternatives  to,  306 


application,  296-7,  300 

asymptotic  distribution,  301-2,  330-1 

bandwidth  choice,  302-4 

bias,  301,  330-1

confidence  interval  for,  305 

consistency,  300 

convergence  rate,  302 

definition,  299 

derivative  estimator,  305 

examples,  252-3,  367-8 

multivariate,  305-6 

Nadaraya-Watson  kernel  regression  estimator,  312
optimal  bandwidth,  303 
optimal  kernel,  303 
variance,  301,  331 
kernel  functions,  299-300 
comparison,  300 
definition,  299 
higher-order,  299,  306,  313 
leading  examples,  300 
optimal  for  density  estimation,  303 
properties,  299 
kernel  matching,  875,  895-6 
kernel  regression  estimator,  311-9 
alternatives  to,  319-22 
asymptotic  distribution,  313,  331-3 
bandwidth  choice,  314-6 
bias,  313,  331-2 

bootstrap  confidence  interval  for,  380-1 
boundary  problems,  309,  320-1 
conditional  moment  estimator,  317-8 
confidence  interval  for,  316 
consistency,  313 
convergence  rate,  314 
definition,  312 
derivative  estimator,  317 

introduction  to  nonparametric  regression,  307-11
multivariate,  318-9 
optimal  bandwidth,  314 
optimal  kernel,  314 
undersmoothing,  380 
variance,  301,  331 
see  also  nonparametric  regression 
Khinchine’s  theorem,  948 
KLIC.  See  Kullback-Leibler  information  criterion
KM  estimator.  See  Kaplan-Meier 
k-NN  estimator.  See  nearest  neighbors  estimator 
Kolmogorov  LLN,  80,  111,  947
Kolmogorov  test,  267 

Kullback-Leibler  information  criterion  (KLIC),  147,
169,  278,  280

LAD  estimator.  See  least  absolute  deviations 
Lagrange  multiplier  (LM)  test 

asymptotic  distribution,  235,  237-8 
based  on  GMM-estimator,  245 
based  on  m-estimator,  244 
bootstrap,  379 
comparison  with  LR  and  Wald  tests,  238-9
computation,  239-41,  256,  274 
definition,  234-5 
examples,  236,  241-3 
for  heteroskedasticity,  241,  267,  275 
in  duration  models,  632 
interpretation,  239-40
for  omitted  variables,  274 
OPG  version,  240-1 
for  random  effects,  737,  841 
score  test,  234-5 
in  Tobit  model,  544 
for  unobserved  heterogeneity,  630,  636 
see  also  hypothesis  tests 
Laplace  approximation,  390 
Laplace  distribution,  178,  541 
Laplace  transform,  577 

LATE  estimator.  See  local  average  treatment  effects 
latent  class  model,  622 
see  finite  mixture  models 
latent  variable,  475,  532 
latent  variable  models 

additive  random  utility  model,  476-8,  504-7 
binary  outcomes,  475-8 
endogenous,  560-1 
ordered  multinomial  model,  519-20 
see  also  censored  models;  truncated  models 
law  of  iterated  expectations,  955 
law  of  large  numbers  (LLN),  947-8 
definition,  947 
examples  of  use,  80,  129 
Khinchine’s  theorem,  948 
Kolmogorov  LLN,  947 
Markov  LLN,  948 
sampling  schemes,  131,  948 
strong  law,  947 
weak  law,  947 

least  absolute  deviations  (LAD)  estimator 
application,  88-90 
asymptotic  distribution,  88 
binary  outcome  models,  484 
bootstrap,  381 
censored  LAD,  564-5,  808 
definition,  87 
two-stage  LAD,  190 
see  also  quantile  regression 
least-squares  dummy  variable  (LSDV)  estimator,  704, 
732-3,  840 

least-squares  dummy  variable  (LSDV)  model,  704, 
732,  840 

least  squares  (LS)  estimators 
clustered  data,  833-7 
feasible  generalized  LS,  81-3,  155-8 
generalized  LS,  81-5,  155-8 
linear,  70-85 
nonlinear  LS,  150-9 
ordinary  LS,  70-81 
panel  data,  211,  702-3,  720-5 


systems  of  equations,  207-8,  211,  217 
see  also  FGLS;  FGNLS;  OLS;  NLS 
leave-one-out  estimate,  192,  304,  315,  375 
LEF.  See  linear  exponential  family 
length-biased  sampling,  43-4,  626 
Liapounov  CLT,  80,  131,  950 
likelihood-based  hypothesis  tests,  233-43 
comparisons  of,  235-6,  238-9 
definitions,  234-5 
examples,  236-7,  241-3 
see  also  LM  tests;  LR  tests;  Wald  tests 
likelihood  function,  139-41
conditional  likelihood  function,  139,  731-2,  824 
definition,  139 
joint,  19,  824-7 
leading  examples,  140-1 
marginal,  432,  595 
partial,  594-6 

likelihood  principle,  139,  420,  433 
likelihood  ratio  (LR)  test 

asymptotic  distribution,  235,  237 

based  on  GMM-estimator,  245 

based  on  m-estimator,  244 

comparison  with  LM  and  Wald  tests,  238-9 

definition,  234 

examples,  236,  241-3 

nonnested  models,  279-83 

quasi-LR  test  statistic,  244 

uniformly  most  powerful  test,  237 

see  also  hypothesis  tests 

LIML  estimator.  See  limited  information  maximum 
likelihood 

limit  distribution,  948 

see  also  asymptotic  distribution 
limit  variance  matrix,  952-3 
definition,  952 

replacement  by  consistent  estimate,  952 
sandwich  form,  953 

limited  information  maximum  likelihood  (LIML) 
estimator,  191,  214 
Lindeberg-Levy  CLT,  80,  131,  950 
line  search,  338 

linear  exponential  family  (LEF)  models,  147-9 
conjugate  priors,  427-8 
conditional  ML  estimator,  782 
consistency,  148 
leading  examples,  148 
pseudo-R2,  288 
residuals,  289-90 
tests  based  on,  240,  268,  274-5 
see  also  generalized  linear  models 
linear  panel  estimators,  695-778 
application,  708-15,  725 
Arellano-Bond  estimator,  764-5 
between  estimator,  703 
covariance  estimator,  733 
conditional  ML  estimator,  731-2 
differences-in-differences  estimator,  768-70 
error  components  2SLS  estimator,  760 
error  components  3SLS  estimator,  762 
first  differences  estimator,  704-5,  729-31 
first  differences  IV  estimator,  758 
fixed  effects  estimator,  704,  726-9 
fixed  effects  IV  estimators,  757-9 
forward  orthogonal  deviations  IV  estimator,  759 
Hausman-Taylor  IV  estimator,  761 
LSDV  estimator,  704,  732-3 
MD  estimator,  753,  766-7
panel  bootstrap,  377-8,  708,  746,  751
panel  GMM  estimators,  744-68 
panel-robust  inference,  705-8,  722,  745-6,  751 
pooled  OLS  estimator,  702-3,  720-5 
random  effects  estimator,  705,  734-6 
random  effects  IV  estimator,  759-60 
within  estimator,  704,  726-9 
within  IV  estimator,  758 
linear  panel  models,  695-778 

analysis-of-covariance  model,  733 
application,  708-15,  725 
between  model,  702 
dynamic  models,  763-8 
endogenous  regressors,  744-63 
first  differences  model,  704,  730,  758 
fixed  effects  model,  700-2,  726-34,  757-9 
fixed  versus  random  effects,  701-2,  715-9 
forward  orthogonal  deviations  model,  759 
Hausman-Taylor  model,  760-2 
incidental  parameters  problem,  704,  726 
individual  dummies,  699 
individual-specific  effects  model,  700 
LSDV  model,  704,  732 
minimum  distance  estimator,  753,  766-7 
mean-differenced  model,  758 
measurement  error,  739,  905 
mixed  linear  models,  774-6 
pooled  model,  699,  720-5 
random  effects  differenced  model,  760-1 
random  effects  model,  700-2,  734-6,  759-60
residual  analysis,  714-5 
strong  exogeneity,  700,  749-50,  752 
time  dummies,  699 
time-invariant  regressors,  702,  749-51 
time-varying  regressors,  702,  749-51 
two-way  effects  model,  738 
unbalanced  data,  739 
weak  exogeneity,  749,  752,  758 
within  model,  704,  758 
see  also  linear  panel  estimators 
linear  probability  model,  466-7 
linear  programming  methods,  341 
linear  regression  model 
definition,  16-17,  70-1 
linear  systems  of  equations,  207-14 
panel  data  models  as,  211
seemingly  unrelated  regressions,  209-10 


simultaneous  equations,  22-31,  213-4 
systems  FGLS  estimator,  208 
systems  GLS  estimator,  208 
systems  GMM  estimator,  208 
systems  ML  estimator,  214 
systems  OLS  estimator,  211 
systems  2SLS  estimator,  212 
linearization  method,  855 
link  function,  149,  469,  783 
listwise  deletion,  60,  928 

consistency  under  MCAR,  928 
example,  936-8 

inconsistency  under  MAR  only,  928 
Living  Standards  Measurement  Study  (LSMS),  59, 
88-90,  848-53 

LLN.  See  law  of  large  numbers 
LM  test.  See  Lagrange  multiplier  test 
local  alternative  hypotheses,  238,  247-8,  254 
local  average  treatment  effects  (LATE)  estimator, 
883-9 

assumptions,  884-5 
comparison  with  IV  estimator,  885 
definition,  884 

heterogeneous  treatment  effect,  885 
monotonicity  assumption,  885 
selection  on  unobservables,  883 
Wald  estimator,  886 
see  also  ATE;  ATET;  MTE 
local  linear  regression  estimator,  320-1,  333 
local  polynomial  regression  estimator,  320-1 
local  running  average  estimator,  308,  320 
local  weighted  average  estimator,  307-8 
logistic  distribution,  476-7 
logistic  regression.  See  logit  model 
logit  model,  469-70 
application,  464-5
as  ARUM,  477,  486-7 
clustered  data,  844 
definition,  469 

for  discrete-time  duration  data,  602 
GLM,  149 

imputation  example,  937-9 
index  function  model,  476 
marginal  effects,  470 
measurement  error  example,  919 
ML  estimator,  468-9 
multinomial  logit,  494-5,  500-3,  525 
nested  logit,  509-12,  526-7 
ordered  logit,  520 
panel  data,  795-9 
probit  model  comparison,  471-3 
random  parameters  logit,  512-6 
see  also  binary  outcome  models 
log-likelihood  function.  See  likelihood  function 
length-biased  sampling,  43-4 
log-logistic  distribution,  585-6,  592 
log-normal  distribution,  585-6,  592 
log-normal  model,  533,  545-6 
log-odds  ratio,  470,  472
log-sum,  510 

log-Weibull  distribution.  See  type  1  extreme  value 
long  panel,  723-5,  767 
longitudinal  data.  See  panel  data 
loss  function,  66-69 
absolute  error,  67 
asymmetric  expected  error,  67 
Bayesian  decision  analysis,  434-5 
expected,  66 

KLIC,  68,  147,  168,  278-9 
squared  error,  67-9,  156 
step,  67-8 

Lowess  regression  estimator,  320-1 
application,  297,  309-10,  712-5 
LR  test.  See  likelihood  ratio  test 
LS  estimators.  See  least  squares 
LSDV.  See  least-squares  dummy  variable 
LSMS.  See  Living  Standards  Measurement  Study 

MAR.  See  missing  at  random 
marginal  analysis  of  panel  data,  717,  787 
marginal  effects,  122-4

in  binary  outcome  models,  466-5,  467,  470-1 
calculus  method,  123 
computing,  122-4 
definition,  122 
example,  162-3 
finite-difference  method,  123 
in  fixed  effects  model,  702,  788 
in  multinomial  models,  493-4,  501-3,  519-23,  525 
population-weighted,  821 
in  sample  selection  models,  552 
in  single-index  models,  123 
in  Tobit  model,  541-2 
see  also  coefficient  interpretation 
marginal  likelihood,  432,  595 
marginal  treatment  effects  (MTE)  estimator,  886 
market-level  data,  482,  513 
Markov  chain  Monte  Carlo  (MCMC)  methods, 
445-54 

convergence,  449,  458 
in  data  augmentation,  933 
examples,  452-4,  512,  687,  936-9 
Gibbs  sampler,  448-50,  514,  519,  563 
Metropolis  algorithm,  450-1 
Metropolis-Hastings  algorithm,  451-2,  512 
Markov  LLN,  77,  131,  948 
Marshall-Olkin  method,  649-51,  686 
matching  assumption,  864 
see  also  overlap  assumption 
matching  estimators,  871-8,  889-96 
application,  889-96 
assumptions,  863-5 
ATE  matching  estimator,  877 
ATET  matching  estimator,  874,  877,  894-6 
balancing  condition,  893 
caliper  matching,  874 


counterfactuals,  871 
exact  matching,  872,  891 
inexact  matching,  873 
interval  matching,  875-6 
kernel  matching,  875,  895-6 
nearest-neighbor  matching,  875,  894-6 
propensity  score  matching,  873-8,  892 
radius  matching,  876,  895-6 
selection  on  observables  only,  871
stratification  matching,  875-6,  893-6 
variance  computation,  877-8,  895 
maximum  empirical  likelihood  (MEL)  estimator,  206 
maximum  likelihood  (ML)  estimator,  139-46
asymptotic  distribution,  142-3 
conditional  ML  estimator,  731-2,  782-3,  796-9 
consistency,  142,  824 
definition,  141 

endogenous  stratification,  824-7 
example,  143-4 
exogenous  stratification,  824 
MSL  estimator,  393-8 
quasi-ML  estimator,  146-50 
regularity  conditions,  141,  145-6 
restricted,  233 
unrestricted,  233 
variance  matrix  estimation,  144 
weighted  ML  estimator,  828 
see  also  quasi-ML  estimator 
maximum  rank  correlation  estimator,  485 
maximum  score  estimator,  341,  381,  483-4,  800 
maximum  simulated  likelihood  (MSL)  estimator, 
393-8 

asymptotic  distribution,  394-5 
bias-adjusted  MSL,  396-7 
compared  to  MSM,  402-3 
count  model  examples,  677-8,  687,  689 
definition,  394 
example,  397-8 
multinomial  probit  model,  518 
number  of  simulations,  396 
random  parameters  logit  model,  522 
MCAR.  See  missing  completely  at  random 
MD  estimator.  See  minimum  distance  estimator 
mean-differenced  estimator,  783,  805-6 
mean-differenced  model,  758,  783 
mean  imputation,  928,  936-8 
mean  integrated  squared  error  (MISE),  303,  314 
mean-scaling  estimator,  783,  805-6 
mean-square  convergence,  946 
mean  substitution.  See  mean  imputation 
measurement  error 

in  cohort-level  data,  772-3 
in  dependent  variable,  913-4 
in  microdata,  46,  60 
in  panel  data,  739,  905 
in  regressors,  899-922 

see  also  measurement  error  model  estimators; 
measurement  error  models 
measurement  error  model  estimators,  899-922
attenuation  bias,  903-5,  911,  915,  919-20 
bounds  identification,  906-8 
corrected  score  estimator,  916-8 
IV  estimator,  908-10,  912-3 
linear  models,  900-11 
nonlinear  models,  911-20 
OLS  estimator  inconsistency,  902-4
using  additional  moment  restrictions,  909-10 
using  instruments,  908-9 

using  known  measurement  error  variance,  902-3, 
910 

using  replicated  data,  910-1,  913 
using  validation  sample,  911 
measurement  error  models,  899-922 

attenuation  bias,  903-5,  911,  915,  919-20 
classical  measurement  error  model,  901-2 
dependent  variable  measured  with  error,  913-4
examples,  919-20 
identification,  905-14 
linear  models,  900-11
multiple  regressors,  904 
nonclassical  measurement  error,  904,  920 
nonlinear  models,  911-20
panel  models,  905 
scalar  regressor,  903 
serial  correlation,  909 
variance  inflation,  904,  916 
see  also  measurement  error  model  estimators 
median  regression.  See  LAD  estimator 
MEL.  See  maximum  empirical  likelihood 
m-estimator,  118-22 

asymptotic  distribution,  120 
clustered  data,  842-3 
definition,  118-9 
sequential  two-step,  200-2 
simulated  m-estimator,  398-9 
tests  based  on,  244,  263-4 
weighted  m-estimator,  829,  856 
see  also  extremum  estimators 
method  of  moments  (MM)  estimator 
asymptotic  distribution,  134,  174 
definition,  172 
examples,  167 

see  also  estimating  equations  estimator;  GMM 
estimator 

method  of  scoring,  343,  348 
method  of  simulated  moments  (MSM)  estimator,
399-404

asymptotic  distribution,  400-2 
compared  to  MSL,  402-3 
definition,  400 
example,  403 
MNP  model,  497,  518
number  of  simulations,  399 
method  of  simulated  scores  (MSS)  estimator 
for  MNP  model,  519 
method  of  steepest  ascent,  344 


Metropolis  algorithm,  450-1 
Metropolis-Hastings  algorithm,  451-2,  512 
microdata  sets,  58-61 
handling,  59-61 
leading  examples,  58-9 
microeconometrics  overview,  1-17 
midpoint  rule,  388,  391-2 
minimum  chi-square  estimator,  203 

see  also  Berkson’s  minimum  chi-square  estimator 
minimum  distance  (MD)  estimator,  202-3,  753,  766-7 
asymptotic  distribution,  202 
bootstrap  for,  379-80 
covariance  structures,  766-7 
definition,  202 
equally-weighted,  202 
generalized,  222 
indirect  inference,  404-5 
OIR  test,  203 
optimal,  202,  753 
panel  data,  753,  766-7 
relation  to  GMM,  203,  753 
misclassification,  914 
MISE.  See  mean  integrated  squared  error 
missing  at  random  (MAR),  926-7 
definition,  926 

and  ignorable  missingness,  927,  932 
relation  to  MCAR,  927 
missing  completely  at  random  (MCAR),  926-7

definition,  927 

and  ignorable  missingness,  927 
relation  to  MAR,  927
missing  data,  923-41 
deletion  methods,  928 
examples,  924 
ignorable  assumption,  927 
imputation  with  models,  929-41 
imputation  without  models,  928-9 
MAR  assumption,  926-7 
MCAR  assumption,  927 
nonignorable  missingness,  927,  940 
see  also  imputation  methods 
misspecification  tests.  See  specification  tests 
mixed  estimator,  439-41 
mixed  linear  model,  774-6 
Bayesian  methods,  775 
FGLS  estimator,  775 
fixed  parameters,  774 
ML  estimator,  776 
random  parameters,  774 
restricted  ML  estimator,  776 
nonstationary  panel  data,  767-8 
prediction,  776 

see  also  hierarchical  linear  model 
mixed  logit  model,  500-3 
example,  495 
definition,  500 
see  also  RPL  model 
mixed  proportional  hazards  (MPH)  model,  611-25

Weibull-gamma  mixture,  615 
see  also  mixture  models 
mixture  hazard  function,  616-8 
mixture  models,  611-25 
application,  623-6 
counts,  675-9 
durations,  611-25 
identification,  618-20 
MSL  estimator,  393-8,  687 
multinomial  outcomes,  515-6 
multiplicative  heterogeneity,  613 
specification  tests,  628-32 
see  also  finite  mixture  models;  unobserved 
heterogeneity 

ML  estimator.  See  maximum  likelihood 
MM  estimator.  See  method  of  moments 
MNL  estimator.  See  multinomial  logit 
MNP  estimator.  See  multinomial  probit 
model  diagnostics,  287-91 
binary  outcome  models,  473-4 
duration  models,  628-32 
example,  290-1 

multinomial  outcome  models,  499 
pseudo-R2  measures,  287-9,  291 
residual  analysis,  289-91 
see  also  model  selection  methods 
model  misspecification,  90-4 

see  also  endogeneity;  functional  form 

misspecification;  heterogeneity;  omitted  values; 
pseudo-true  value 
model  selection  methods 
Bayesian,  456-8 
nested  models,  278-81 
nonnested  models,  278-84 
order  of  testing,  285 

see  also  model  diagnostics;  specification  tests 
moment-based  simulation  estimators,  398-404

see  MSL  estimator;  MSM  estimator 
moment-based  tests.  See  m-tests 
moment  matching.  See  indirect  inference 
Monte  Carlo  integration,  391-2 
direct,  391 
example,  392 

importance  sampling,  407,  443-5 
simulators,  393-4,  406-10
see  also  quadrature 
Monte  Carlo  studies,  250-4
example,  251-4
moving  average  estimator,  308 
moving  blocks  bootstrap,  373,  381 
MPH  model.  See  mixed  proportional  hazards 
MSL  estimator.  See  maximum  simulated  likelihood 
MSM  estimator.  See  method  of  simulated  moments 
MSS  estimator.  See  method  of  simulated  scores 
MTE.  See  marginal  treatment  effects 


m-tests,  260-71

asymptotic  distribution,  260,  263 
auxiliary  regressions,  261-3 
bootstrap,  261,  379 

chi-square  goodness  of  fit,  266-7,  270-1,  474

conditional  moment  test,  264-5,  267-9,  319 
CM  test  interpretation,  268 
computation,  261-3 
definition,  260 
Hausman  test,  271-4,  717-9
information  matrix  tests,  265-6,  270 
outer-product-of-the-gradient  form,  262 
overidentifying  restrictions  test,  181,  183,  267, 
747 

power,  268 
rank,  261 

multicollinearity,  350-1 

in  multinomial  probit  model,  517 
in  panel  model,  752 
in  sample  selection  model,  542,  551 
multilevel  models.  See  hierarchical  models 
multinomial  logit  (MNL)  model,  500-3,  525 
application,  494-5 

as  additive  random  utility  model,  505 
definition,  500 

marginal  effects,  494,  501-3,  525 
ML  estimator,  501 
panel  data,  798 

see  also  multinomial  outcome  models 
multinomial  outcome  models,  490-528 
application,  491-5 
alternative-invariant  regressors,  498 
alternative-varying  regressors,  497 
conditional  logit,  500-3,  524-5 
definition,  496-7 
identification,  504 
index  function  model,  519-20 
marginal  effects,  501-3,  524-5 
mixed  logit,  500-3 
ML  estimator,  496,  501 
multinomial  logit,  500-3,  525 
multinomial  probit,  516-9 
ordered  models,  519-20 
OLS  estimator,  471 
panel  data,  798 

random  parameters  logit,  512-6 
random  utility  model,  504-7 
semiparametric  estimation,  523-4 
multinomial  probit  (MNP)  model,  516-9 
Bayesian  methods,  519
definition,  516-7 
identification,  517 
ML  estimator,  518 
MSL  estimator,  518 
MSM  estimator,  518 
MSS  estimator,  518 
see  also  multinomial  outcome  models 
multiple  duration  spells,  655-8
fixed  effects,  656 
lagged  duration  dependence,  657 
ML  estimator,  658 
random  effects,  657 
recurrent  spells,  655 
multiple  imputation,  934-9 
estimator,  934 
examples,  935-9 
relative  efficiency,  935 
variance  of  estimator,  934-5 
multiple  treatments,  860 
multiplicative  errors 
multistage  surveys,  41-2,  814-6,  853-6 
variance  estimation,  853 
multivariate  data 

binary  outcomes,  521-3 
counts,  685-7 
durations,  640-64 
see  also  systems  of  equations 
multivariate-t  distribution,  442

NA  estimator.  See  Nelson-Aalen

National  Longitudinal  Survey  (NLS),  58,  110-2 

National  Longitudinal  Survey  of  Youth  (NLSY),  58-9

National  Supported  Work  (NSW)  demonstration 
project,  889-95 
natural  conjugate  pair,  427-8 
natural  experiments,  32,  54-8 
definition,  54 

differences-in-differences  estimator,  55-7,  768-70, 
878-9 

examples,  54 
exogenous  variation,  54-5 
identification,  57-8 
instrumental  variables,  54-5 
regression  discontinuity  design,  879-83 
ncp.  See  noncentrality  parameter 
nearest  neighbors  (k-NN)  estimator,  319-20 
definition,  319 
example,  308-9 
symmetrized,  308,  320 
see  also  nonparametric  regression 
nearest-neighbor  matching,  875,  894-6 
negative  binomial  distribution,  675 
negative  binomial  model,  675-7 
application,  690 
bivariate,  215,  686-7 
hurdle  model,  681 
ML  estimator,  677 
MSL  estimator,  677-8 
NB1  variant,  676 
NB2  variant,  676 
panel  data,  804,  806 

negative  hypergeometric  distribution,  806 
neglected  heterogeneity.  See  unobserved 
heterogeneity 


Nelson-Aalen  (NA)  estimator,  582-4
application,  605-6,  662 
confidence  bands  for,  584 
definition,  582 
tied  data,  582 
nested  bootstrap,  374,  379 
nested  logit  model,  507-12,  526-7 
from  ARUM,  526-7 
definition,  510-1
different  versions  of,  511-2 
example,  511 
GEV  model,  508,  526 
ML  estimator,  510 
sequential  estimator,  510 
welfare  analysis,  510 
see  also  multinomial  models 
nested  models,  278,  281
see  also  nonnested  models 
neural  network  models,  322 
Newey-West  robust  standard  errors,  137,  175,  723

definition,  175 

see  also  robust  standard  errors 
Newton-Raphson  (NR)  method,  341-3 
examples,  338-9,  348 

NLFIML  estimator.  See  nonlinear  full-information 
maximum  likelihood 

NLS  estimator.  See  nonlinear  least  squares 
NLSY.  See  National  Longitudinal  Survey  of  Youth 
NL2SLS  estimator.  See  nonlinear  two-stage  least 
squares 

NL3SLS  estimator.  See  nonlinear  three-stage  least 
squares 

noise-to-signal  ratio,  903 
noncentral  chi-square  distribution,  248 
noncentrality  parameter  (ncp),  248 
nonclassical  measurement  error,  904,  920 
nongradient  methods,  337,  341,  347-8 
nonignorable  missingness,  927,  940 
attrition  bias  due  to,  940 
selection  bias  due  to,  927,  932,  940 
nonlinear  estimators 

coefficient  interpretation,  122-4
extremum  estimator 
m-estimator,  118-22 
GMM  estimator,  166-222 
ML  estimator,  139-46 
NLS  estimator,  150-9 
overview,  117-22 
panel  models,  779-810 

nonlinear  full-information  maximum  likelihood 
(NLFIML)  estimator,  219 
nonlinear  GMM  estimator,  192-9 
asymptotic  distribution,  194-5 
definition,  194-5 
example,  197-8,  199,  688 
instrument  choice,  196 
NL2SLS  estimator,  196 
optimal,  195
panel  data,  789-90 
nonlinear  in  parameters,  27 
nonlinear  in  variables,  27 
nonlinear  IV  estimator.  See  nonlinear  GMM 
nonlinear  least  squares  (NLS)  estimator,  150-9 
asymptotic  distribution,  152-4 
consistency,  152-3 
definition,  151 
example,  155,  159-64 
time  series,  158-9 
variance  matrix  estimation,  154-5 
nonlinear  panel  estimators,  779-810 
application,  792-5 

conditional  ML  estimator,  781-2,  805 
dummy  variable  estimator,  784-5,  800,  805 
first-differences  estimator,  783-4 
fixed  effects  estimator,  783-5,  794,  796-802,  805-8 
GEE  estimator,  790,  794,  804 
mean-differenced  estimator,  783,  805-6 
mean-scaling  estimator,  783,  805-6 
ML  estimator,  785-6 
NLS  estimator,  787,  794 
panel  GMM  estimator,  789-90 
panel-robust  inference,  788-91 
quadrature,  785-6,  796,  800 
quasi-differenced  estimator,  783-4 
quasi-ML  estimator,  791 
random  effects  estimator,  785-6,  794-6,  800-1, 
803-4 

selection  models,  801 
semiparametric,  808 
nonlinear  panel  models,  779-810 
application,  792-5 
binary  outcome  models,  795-6 
conditional  mean  models,  780-1 
count  models,  792-5,  802-6 
dynamic  models,  791-2,  797-9,  806-7 
endogenous  regressors,  792 
exogeneity  assumptions,  781 
finite  mixture  models,  786 
fixed  effects  models,  781-5,  791-2 
fixed  versus  random  effects,  788 
incidental  parameters  problem,  781-2,  805 
individual-specific  effects  models,  780-1 
parametric  models,  780,  782-3,  785-7,  792 
pooled  models,  787,  794 
random  effects  models,  785-6,  792 
selection  models,  801 
semiparametric,  808 
Tobit  models,  800-1 
transition  models,  801-2 
nonlinear  regression  model,  151 
additive  error,  168,  193,  217 
nonadditive  error,  168,  193,  218 
nonlinear  systems  of  equations,  214-9 
additive  errors,  217 
copulas,  651-5 


mixtures,  650-1 
ML  estimator,  215-6 
NLFIML  estimator,  219 
NL3SLS  estimator,  219 
nonadditive  errors,  217-8 
nonlinear  panel  model,  216 
nonlinear  SUR  model,  216 
quasi-ML  estimator,  150 
seemingly  unrelated  regressions,  216 
simultaneous  equations,  219 
systems  FGNLS  estimator,  217 
systems  GMM  estimator,  219 
systems  IV  estimator,  218-9 
systems  MM  estimator,  218 
systems  NLS  estimator,  217 
nonlinear  three-stage  least  squares  (NL3SLS) 
estimator,  219 

nonlinear  two-stage  least  squares  (NL2SLS)  estimator 
asymptotic  distribution,  195-6 
definition,  195-6 
example,  199 

see  also  nonlinear  GMM  estimator 
nonnested  models 
Cox  LR  test,  279-80 
definition,  278 
example,  283-4 

information  criteria  comparison,  278-9 
overlapping,  281 
strictly  nonnested,  281 
Vuong  LR  test,  280-3 

nonparametric  bootstrap.  See  paired  bootstrap 
nonparametric  density  estimation.  See  kernel  density 
estimator 

nonparametric  maximum  likelihood  (NPML) 
estimator,  622 

nonparametric  regression,  307-22 
convergence  rate,  311,  314
kernel,  311-9 
local  linear,  320 
local  weighted  average,  307-8 
Lowess,  320 

nearest-neighbors,  308-9,  319-20 
series,  321 

statistical  inference  intuition,  309-11
test  against  parametric  model,  319 
see  also  semiparametric  regression 
nonrandomly  varying  coefficient,  846 
normal  copula,  654 
normal  distribution,  140 

truncated  moments,  540,  566-7 
normal  limit  product  rule.  See  Cramer  linear 
transformation 

NPML  estimator.  See  nonparametric  maximum 
likelihood 

NR  method.  See  Newton-Raphson  method 
NSW  demonstration  project.  See  National  Supported 
Work 

nuisance  parameters.  See  incidental  parameters 




numerical  derivatives,  340,  350 
numerical  integration.  See  quadrature 

observational  data,  40-8,  814-7 
biased  samples,  42-5 
clustering,  42 

identification  strategies,  36-7 
measurement  error,  46 
missing  data,  46 
population,  40 
sample  attrition,  47 
sampling  methods,  40-4,  815-7 
sampling  units,  41,  815 
sampling  without  replacement,  816-7 
survey  methods,  41-2,  814-7
survey  nonresponse,  45-6 
types  of  data,  47-8 
observational  equivalence,  29 
odds  ratio,  470 

see  also  posterior  odds  ratio 
OIR  test.  See  overidentifying  restrictions  test 
OLS  estimator.  See  ordinary  least  squares 
omitted  variables  bias,  92-3,  700,  716 
LM  tests  for,  274 

one-step  GMM  estimator,  187,  196 
panel,  746,  755 

see  also  two-stage  least  squares 
one-way  individual-specific  effects  model.  See
individual-specific  effects  model 
on-site  sampling,  43,  823 
optimal  Bayesian  estimator,  434 
optimal  GMM  estimator,  176,  179-81,  187,  195 
compared  to  2SLS,  187-8 
optimal  MD  estimator,  202,  753 
OPG.  See  outer-product  of  the  gradient 
Orbit  model,  914 
order  of  magnitude,  954 
ordered  logit  model,  520,  682 
ordered  multinomial  models,  519-20 
ordered  probit  model,  520,  535 
ordinary  least  squares  (OLS)  estimator,  70-81 
asymptotic  distribution,  73-4,  80-1 
bias  in  standard  errors  with  clustering,  836-7 
binary  data,  471
clustered  data,  833-7 

coefficient  interpretation  in  misspecified  model, 
91-2 

consistency,  72,  80

definition,  71

example,  84-5 

finite-sample  distribution,  79 

heteroskedasticity-robust  standard  errors,  74-5,  81 

identification,  71-2 

inconsistency,  91,  95-6 

inefficiency,  80 

nonlinear,  150-9 

panel  data,  702-3,  720-5 

see  also  least  squares  estimators 


orthogonal  polynomials,  321,  329,  390 
definition,  390

orthogonal  regression  approach,  920 
orthonormal  polynomials,  321,  329,  390 
outcome  equation,  547,  867 
outer  product  (OP)  estimate,  138,  241,  395 
outer-product  of  the  gradient  (OPG)  version 
LM  test,  240-1 
m-test,  262-4

small-sample  performance,  262 
overdispersion,  670-1,  674-6,  690 
measurement  error,  915-6 
panel  data,  794,  806 
tests  for,  671

overidentification,  31,  100,  173,  176,  379-80,  747 
see  also  GMM  estimator 
overidentifying  restrictions  (OIR)  test 
asymptotic  distribution,  181,  183 
bootstrap,  379-80 
definition,  181,  267,  277 
panel  data,  747,  756 
overlap  assumption,  864,  871
in  RD  design,  881 
oversampling,  41,  478-9,  814,  872 

paired  bootstrap,  360,  366-8,  376,  378 
pairwise  deletion,  928 

biased  standard  errors,  928 
panel  attrition,  739,  801 
panel  bootstrap,  377,  707,  746,  751,  789 
panel  data,  47 

panel  data  models  and  estimators,  695-810 
comparison  to  clustered  data,  831-2 
see  also  linear  panel;  nonlinear  panel 
panel  GMM  estimators,  744-68,  789-90 
application,  754-6
Arellano-Bond  estimator,  765-6 
asymptotic  distribution,  745-6 
bootstrap,  389-90 
compared  to  MD  estimator,  753 
computation,  751-2 
definition,  745 
efficiency,  747,  756 
exogeneity  assumptions,  748-52 
instruments,  744,  747-51
IV  estimators  for  FE  model,  757-9 
IV  estimators  for  RE  model,  759-60 
just-identified,  745 
nonlinear,  789-90 
OIR  test,  747,  756 
one-step  GMM  estimator,  746,  755 
overidentified,  745 
2SLS  estimator,  746,  755 
two-step  GMM  estimator,  746,  755 
variance  matrix  estimation,  751 
panel  GMM  model,  744-66 
application,  754-6
dynamic,  763-6 
with  individual-specific  effects,  750-62
without  individual-specific  effects,  744-53 
see  also  panel  GMM  estimators 
panel  IV  estimators.  See  panel  GMM  estimators 
panel-robust  statistical  inference,  377,  705-7,  722, 
746,  751,  788-90
for  Hausman  test,  718 

Panel  Study  in  Income  Dynamics  (PSID),  58,  889 
parametric  bootstrap,  360 
Pareto  distribution 
of  the  first  kind,  609 
of  the  second  kind,  616 
partial  additive  model,  323 
partial  equilibrium  analysis,  53,  862,  972 
see  also  SUTVA 
partial  F-statistic,  105,  109,  111 
partial  likelihood  estimator,  594-6 
partial  ML  estimator,  140 
partial  R-squared,  104-5,  111 
partially  linear  model,  323-5,  327,  565,  684 
participation  equation,  547,  551 
Pearson  chi-square  goodness-of-fit  test,  266 
Pearson  residual,  289,  291 
peer-effects  model,  832 
percentile,  86 

percentile  method,  364-5,  367-8 
percentile-t  method,  364,  366-7
PH  model.  See  proportional  hazards 
piecewise  constant  hazard  model,  591 
Pitman  drift,  248 

PML  estimator.  See  pseudo-ML  estimator 
Poisson  distribution,  668 
Poisson-gamma  mixture,  675 
Poisson-IG  mixture,  677 
Poisson  regression  model,  666-74 
application,  671-4,  690,  792-5,  850-3
asymptotic  distribution  of  estimators,  668-9 
bivariate,  686 
censored  MLE,  535 
with  clustered  data,  844,  850-3 
coefficient  interpretation,  669 
definition,  668 
equidispersion,  668 
example,  117-8,  121-2 
LEF  density,  148 
measurement  error,  915-8 
mixtures,  675-9 
ML  estimator,  668 
overdispersion,  670-1 
panel  data,  792-5,  802-6 
quasi-ML  estimator,  668-9,  682-3 
truncated  MLE,  535 
underdispersion,  671
zero-truncated,  680 
see  also  count  models 
polynomial  baseline  hazard,  591,  636 
pooled  cross-section  time  series  model.  See  pooled 
model 


pooled  estimators,  702-3,  720-5 
application,  710-2,  725 
FGLS  estimator,  720-1 
GEE  estimator,  790,  794 
NLS  estimator,  794 
OLS  estimator,  211,  702-3,  720-5 
WLS  estimator,  702-3,  721 
pooled  model,  699,  720-5,  787-8 
pooling  tests,  737 

population-averaged  model.  See  pooled  model 
population  moment  conditions 
for  estimation,  172 
for  testing,  260 

see  also  GMM  estimator;  MM  estimator;  m-tests 
posterior  distribution,  421,  430-4 
asymptotic  behavior,  432-4 
conditional  posterior,  431
definition,  421 
expected  posterior  loss,  434 
expected  posterior  risk,  434 
full  conditional  distribution,  431
highest  posterior  density  interval,  431
highest  posterior  density  region,  431
marginal  posterior,  430 
observed-data  posterior,  930 
posterior  density  interval,  431
posterior  mean,  423,  434 
posterior  mode,  433 
posterior  moments,  430 
posterior  precision,  423 
see  also  Bayesian  methods 
posterior  odds  ratio,  456 
posterior  (P)  step,  455,  933 
potential  outcome  model,  30-4,  861-5 

see  also  treatment  effects;  treatment  evaluation 
power  of  tests,  247-50,  253-4
bootstrapped  tests,  372-3
conditional  moment  test,  267-9
example,  253-4
Hausman  test,  273-4
local  alternative  hypotheses,  247-8 
uniformly  most  powerful  test,  237 
Wald  tests,  248-50 
precision  parameter,  423 

predetermined  instruments.  See  weak  exogeneity 
prediction,  66-70 
best  linear,  70 
conditional,  66 
error,  66-70 

in  linear  panel  models,  738 
in  mixed  linear  model,  774-6 
optimal,  66-70 
rotation  groups,  814 
in  structural  model,  28 
weighted,  821 
pretest  estimator,  285 
primary  sampling  units  (PSUs),  41,  815,  845-55
prior  distribution,  425-30
conjugate  prior,  427 
definition,  420 
Dickey’s  prior,  439 
diffuse  prior,  426 
flat  prior,  426 

hierarchical  priors,  428-9,  441-2 
improper  prior,  426 
informative  prior,  437-9 
Jeffreys’  prior,  426 
noninformative  prior,  425,  435-7 
normal-gamma  prior,  437 
sensitivity  analysis  for,  429-30 
see  also  Bayesian  methods 
probit  model,  470-71 
application,  465-6 

as  additive  random  utility  model,  477
bivariate  probit,  522-3 
bootstrap  example,  254-6 
definition,  470 

discrete-time  duration  data,  602 

as  GLM,  149 

index  function  model,  476 

logit  model  comparison,  471-3 

marginal  effects,  467,  471 

ML  estimator,  470 

Monte  Carlo  study  example,  251-4 

multinomial  probit,  516-9 

ordered  probit,  520,  535 

panel  data,  795-6 

simultaneous  equations  probit,  523,  560-1 
see  also  binary  outcome  models 
probit  selection  equation,  548 
product  copula,  654 
product  integral,  578 
product  rule,  949 

see  also  Cramer  linear  transformation 
program  evaluation.  See  treatment  evaluation 
projection  pursuit  model,  323 
propensity  score,  864-5 
application,  893-4
balancing  condition,  864,  893-4
conditional  independence  assumption,  865 
definition,  864 
matching,  873-8,  892 
see  also  treatment  evaluation 
proportional  hazards  (PH)  model,  592-7 
application,  605-7 

baseline  survivor  function  estimator,  596-7 
coefficient  interpretation,  606-7 
competing  risks  model,  645-6 
definition,  591 
discrete-time  model,  600-3 
leading  examples,  585 
mixed  PH,  611-25 
panel  data,  802 

partial  likelihood  estimator,  594-6 
pseudo-ML  estimator  (PML).  See  quasi-ML  estimator 


pseudo  panels,  771-3 
cohort,  771

cohort  fixed  effects,  772-3 
measurement  error,  772-3 
pseudo-random  number  generators,  410-6,  957-9 
accept-reject  methods,  413-4 
composition  methods,  415 
inverse  transformation  method,  413 
leading  distributions,  957-9 
multivariate  normal,  416 
transformation  method,  413 
uniform  variates,  412 
see  also  MCMC  methods 
pseudo  R-squared  measures 

for  binary  outcome  models,  473-4 
definitions,  287-9 
example,  290-1 

for  multinomial  outcome  models,  499 
pseudo-true  value,  94,  132,  146,  281 
PSID.  See  Panel  Study  in  Income  Dynamics 
PSUs.  See  primary  sampling  units 
pure  exogenous  sampling,  825 
p-value,  226,  229,  234,  286,  363 

quadrature,  388-90 
Gaussian,  389-90 
multidimensional,  393 
in  nonlinear  panel  models,  785-6,  796,  800 
see  also  Monte  Carlo  integration 
qualitative  response  models.  See  binary  outcomes,
multinomial  outcomes 
quantile,  86-7 
quantile  regression,  85-90 
application,  88-90 
asymmetric  absolute  loss,  68,  85 
asymptotic  distribution,  88 
bootstrap,  381 
computation,  341 
definition,  87 
IV  estimator,  190 

multiplicative  heteroskedasticity,  86-7 
quasi-difference,  783-4
quasi-experiment.  See  natural  experiment 
quasi-maximum  likelihood  (QML)  estimator,  146-50 
asymptotic  distribution,  146 
in  binary  outcome  models,  469 
in  clustered  models,  842-3 
definition,  146 
in  LEF,  147-9 

with  multivariate  dependent  variable,  150 
in  nonlinear  systems,  216 
in  panel  models,  768,  786 
in  Poisson  model,  668-9,  682-3 
quasi-random  numbers.  See  pseudo-random  numbers 
QML  estimator.  See  quasi-ML  estimator 

random  assignment,  49-50,  862 
see  also  sampling  schemes 
random  coefficients  model,  94,  385,  774-6,  786
see  also  hierarchical  models 
random  effects  (RE)  estimator,  705,  734-6,  759-62, 
785-6 

application,  710-1,  725 
asymptotic  distribution,  735 
clustered  data,  837-9,  843-4
consistency,  699,  764 
definition,  705,  734 
error  components  2SLS  estimator,  760 
error  components  3SLS  estimator,  762 
FGLS  estimator,  734-6 
GEE  estimator,  790,  794,  804 
Hausman  test,  717-9 
incidental  parameters,  704,  726 
IV  estimators,  759-60 

ML  estimator,  736,  785-6,  794-7,  800-1,  803-4
NLS  estimator,  787,  794 
quasi-ML  estimator,  791 
two-way  effects  model,  738 
versus  fixed  effects,  701-2,  715-9 
random  effects  (RE)  model,  700-2,  734-6,  759-62, 
785-6 

binary  outcome  models,  795-6 
Chamberlain  model,  719,  786 
clustered  data,  831,  843-4 
count  models,  794,  803-4
definition,  700,  734 
dynamic  models,  792 
duration  models,  801-2 
endogenous  regressors,  756-7,  759-62 
Mundlak  model,  719 
nonlinear  models,  785-6 
selection  models,  801 
Tobit  model,  800-1 
two-way  effects  model,  738 
versus  fixed  effects,  701-2,  715-9
see  also  hierarchical  models;  random  effects 
estimator 

random  number  generators.  See  pseudo-random 
numbers 

random  parameters  logit  (RPL)  model,  512-6 
Bayesian  methods,  514 
definition,  513 
ML  estimator,  513-4

random  parameters  model.  See  random  coefficients 
model 

random  utility  models.  See  ARUM 
randomization  bias,  53,  867 
randomized  experiment,  50-3 

National  Supported  Work  demonstration  project, 
889 

randomized  trials,  49-53 

randomly  varying  coefficient,  847-8 

rank  condition  for  identification,  31,  182,  214 

rank-ordered  logit  model,  521 

rank-ordered  probit  model,  521 

raw  residual,  289,  291 


RD  design.  See  regression  discontinuity  design 
receiver  operators  characteristics  (ROC)  curve,  474 
reduced  form,  21,  25,  213 
see  also  structural  model 
RE  estimator.  See  random  effects 
regression-based  imputation,  930-2 
EM  algorithm,  932 
nonignorable  missingness,  932 
regression  discontinuity  (RD)  design,  879-83 
fuzzy  RD  design,  882 
heterogeneous  treatment  effects,  882 
RD  estimator,  882-3 
sharp  RD  design,  880-1 
treatment  assignment  mechanism,  879-81 
regressors,  71

alternative-varying,  478,  497-8 

endogenous,  23-33 

fixed,  76-7 

irrelevant,  93 

omitted,  92-3 

stochastic,  77 

time-varying,  597-600,  702,  749-51
see  also  endogenous  regressors 
regularity  conditions  for  ML,  141-2,  151-6 
relative  risk,  470,  503 
reliability  ratio,  903 
renewal  function,  626 
renewal  process,  626,  638 
repeated  cross  section  data,  47,  770-3 
see  also  differences-in-differences 
repeated  measures.  See  panel  data 
replicated  data,  910-1,  913 
RESET  test,  277-8 
residual  analysis 
definitions,  289-90 
duration  data,  633-6 
example,  290-1 
panel  data,  714-5 
small-sample  correction,  289 
residual  bootstrap,  361 
response-based  sampling,  43 
restricted  ML  estimator,  233,  776 
revealed  preference  data,  498,  516 
ridge  regression  estimator,  440 
Robinson  difference  estimator,  324-5,  565 
robust  sandwich  variance  matrix  estimate.  See 
sandwich  variance  matrix 
robust  standard  errors 
bootstrap,  362-3,  376-8 
Eicker-White,  74-5,  80-1,  112,  137
for  extremum  estimator,  137-9 
Huber-White,  137,  144,  146 
Newey-West,  137,  175,  723 
see  also  cluster-robust;  heteroskedasticity-robust; 
panel-robust;  systems-robust 
ROC  curve.  See  receiver  operators  characteristics 
curve 

rotating  panels,  739 




Roy  model,  555-7,  562 
definition,  556 

dummy  endogenous  variable,  557 
Heckman  two-step  estimator,  556 
ML  estimator,  556 

panel  semiparametric  estimation,  808 
as  treatment  effects  model,  867 
RPL  model.  See  random  parameters  logit 
R-squared,  287 
pseudo,  287-9 
uncentered,  241,  263 
running  mean  estimator,  308 

SA  method.  See  simulated  annealing 
sample  attrition,  47 
sample  moment  conditions 

see  population  moment  conditions 
sample  selection  bias,  44-5 
sample  weights,  817-21,  853-6 
see  also  weighting 
sampling  schemes 
assumptions  for  OLS,  76-78 
case-control,  479,  823 
choice-based  sampling,  43,  478-9,  823 
endogenous  sampling,  42-5,  78,  822-9,  856 
endogenous  stratified  sampling,  78,  820,  825-6, 
856 

exogenous  stratified  sampling,  42,  78,  814-5,  820, 
825, 856 

fixed  in  repeated  samples,  76-7 
flow  sampling,  44,  626 
multi-stage  surveys,  41-2,  814-6,  853-6 
on-site  sampling,  43,  823 
simple  random  sampling,  41,  76-7,  816 
stock  sampling,  44,  626-7 
with  replacement,  816 
without  replacement,  816-7 
sandwich  variance  matrix 
clustered  data,  834,  842 
extremum  estimator,  132,  137-9 
GMM  estimator,  175 
ML  estimator,  144,  148 
NLS  estimator,  150 
OLS  estimator,  74 
panel  data,  705-7,  722,  746,  751 
for  Wald  test,  277 
see  also  robust  standard  errors 
Sargan  test,  277 

see  also  overidentifying  restrictions  test 
scale  parameter,  509 
scanner  data,  499 
Schwarz  criterion.  See  BIC 
SCLS  estimator.  See  symmetrically  censored  least 
squares 

score  test.  See  Lagrange  multiplier  test
score  vector,  141 

secondary  sampling  units  (SSUs),  41,  815,  854 
seed,  411 


seemingly  unrelated  regressions  (SUR)  model, 
209-10,  216 

Bayesian  MCMC  example,  452-4
count  data,  685 
error  components,  762 
nonlinear,  216 
selection  bias,  445 

nonignorable  missingness,  927,  932,  940 
treatment  effects  models,  867-71
see  also  selection  models 
selection  models,  546-62 

bivariate  sample  selection  model,  547-53 
count  models,  680 
example,  553-5 
panel  data,  801 
Roy  model,  555-7,  867 
sample  selection,  546 
self  selection,  546 
semiparametric  estimation,  565-6 
structural  models,  558-62 
treatment  effects  model,  862-4 
versus  selection  on  observables  only,  552-3,  864, 
868-71 

versus  two-part  models,  546,  552-3 
see  also  Tobit  models 

selection  on  observables  only,  552-3,  862-4,  868-9, 
878-83,  889-96

compared  to  selection  models,  552-3,  864,  871 
conditional  independence  assumption,  868 
control  function  estimator,  869 
definition,  868-9 
DID  estimator,  878-9 
RD  design  estimator,  879-83 
treatment  effects  model,  862-4,  889-96 
selection  on  unobservables,  552-3,  865-71,  883-9 
definition,  868 

in  treatment  effects  model,  862-4
IV  estimators,  883-9 
Roy  model,  867 
selection  bias,  867-71
selection  model,  552-3 
self-weighting  sample,  818
SEM.  See  simultaneous  equations  model 
seminonparametric  ML  estimator,  328-9,  485 
semiparametric  efficiency  bounds,  323,  329-30,  485 
semiparametric  estimators,  322-30 
adaptive,  323 
application,  565 

average  derivative  estimator,  326 
efficiency  bounds,  323,  329-30 
nonparametric  FGLS,  328 
Robinson  difference  estimator,  324-5,  565 
semiparametric  least  squares,  327,  483 
seminonparametric  ML  estimator,  328-9,  485 
see  also  semiparametric  models 
semiparametric  heterogeneity  model,  622 
see  also  finite  mixture  models 
semiparametric  least  squares,  327,  483 
semiparametric  ML  estimator,  328-9,  485
semiparametric  models,  322-30 
additive  models,  327 
binary  outcome  models,  482-6 
censored  models,  563-5 
count  models,  684-5 
definition,  322 

duration  models,  594-600,  601-2 
flexible  parametric  models,  563 
heteroskedastic  linear  model,  323,  328 
identification,  325-6 
leading  examples,  322 
multinomial  outcome  models,  523-4
panel  data  models,  808 
partially  linear  model,  324-5 
selection  models,  565-6 
single-index  models,  325-7 
see  also  semiparametric  estimators 
sequential  limits,  767 
sequential  multinomial  models,  520-1 
sequential  two-step  m-estimator,  200-2 
bootstrap  for,  362 

sequence  of  random  variables,  943,  945 
serial  correlation.  See  autocorrelation 
set  identification,  29 
series  estimator,  321 
for  binary  outcomes,  483 
shared  frailty  model,  662 
short  panel 
definition,  700 

statistical  inference  in,  705-8,  721-2,  746,  751,  768 
shrinkage  estimator,  440 
Silverman’s  plug-in  estimate,  304 
simple  random  sampling  (SRS),  41,  76-7,  816 
simple  stratified  sampling,  818 
Simpson’s  rule,  388-9 
simulated  annealing  (SA)  method,  347 
simulated  m-estimator,  398-9 
simulation-based  estimation  methods,  364-418 
motivating  examples,  385-6 
see  MSL,  MSM,  indirect  inference,  simulators 
simulators,  393-4,  406-10 
antithetic  sampling,  408-9 
direct,  393 
frequency,  406 
GHK,  407-8 

Halton  sequences,  409-10 
importance  sampling,  407 
smooth,  407 
subsimulator,  394 
unbiased,  394,  400 
see  also  quadrature 

simultaneous  equations  model  (SEM),  22-31,  213-4,
219 

causal  interpretation,  26 
error  components,  762 
extension  to  nonlinear  models,  27 
FIML  estimator,  214 


identification,  29-31,  213-4
LIML  estimator,  214 
nonlinear,  219 
order  condition,  213 
rank  condition,  214 
reduced  form,  25,  213 
single-equation  models,  31
structural  form,  25,  213 
structural  model,  24 
2SLS  estimator,  214 
3SLS  estimator,  214 

simultaneous  equations  probit,  523,  560-1 
simultaneous  equations  Tobit,  560-1 
single-index  models,  123,  323,  325-7 
definition,  123 
identification,  325 
marginal  effects,  123 
nonlinear  panel  model,  780 
semiparametric  estimators,  325-7 
SIPP.  See  Survey  of  Income  and  Program  Participation 
size  of  test,  246-7,  251-3 
nominal  size,  251
size-corrected  test,  251
true  size,  251-3 
Sklar’s  theorem,  652 
Slutsky’s  Theorem,  945-6 
alternative  version,  949 
small-sample  bias.  See  finite-sample  bias 
smooth  maximum  score  estimator,  484 
smoothing  parameters,  307 
smoothing  spline  estimator,  321 
social  experiments,  32,  48-54 
advantages,  50-2 
examples,  51,  889 
limitations,  52-4
randomization,  49-50 
span,  320 

specific  to  general  test,  285 
specification  tests,  259-78 
for  clustered  data,  840 
for  duration  models,  628-32 
for  endogeneity,  275-6 
for  exogeneity,  277 
for  heteroskedasticity,  275 
for  individual-specific  effects,  737 
for  omitted  variables,  274 
for  overdispersion,  670-1 
for  pooling,  737 

for  unobserved  heterogeneity,  628-32 
for  Tobit  model,  543-4
see  also  m-tests;  model  diagnostics 
spherical  errors,  78 
split-sample  IV  estimator,  191-2 
SRS.  See  simple  random  sampling 
SSUs.  See  secondary  sampling  units 
stable  family  of  distributions,  621 
stable  unit  treatment  value  assumption  (SUTVA),  872 
standard  errors.  See  robust  standard  errors 
starting  values,  340,  351

state  dependence.  See  true  state  dependence 

stated  preference  data,  498,  516 

stationary  population,  40 

statistical  packages,  349 

step  size  adjustment,  338 

stochastic  order  of  magnitude,  954-5 

stock  sampling,  44,  626-7 

strata,  41,  815 

see  also  sampling  schemes;  weighting 
stratification  matching,  875-6,  893-6 
stratified  random  sampling,  76-7,  814-5 
use  of  Liapounov  CLT,  951
use  of  Markov  LLN,  948 
see  also  sampling  schemes;  weighting 
strict  exogeneity.  See  strong  exogeneity 
strong  consistency,  947 
strong  exogeneity,  22 

in  panel  models,  700,  749-50,  752,  781 
structural  approach 
to  measurement  error,  901 
to  weighting,  820-1 
structural  economic  models,  28,  171 
with  selection,  558-60 
structural  form,  20,  25,  223 
structural  model,  20-31,  35-6 
based  on  economic  model,  28 
exogeneity,  22-3 
full  information,  35 
limited  information,  35 
reduced  form,  21,  25,  223 
structural  form,  20,  25,  223 
structure,  20 

see  also  simultaneous  equations  model 
structural  selection  models,  558-62 
based  on  utility  maximization,  558-60 
endogenous  regressors,  561-2 
simultaneous  equations  Tobit,  560-1 
studentized  statistic,  359 
subsampling  method,  373 
substitution  bias,  53,  867 
sufficient  statistic,  732,  782,  799,  805 
definition,  782 

summation  assumption,  748,  752 
superpopulation,  40,  816 
supersmoother,  321 

SUR  model.  See  seemingly  unrelated  regressions 
survey  methods,  41-2,  84-7,  814-8,  853-6 
survey  nonresponse,  45-6,  60,  739 
see  also  attrition  bias;  imputation  methods 
Survey  of  Income  and  Program  Participation  (SIPP), 
59 

survival  analysis.  See  duration  models 
survival  function.  See  survivor  function 
survivor  function 

aggregate  survivor  function,  619 

definition,  576-8 

estimator  in  PH  model,  596-7 


Kaplan-Meier  estimator,  581-2,  604-5 
in  mixture  models,  615-6 
multivariate,  649-50 
parametric  examples,  585 

SUTVA.  See  stable  unit  treatment  value  assumption 
switching  regressions  model.  See  Roy  model 
symmetrically  censored  least  squares  (SCLS) 
estimator,  565 

synthetic  panels.  See  pseudo  panels 
systems  of  equations,  206-19 
linear  systems,  206-14 
nonlinear  systems,  214-9 
seemingly  unrelated  regression,  209-10,  216 
simultaneous  equations  model,  22-31,  213-4,  219
systems-robust  standard  errors,  208-9,  212,  219 

target  density,  444 

tests.  See  hypothesis  tests,  m-tests,  specification  tests 
three-stage  least  squares  (3SLS)  estimator,  214 
3SLS  estimator.  See  three-stage  least  squares 
time  series  data 
bootstrap,  381 
NLS  estimator,  158-9 
Newey-West  standard  errors,  137,  175,  727 
time-varying  regressors 
in  duration  models,  597-9 
in  panel  data  models,  702,  749-51 
Tobit  model,  536-44
Bayesian  methods,  563 
censored  mean,  538-41 
censoring  mechanism,  532,  579 
consistency  of  MLE,  538 
definition,  536 
example,  530-1 
generalized,  548 

Heckman  two-step  estimator,  543,  567-8 

identification,  536 

as  imputation  method,  932 

inverse-Mills  ratio,  540-1 

marginal  effects,  541-2 

measurement  error  in  dependent  variable,  914 

ML  estimator,  537-8 

NLS  estimator,  542 

OLS  estimator,  543 

panel  data,  800-1 

simultaneous  equations,  560-1 

specification  tests,  543-4 

with  stochastic  thresholds,  547 

with  truncated  data,  538 

truncated  mean,  538-41,  566-7 

two-limit,  536 

type  2,  547 

type  5,  557 

see  also  selection  models 
top-coded  data,  532-3,  541,  563 
transformation  methods,  413 
transformation  theorem,  949 
transformed  ML  estimator,  766 
transition  data.  See  duration  models
trapezoidal  rule,  388 
treatment-control  comparison 
application,  890-1 

treatment  effects  framework,  862-5,  871-8,  889-96 
balancing  condition,  864,  893-4
binary  treatment  variable,  862 
conditional  independence  assumption,  863,  865 
conditional  mean  independence  assumption,  864 
heterogeneous  treatment  effects,  882,  885 
multiple  treatments,  860 
overlap  assumption,  864,  871
propensity  score,  864-5 
Roy  model,  867 

stable  unit  treatment  value  assumption,  872 
see  also  treatment  evaluation 
treatment  evaluation,  860-98 
application,  889-96 
IV  estimators,  883-9 
matching  estimators,  871-8 
DID  estimators,  878-9 
selection  bias,  865-71 

selection  on  observables,  862-4,  878-83,  889-96
selection  on  unobservables,  865-71,  883-9 
regression  discontinuity  design,  879-83 
see  also  treatment  effects  framework 
treatment  group,  49,  862 
trimming,  316,  333 
trivariate  reduction,  686
true  state  dependence 

duration  models,  612,  630,  636 
dynamic  panel  models,  763-4,  798,  802 
see  also  unobserved  heterogeneity 
truncated  models,  530-44 
conditional  mean,  535 
count  models,  679-80 
definition,  532 
examples,  530-1,  535 
ML  estimator,  534 

see  also  Tobit  model;  selection  models 
truncated  moments  of  standard  normal,  540,  566-7 
truncation  mechanisms,  532 
truncation  from  above,  532 
truncation  from  below,  532 
2SLS  estimator.  See  two-stage  least  squares 
two-limit  Tobit  model,  536 
two-part  model,  544-6 
application,  553-5 

compared  to  selection  models,  546,  552-3 
definition,  545 
example,  545-6 
see  also  hurdle  model 
two-stage  IV  estimator,  187 
two-stage  least  squares  (2SLS)  estimator,  101-2, 
187-91 

alternatives  to,  190-2 
Basmann’s  approach,  190-1 
compared  to  optimal  GMM,  187-8 


as  GLS  in  transformed  model,  188-9 
as  GMM  estimator,  187 
nonlinear,  195-6,  199 
panel  data,  746,  755 
in  SEM,  214 

Theil's  interpretation,  189-90
two-stage  sampling,  41,  818 
two-step  estimators 
GMM,  176,  187 

Heckman,  543,  550-1,  556,  567-8 
sequential  m-estimator,  200-2 
two-step  GMM  estimator,  176,  187 
panel,  746,  755 
two-way  effects  model,  738 
type  I  error,  246-7 
type  II  error,  246-7 

type  1  extreme  value  distribution,  477,  486-7 
duration  model  error,  590 
multinomial  logit  model,  505 
type  2  Tobit.  See  bivariate  sample  selection  model 
type  5  Tobit.  See  Roy  model 

ultimate  sampling  units  (USUs),  41,  815 
unbalanced  panels,  739 

uncentered  explained  sum  of  squares  (ESS),  241 
uncentered  R-squared,  241,  263 
unconfoundedness  assumption.  See  conditional 
independence  assumption 
underrecording,  915 
undersmoothing,  305,  333,  380 
uniform  convergence  in  probability,  126,  301 
uniform  number  generators,  412 
uniformly  most  powerful  (UMP)  test,  247 
unit  roots,  382,  767-8 
universal  logit  model,  500 
unobserved  heterogeneity 
application,  632-6 
in  competing  risks  model,  647 
in  count  models,  675-7,  686 
distributions  for,  614-5,  620-1 
in  duration  models,  611-25 
finite  mixture  models  for,  621-5 
identification,  618-20 
IM  test  for,  267 

individual-specific  effects,  700,  764 
mixture  models  for,  613-21 
MSL  example,  397-8 
MSM  example,  403 
multiplicative,  613,  686 
in  nonlinear  systems,  215 
specification  tests  for,  629-32 
variance  inflation,  614 

versus  true  state  dependence,  612,  630,  636,  763-4,
798,  802 

USUs.  See  ultimate  sampling  units 

validation  sample,  911 
variance  components,  735,  845 
variance  matrix  estimation
BHHH  estimate,  138 

degrees-of-freedom  adjustment,  75,  102,  138, 
185-6,  278,  841 
expected  Hessian  estimate,  138 
for  extremum  estimator,  137-9 
for  GMM  estimator,  174-5 
Hessian  estimate,  138 
for  NLS  estimator,  154-5
OPG  estimate,  138 
robust  estimate,  137 
sandwich  estimate,  137,  144 
for  weighted  estimators,  854-6 
see  also  robust  standard  errors 
variance  reduction  for  simulation,  478 

Wald  estimator 

in  treatment  effects  models,  886 
Wald  test,  136-7,  224-33 

asymptotic  distribution,  226-8 
comparison  with  LM  and  LR  tests,  238-9
definition,  136 
examples,  236,  241-3 
exclusion  restrictions,  227 
F-test  version,  226 
introduction,  136-7 
lack  of  invariance,  232-3 
likelihood  based,  234,  241-3 
linear  models,  224-5 
linear  restrictions,  136-7 
in  misspecified  models,  229-30 
nonlinear  restrictions,  224,  229 
power,  248-50 
of  statistical  significance,  228 
t-test  version,  226-8 
see  also  hypothesis  tests 
weak  consistency,  947 
weak  exogeneity,  22 

in  panel  data,  749,  752,  758 
weak  instruments,  100,  104-12 
application,  110-2 
definition,  104 

finite  sample  bias,  108-12,  177-8,  191-2,  196 
GMM  estimator,  177-8 
inconsistency,  105-7 
indicators,  104-5,  756
panel  data,  751-2,  756 
Weibull  distribution,  584-6 
Weibull-gamma  regression  model,  615 
Weibull  regression  model,  143-4,  589,  606-8,  635
weighted  estimation 

endogenous  stratification,  828-9 
exogenous  stratification,  818-20 


weighted  exogenous  sampling  ML  (WESML) 
estimator,  828 

weighted  least  squares  (WLS)  estimator,  81-5 
asymptotic  distribution,  83 
contrasted  with  GLS,  83 
definition,  83 
example,  84-5 
in  pooled  model,  702-3,  721 
see  also  FGLS  estimator 
weighted  maximum  likelihood  (WML)  estimator,  828

weighted  semiparametric  least  squares  (WSLS)
estimator,  327 

for  binary  outcome  models,  485 
weighting,  817-21,  827-9,  853-6 

descriptive  versus  structural  approach,  820 
with  endogenous  stratification,  827-9 
sample  weights,  817-8 
variance  estimation,  853-6 
weighted  prediction,  821 
weighted  regression,  818-20 
whether  to  weight,  820-1 
welfare  analysis 
with  ARUM,  506-7 
with  nested  logit  model,  512 
WESML  estimator.  See  weighted  exogenous  sampling 
ML 

White  standard  errors.  See  robust  standard  errors 
wild  bootstrap,  377-8 
window  width,  299,  307,  312 
Wishart  distribution,  443 

see  also  inverse-Wishart  distribution
within  estimator.  See  fixed  effects  estimator 
within  model.  See  fixed  effects  model 
within-group  variation,  709,  733 
with-zeros  model,  681 
WLS  estimator.  See  weighted  least  squares 
WML  estimator.  See  weighted  maximum  likelihood 
WNLS  estimator,  156-7 
asymptotic  distribution,  156 
definition,  156 
example,  159-63 
as  GLM,  158 
working  matrix 
definition,  82 
for  GLM  estimator,  158 
for  pooled  GEE  estimator,  794 
for  pooled  WLS  estimator,  721 
for  WLS  estimator,  82-3 

WSLS  estimator.  See  weighted  semiparametric  least 
squares 

zero-inflated  count  model,  680-1 